
32 vs 1024 Bit Processor


Michael Bringle

Jan 30, 1996
I was wondering, barring compatibility, what are the manufacturing or
performance restraints that would keep us from jumping, say, to a
1024-bit processor? Or more generally, why don't we just start using
very wide data pathways to speed up our components?

Michael Bringle

Mike Schmit

Jan 31, 1996
In <310ead63...@newsstand.cit.cornell.edu> mp...@cornell.edu wrote:

The first problem is that you would need a very wide data path... i.e.
1024 wires. To get all the performance out of it, you would need 1024
pins just for data. Current chips have 200 - 300 pins. This is just a
mechanical/manufacturing problem in one sense. As soon as you multiplex
the data lines or have a smaller path from L1 cache to memory, etc. you
are giving up some performance.

Another problem is that not all applications scale usefully with
larger ALU sizes, thus you would be wasting lots of the transistors a
large percentage of the time.

I'm sure there are other problems that make such a CPU not economical
for general purpose computing. I'm sure that someone has their special
application that this would be perfect for, though.

Mike Schmit

-------------------------------------------------------------------
msc...@ix.netcom.com author:
408-244-6826 Pentium Processor Programming Tools
800-765-8086 ISBN: 0-12-627230-1
-------------------------------------------------------------------


Ethan L. Miller

Jan 31, 1996
>>>>> "Michael" == Michael Bringle <mp...@cornell.edu> writes:

Michael> I was wondering, barring compatability, what are the
Michael> manafacturing or performance restraints that would keep us
Michael> from jumping, say to a 1024 bit processor? Or more
Michael> gernerally, why don't we just start using very wide data
Michael> pathways to speed up our component?

There are several problems. First, 1024 bit addresses and integers
take up lots more space in both memory and CPU. It's easy to put 2048
bits of registers on a chip - that's enough for 32 64-bit registers or
two 1024-bit registers. DRAM usage also goes up, since all pointers
take up more space.

Beyond those issues, there's also the problem of fitting the ALU on
the chip. It's not all that hard to design a 1024-bit ALU (well, the
carry lookahead logic could be difficult...), but it's hard to fit it
onto a die we can build today. We're also pin-limited - modern chips
have a few hundred pins, not the few thousand you'd need for a 1024
bit processor.

I'm sure it'll be possible eventually, but it's not feasible right
now.

ethan

--
( Prof. Ethan L. Miller voice: +1 410 455-3972 fax: 455-3969 )
( U of Maryland Baltimore County email: e...@cs.umbc.edu )
( CSEE Dept, 5401 Wilkens Ave URL: http://www.cs.umbc.edu/~elm/ )
( Baltimore, MD 21228-5398 USA These opinions are mine, all mine! )
( PGP key fingerprint: 86 45 87 10 D7 18 35 6A 3E ED 0F B7 99 53 4E 4A )

Stefan Monnier

Jan 31, 1996
In article <310ead63...@newsstand.cit.cornell.edu>,
Michael Bringle <mp...@cornell.edu> wrote:
] I was wondering, barring compatability, what are the manafacturing or
] performance restraints that would keep us from jumping, say to a 1024
] bit processor? Or more gernerally, why don't we just start using very
] wide data pathways to speed up our component?

I don't think the problem is merely technological.
It's much more likely to be:

what would you do with those 1024bits ?
what's the point ?

if you can find a way to take advantage of a 1024bits datapath in more than a
few special cases (like cryptography), maybe you could convince people to start
thinking about it.


Stefan

John R. Mashey

Jan 31, 1996
In article <4eo5d5$g...@info.epfl.ch>, "Stefan Monnier" <stefan....@lia.di.epfl.ch> writes:

|> I don't think the problem is merely technological.
|> It's much more likely to be:
|>
|> what would you do with those 1024bits ?
|> what's the point ?
|>
|> if you can find a way to take advantage of a 1024bits datapath in more than a
|> few special cases (like cryptography), maybe you could convince people to start
|> thinking about it.

Note: I think an N-bit CPU has N-bit wide integer registers and datapath.

1) In an R10000, the 64-bit-wide integer datapath is about 20% of the width.
1024 is 16X larger. To get a 1024-bit datapath to be the same fraction of
the width of a same size chip, you only need about 10 shrinks, i.e. 10 chip generations; assuming 3 years apiece, that's 30 years before you'd even want
to think about this. If you were willing for it to be a larger fraction of
a chip, you might save 2 generations.
(You can jiggle the numbers, but that's the idea.)

2) Of course, since wires often don't shrink as fast as transistors, there is
the "minor" issue of running a lot of 1024-bit wide busses around the chip.

3) Besides the space issue, consider that designers are fighting hard to
reduce the delays caused by long wires on chips ... and are not likely to
be thrilled by needing to do 1024-bit-wide adders and shifters.

4) And as Stefan notes, you need a *good reason* to even think about
it:

I've lost the posting, but we went through this last year when discussing
when *128*-bit could come in. Since we're right on the 32/64-bit boundary,
and 64-bit addressing has added 32 more bits, and you can argue that we
consume 2 bits every 3 years (to track DRAM), that's 3*32/2, or 48 years.
For various reasons, I've predicted that somebody would do it earlier,
maybe around 2020 or 2030, assuming current growth rates.

Put another way, about the time you *might* consider 1024 to be possible,
you'll be *thinking* about doing 128.

BOTTOM LINE:
a) 32-bit CPUs are already insufficient for some uses.
b) 64-bit CPUs are likely to be sufficient for a *long time*;
Note that there are already <$35 64-bit micros available, so this
is not exotica.

--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: ma...@sgi.com
DDD: 415-933-3090 FAX: 415-967-8496
USPS: Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311

R. Edward Nather

Feb 1, 1996
ma...@mash.engr.sgi.com (John R. Mashey) wrote:
>In article <4eo5d5$g...@info.epfl.ch>, "Stefan Monnier" <stefan....@lia.di.epfl.ch> writes:
>
>|> I don't think the problem is merely technological.
>|> It's much more likely to be:
>|>
>|> what would you do with those 1024bits ?
>|> what's the point ?
>|>
>|> if you can find a way to take advantage of a 1024bits datapath in more than a
>|> few special cases (like cryptography), maybe you could convince people to start
>|> thinking about it.
>
>Note: I think an N-bit CPU has N-bit wide integer registers and datapath.
>
>
>4) And as Stefan notes, you need a *good reason* to even think about
>it:
>

I wonder how much stuff that now requires floating point operations
could be done using 1024-bit integers? If you could do most of it
that way, you could save chip real estate and execution time, it seems
to me. Floating point was, after all, an expedient devised to overcome
the limitations of short register operations, and still has the disadvantage
that "It's like moving piles of sand around: every time you do, you lose
a little sand, and pick up a little dirt."

ed


Kevin D. Kissell

Feb 1, 1996 (to kev...@neu.sgi.com)
Stefan Monnier wrote:
>
> In article <310ead63...@newsstand.cit.cornell.edu>,
> Michael Bringle <mp...@cornell.edu> wrote:
> ] I was wondering, barring compatability, what are the manafacturing or
> ] performance restraints that would keep us from jumping, say to a 1024
> ] bit processor? Or more gernerally, why don't we just start using very
> ] wide data pathways to speed up our component?
>
> I don't think the problem is merely technological.
> It's much more likely to be:
>
> what would you do with those 1024bits ?
> what's the point ?
>
> if you can find a way to take advantage of a 1024bits datapath in more
> than a few special cases (like cryptography), maybe you could convince
> people to start thinking about it.

Hey, I *like* thinking about "obviously" silly ideas. You never know
where they might lead.

I'm pretty sure I read some articles 3-4 years ago about a machine built at
a university in the UK which had something on the order of a 1024-bit ALU,
and yes, it was principally interesting for crypto factoring problems, on
which it was remarkably fast, even given its very slow clock rate. Does
anyone remember the name of the machine or its designers?

Several other respondents to this thread have pointed out that 1024 bits
of addressing are excessive, and that 1024-bit data paths are presently
very difficult to do on-chip and impossible to do off-chip. However, I
reflect that a classic Cray vector register *could* be thought of as a
1024-bit register. Already, with 32 or 64-bit registers, microprocessor
designers are looking at ways, sometimes called "visual" or "graphics"
instructions, to perform vector-like operations on aggregates of 8 or 16-bit
data. If we can map such hardware into useful languages - and it should be
possible, given that vectorisation algorithms already exist - we might see
a trend toward larger registers on the 1024-bit order, feeding configurable
ALUs that can perform some number of 8/16/32/64/128(?)-bit operations in
parallel. As the efficiency of such a scheme would be a function of the
regularity of the structure of the computation to be performed, and as
numerical codes show the most regularity, it would be more valuable to
support ganged floating-point operations than ganged integer operations.

I suppose that the principal difference between what I describe and a Cray
is that, in the Cray designs, the vector registers were really piped to
functional units one element at a time, whereas in a neo-vector machine,
one would be processing multiple elements in parallel on the same clock.
The data memory bandwidth requirements would be correspondingly higher.
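The sub-word parallelism Kevin describes can be sketched without any special hardware at all. Below is an illustrative Python model (a hypothetical helper, not any shipping instruction set) of eight independent 8-bit adds packed into one 64-bit word, using the masking that keeps carries from crossing lane boundaries:

```python
def swar_add8(a, b):
    """Eight independent 8-bit adds inside one 64-bit word (SWAR).

    Carries are kept from rippling across lane boundaries by adding
    the low 7 bits of each byte first, then folding each lane's top
    bit back in with XOR.
    """
    H = 0x8080808080808080   # top bit of every byte lane
    L = 0x7F7F7F7F7F7F7F7F   # low 7 bits of every byte lane
    s = (a & L) + (b & L)    # lane-local adds, no cross-lane carry
    return s ^ ((a ^ b) & H)  # restore the lanes' top bits

# 0xFF + 0x01 wraps to 0x00 in its own lane; 0x01 + 0x01 gives 0x02
assert swar_add8(0x01FF, 0x0101) == 0x0200
```

In hardware the same effect is simpler still: one wide adder with carry-break gates at the lane boundaries, which is essentially what the "visual"/"graphics" instruction sets mentioned above do.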

--
Opinions expressed may not be Kevin D. Kissell
those of the author, let alone Silicon Graphics Core Technology Group
those of Silicon Graphics. Cortaillod, Switzerland

kev...@neu.sgi.com

David Shepherd

Feb 2, 1996
R. Edward Nather (nat...@astro.as.utexas.edu) wrote:
: I wonder how much stuff that now requires floating point operations

: could be done using 1024-bit integers? If you could do most of it
: that way, you could save chip real estate and execution time, it seems
: to me. Floating point was, after all, an expedient devised to overcome
: the limitations of short register operations, and still has the disadvantage
: that "It's like moving piles of sand around: every time you do, you lose
: a little sand, and pick up a little dirt."

It's been done before - while not quite with 1024-bit registers,
there have been systems that used huge "super-accumulators" that
allowed lower-precision subresults to be accumulated in a very
high-precision super-accumulator. I'm not entirely sure how much of
this was hardware-supported.

--
--------------------------------------------------------------------------
david shepherd
SGS-THOMSON Microelectronics Ltd, 1000 aztec west, bristol bs12 4sq, u.k.
tel/fax: +44 1454 611638/617910 email: d...@bristol.st.com
"whatever you don't want, you don't want negative advertising"


Alberto C Moreira

Feb 2, 1996
msc...@ix.netcom.com(Mike Schmit ) wrote:

>>I was wondering, barring compatability, what are the manafacturing or
>>performance restraints that would keep us from jumping, say to a 1024
>>bit processor? Or more gernerally, why don't we just start using very
>>wide data pathways to speed up our component?

>The first problem is that you would need a very wide data path... i.e.
>1024 wires. To get all the performance out of it, you would need 1024
>pins just for data. Current chips have 200 - 300 pins. This is just a
>mechanical/manufacturing problem in one sense. As soon as you multiplex
>the data lines or have a smaller path from L1 cache to memory, etc. you
>are giving up some performance.


Assuming packaging issues can be negotiated, a 1024-bit CPU would be a
very attractive component of a parallel machine, say, a hypercube.
Forget slow memory, put in as much static RAM as possible, and implement
a few parallel operations in hardware, for example Guy Blelloch's
scans. Even better if the processor chip could include a
reconfiguration coprocessor a la Maresca and Li. Such a machine would
run circles around today's single-CPU systems.


>Another problem is that not all applications scale, usefully with
>larger ALU sizes, thus you would be wasting lots of the transistors a large
>percentage of the time.


Parallel programming has evolved a lot in recent years. We now have
lots of very fast algorithms to do all sorts of things; you can start
with JaJa or Leighton, and go from there to more modern papers.


>I'm sure there are other problems that make such a CPU not economical
>for general purpose computing. I'm sure that someone has their special
>application that this would be perfect for, though.


The limitation as of today isn't scientific, but technological. My
company's video chip, for example, has a 128-bit bus to its VRAMs.
Somebody else is talking about a 192-bit bus. As technology advances,
packaging will necessarily follow the greater miniaturization trends.
A few years ago an 80-pin chip was a tour-de-force; today a 304-pin
chip is commonplace, and 500-or-so isn't far away. As technology
evolves, computing is bound to move towards greater parallelism at
chip level, and slowly evolve out of one-cpu systems.


_alberto_


Michel Hack

Feb 2, 1996
The IBM S/370 supports an optional 'high accuracy arithmetic' feature (it is
actually standard on the 9377). This provides a set of machine instructions
which operate on 256-byte memory-mapped accumulators and regular floating-point
registers. An accumulator contains some flag bits, and a 1312-bit unsigned
binary accumulator representing the absolute value in units of 2**-748; the
largest representable value is thus 2**564-2**-748. This is sufficient to
accumulate squares of the largest IBM-format floaters, or square-roots of the
smallest unnormalised extended-precision floater (2**-319), with room to spare.

This feature permits *exact* floating-point computations in the absence of
divisions and square roots, and absolute (not relative) error bounds on other
operations.

I don't know of any applications that use the feature; you'd have to find out
from somebody else. (My interest lies in machine architecture.)

Michel.

Krste Asanovic

Feb 2, 1996
In article <31111D...@neu.sgi.com>, "Kevin D. Kissell" <kev...@neu.sgi.com> writes:
|> Several other respondents to this thread have pointed out that 1024 bits
|> of addressing are excessive, and that 1024-bit data paths are presently
|> very difficult to do on-chip and impossible to do off-chip. However, I
|> reflect that a classic Cray vector register *could* be thought of as a
|> 1024-bit register. Already, with 32 or 64-bit registers, microprocessor
^^^^^^^^
Actually, 64 elements of 64b is 4096 bits. On the C90, max vector
length went up to 128 elements, so 8192 bits total.

|> Several other respondents to this thread have pointed out that 1024 bits
|> of addressing are excessive, and that 1024-bit data paths are presently
|> very difficult to do on-chip and impossible to do off-chip. However, I

Internal DRAM buses (bitlines->sense amps) are on this order, or even
larger. Lots of potential bandwidth lurking about inside.

|> reflect that a classic Cray vector register *could* be thought of as a
|> 1024-bit register. Already, with 32 or 64-bit registers, microprocessor
|> designers are looking at ways, sometimes called "visual" or "graphics"
|> instructions, to perform vector-like operations on agregates of 8 or 16-bit
|> data. If we can map such hardware into useful languages - and it should be
|> possible, given that vectorisation algorithms already exist - we might see
|> a trend toward larger registers on the 1024-bit order, feeding configurable
|> ALUs that can perform some number of 8/16/32/64/128(?)-bit operations in
|> parallel. As the efficiency of such a scheme would be a function of the

The media processor from Microunity already does this up to 128b,
including 1/2/4b ops too. See URL: http://www.microunity.com/

|> regularity of the structure of the computation to be performed, and as
|> numerical codes show the most regularity, it would be more valuable to
|> support ganged floating-point operations than ganged integer operations.

Where there's a lot other than numerical codes which show regularity.
Graphics perhaps? :-)

|> I suppose that the principal difference between what I describe and a Cray
|> is that, in the Cray designs, the vector registers were really piped to
|> functional units one element at a time, whereas in a neo-vector machine,
|> one would be processing multiple elements in parallel on the same clock.

No difference really, except for the instruction issue rate you need
to saturate your machine when operating on long vectors. In
instructions per clock cycle it is:

N_ipc = F * W / L

where F is the number of parallel functional units (say mul, add,
memory pipe), W is the width of each functional unit (say 2-8
parallel pipelines per functional unit), and L is the maximum vector
length.

For example, a Cray C90 has F = ~11 (not sure on exact number here,
but I'd imagine it's rare to use more than 5-6 vector functional units
simultaneously (2 load, 1 store, 1 mul, 1 add) on a C90), W = 2, L =
128 so

N_ipc = 11 * 2 / 128 = 0.17

T0 (http://www.icsi.berkeley.edu/real/spert/t0-intro.html) has F = 3,
W = 8, L = 32, giving a minimum instruction issue rate of

N_ipc = 3 * 8 / 32 = 0.75.

So, roughly 3/4 of a single issue machine to keep all three pipes
busy. T0 has a 256b internal datapath split into 8x32b slices (and
about 5-6x that number of bus wires running through the datapath).

As the W/L increases, so does the instruction issue bandwidth to keep
your vector machine fed.

|> The data memory bandwidth requirements would be correspondingly higher.

Well, yes, but only because you're actually doing more work!
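The issue-rate arithmetic above is easy to play with numerically; a small sketch (the F=11 figure for the C90 is the rough estimate from the post):

```python
def n_ipc(f_units, width, max_vlen):
    """Instruction issue rate (instructions/clock) needed to keep
    f_units vector functional units, each `width` lanes wide, busy
    at maximum vector length max_vlen: N_ipc = F * W / L."""
    return f_units * width / max_vlen

print(n_ipc(11, 2, 128))  # Cray C90 figures from the post: ~0.17
print(n_ipc(3, 8, 32))    # T0: 0.75
```

As the post notes, widening W or shrinking L pushes N_ipc up, so a "neo-vector" machine with very wide lanes needs correspondingly more instruction issue bandwidth.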

--
Krste Asanovic phone: +1 (510) 642-4274 x143
International Computer Science Institute fax: +1 (510) 643-7684
Suite 600, 1947 Center Street email: kr...@icsi.berkeley.edu
Berkeley, CA 94704-1198, USA http://www.icsi.berkeley.edu/~krste

Brian V. McGroarty

Feb 3, 1996
My initial reaction is that almost anything which would benefit from a 1024
bit data architecture would benefit similarly from a more versatile SIMD
architecture.

>>|> what would you do with those 1024bits ?
>>|> what's the point ?
>>|>
>>|> if you can find a way to take advantage of a 1024bits datapath in more
>than a
>>|> few special cases (like cryptography), maybe you could convince people
>to start
>>|> thinking about it.


---
Brian Valters McGroarty

tlee

Feb 5, 1996
In article <4eop6d$f...@murrow.corp.sgi.com>, ma...@mash.engr.sgi.com (John
R. Mashey) wrote:

> In article <4eo5d5$g...@info.epfl.ch>, "Stefan Monnier"
> <stefan....@lia.di.epfl.ch> writes:
>
> |> I don't think the problem is merely technological.
> |> It's much more likely to be:
> |>
> |> what would you do with those 1024bits ?
> |> what's the point ?
> |>
> |> if you can find a way to take advantage of a 1024bits datapath in more
> |> than a few special cases (like cryptography), maybe you could convince
> |> people to start thinking about it.
>

> BOTTOM LINE:
> a) 32-bit CPUs are already insufficient for some uses.
> b) 64-bit CPUs are likely to be sufficient for a *long time*;
> Note that there are already <$35 64-bit micros available, so this
> is not exotica.
>


There are a couple of apps that could use a 1024-bit ALU or FPU today.

1) Multimedia - The higher the data bandwidth, the better.
A 1024-bit ALU can help encode/decode an image.

2) 3D graphics
a) A point in 3D space is 3 * 64 bits.
b) A matrix to operate on the points is 4x4x64 = 1024 bits - this
fits well.

If I were designing the multimedia co-processor, I would use a 1024-bit
ALU, FPU, and internal datapath.

For the external datapath, I would use 64-bit SDRAM or Burst Mode EDO DRAM
to get the bandwidth for the system.

As for the instruction set, it would be multimedia or 3D instructions or
microcode optimized for 1024-bit datapath operation.

If you're interested in designing something like this as a garage project,
email me. We can get together and write a proposal to Xilinx to get some
funding for the hardware. Here's how I would proceed with the
project:

The software architect (me?):
1) Start getting JPEG, MPEG, 3D algorithms/source code.
I have them already.
2) Compile it.
3) Profile the code and do performance analysis on it.

The hardware architect (you?):
1) Write the proposal to Xilinx to get the funding from their
re-configurable computing program.

2) Build one or more PCI-based Xilinx boards stuffed with
a bunch of high-I/O, high-CLB chips - XC4013 or better.
We would probably put some FP co-processors in the system
because the Xilinx parts are not good for FP operation.

3) (Both of us?) Design the system/environment for rapid
prototyping of the instruction sets and system for the multimedia
co-processor or a multimedia co-processor system.


At this point in the project, we will get together to analyze the
cost/performance of the system and see if it can form a technology base
for a viable business enterprise.

If yes, we can try to get some private or venture funding for
the venture and proceed with building the company.

If not, we can have a beer, pat each other on the back for a job
well done, put the experience in a report and attach it to our resumes.
We can also put an outline of the design on a web page or write an article
for EE Times, EDN or Computer Design and see if any big company wants
to buy the whole design and put it in their system.

Send me an email if you want to work with me on this.

-Tony Lee

John R. Mashey

Feb 8, 1996
In article <leetn-05029...@192.43.251.144>, le...@ccmail.apldbio.com (tlee) writes:
The following seems to use terms in a fundamentally different way than do
most people involved with this stuff:
I believe that most people would think a 1024-bit ALU can add/shift
two 1024-bit integers together. I believe that most people would
think that a 1024-bit FPU would operate on two 1024-bit FP operands.

For multimedia things, it is not uncommon for an N*k-bit ALU to also act
like k N-bit ALUs, operating in parallel ... but if something is
truly implemented as k separate N-bit ALUs, and cannot do N-bit arithmetic,
hardly anyone would call this a k*N-bit CPU [except for random marketing silliness.]

Like many other chips, MIPS R8000s and R10000s have 2 integer ALUs,
in this case 64 bits wide. Nobody calls them 128-bit processors...

|> There are couple of apps can use 1024 bits ALU or FP today.
|>
|> 1) Multimedia - The higher the data bandwidth, the better.
|> 1024 ALU can help on encode/decode an image.

Having a 1024-bit ALU isn't going to help bandwidth any.
Having 16 64-bit ones in parallel might :-)

|> 2) 3D graphics
|> a) A point in 3D space is 3 * 64 Bits.
|> b) A matrix to operated on the points is 4x4x64 = 1024 bits - This
|> fits well.

In the first place, most 3D graphics codes use 32-bit FP.

In the second place, people use one or more geometry engines (i.e. special
FPUs) in parallel ... but they are *not* 1024-bit FPUs.

Dee Cee

Oct 20, 2022
John R. Mashey schrieb am Donnerstag, 8. Februar 1996 um 09:00:00 UTC+1:
Hi, imagine how far we could zoom into a mandelbrot with 1024 bit processors!!

MitchAlsup

Oct 20, 2022
How are you going to feed this thing ?
You would need 16 DRAM DIMMs per channel--1024 bit DRAM data bus.

Scott Lurndal

Oct 20, 2022
1024 bits is four L1 cache (if 32-byte) lines or two L1 cache (if 64-byte) lines,
or a single L1 cache line for 128-byte cache lines (e.g. certain
modern DPUs). Inclusive L2/LLC is fed by a pair of DDR5 controllers
each with its own DIMM (add more controllers if the ISA supports
prefetch hints). Or HBM on die.

Quadibloc

Oct 20, 2022
On Wednesday, January 31, 1996 at 1:00:00 AM UTC-7, Mike Schmit wrote:
> Current chips have 200 - 300 pins.

Huh? I thought that these days, microprocessors have from 1000 to 3000 pins.

John Savard

Scott Lurndal

Oct 20, 2022
John, please read the posting date _before_ you reply. In 1996, that
wasn't a bad estimate.

MitchAlsup

Oct 20, 2022
On Thursday, October 20, 2022 at 5:21:27 PM UTC-5, Quadibloc wrote:
> On Wednesday, January 31, 1996 at 1:00:00 AM UTC-7, Mike Schmit wrote:
> > Current chips have 200 - 300 pins.
<
I had 200 pins in 1985...........~1300 in 2005..........
<
> Huh? I thought that these days, microprocessors have from 1000 to 3000 pins.
<
And somewhere between 1/3rd and 1/2 of the pins are Pwr and Gnd.
>
> John Savard

MitchAlsup

Oct 20, 2022
If you are chewing through data at 1024 bits per cycle (compared to
64 bits per cycle) your cache will need to be 16× wider (and bigger)
to achieve similar cache hit rates (and latency).
<
So, imagine a loop consuming 1024 bits per cycle and your typical 64KB
data cache. How many cycles does it take to wipe* the cache clean ?
{answer 512}
How many cycles does it take to wipe a 64KB cache at 64 bits per cycle?
{answer 8192}
So, you need more external BW to keep the processor fed -- or you should
not be building the system that big.
<
A.K.A. stripmine
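The cycle counts above are just cache size over datapath width; a one-line sketch:

```python
def cycles_to_sweep(cache_bytes, path_bits):
    """Cycles for a streaming loop consuming path_bits of fresh data
    every cycle to pass once through a cache of cache_bytes."""
    return cache_bytes // (path_bits // 8)

# the 64KB figures from the post
assert cycles_to_sweep(64 * 1024, 1024) == 512
assert cycles_to_sweep(64 * 1024, 64) == 8192
```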

Scott Lurndal

Oct 20, 2022
MitchAlsup <Mitch...@aol.com> writes:
>On Thursday, October 20, 2022 at 5:20:54 PM UTC-5, Scott Lurndal wrote:
>> MitchAlsup <Mitch...@aol.com> writes:
>> >How are you going to feed this thing ?
>> >You would need 16 DRAM DIMMs per channel--1024 bit DRAM data bus.
>> 1024 bits is four L1 cache (if 32-byte) lines or two L1 cache (if 64-byte) lines,
>> or a single L1 cache line for 128-byte cache lines (e.g. certain
>> modern DPUs). Inclusive L2/LLC is fed by a pair of DDR5 controllers
>> each with its own DIMM (add more controllers if the ISA supports
>> prefetch hints). Or HBM on die.
><
>If you are chewing through data at 1024 bits per cycle (compared to
>64 bits per cycle) your cache will need to be 16× wider (and bigger)
>to achieve similar cache hit rates (and latency).
><
>So, imagine a loop consuming 1024 bits per cycle and your typical 64KB
>data cache. How many cycles does it take to wipe the cache clean ?
>{answer 512}
>How many cycles does it take to wipe a 64KB cache at 64 bits per cycle?
>{answer 8192}
>So, you need more external BW to keep the processor fed -- or you should
>not be building the system that big.

That's why one has multiple DDR5 controllers, to
feed the L3/LLC. Chips that need that bandwidth are designed
with large internal busses (512 bits wide in many cases) to get
data (in a DPU, packet data) from DRAM, PCIe or 400Gb ethernet
to the processors, often with 'allocate' hints that allocate
directly into LLC. HBM can be built with a 1024-bit bus.

"An HBM stack of four DRAM dies (4-Hi) has two 128-bit channels
per die for a total of 8 channels and a width of 1024 bits in total"

Thomas Koenig

Oct 21, 2022
Quadibloc <jsa...@ecn.ab.ca> schrieb:
> On Wednesday, January 31, 1996 at 1:00:00 AM UTC-7, Mike Schmit wrote:
^^^^^^^^^^^^^^^^
>> Current chips have 200 - 300 pins.
>
> Huh? I thought that these days, microprocessors have from 1000 to 3000 pins.

That post wasn't written "these days", it's a resurrected necro-thread.

Terje Mathisen

Oct 21, 2022
People have already done that; I believe they needed 10+K bits of precision...

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

luke.l...@gmail.com

Oct 21, 2022
On Wednesday, January 31, 1996 at 8:00:00 AM UTC, Stefan Monnier wrote:

> It's much more likely to be:
> what would you do with those 1024bits ?
> what's the point ?
> if you can find a way to take advantage of a 1024bits datapath in more than a
> few special cases (like cryptography), maybe you could convince people to start
> thinking about it.

cool! a conversation that picks up again after 26 years!

interestingly i just recently got the (deep breath)
bitmanip-chain-using-one-register-as-a-64-bit-carry
instructions working in SVP64. bear in mind below
is MSB0 numbering:
https://libre-soc.org/openpower/sv/biginteger/

prod[0:127] = (RA) * (RB)
sum[0:127] = EXTZ(RC) + prod
RT <- sum[64:127]
RC <- sum[0:63]

repeated applications of that, when vectorised on
RA, result in the *expression* - at the ISA phase - of
arbitrary-length big-integer mathematics. this works
by allowing the hi-half of RC to be added in on the next
use of the 128/64 mul-add instruction.

a back-end ALU of any width (even 1024-bit) may then
be targeted by noting that the Vector Length has been
set to 16 (64x16=1024).
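A Python model of that chain may make the semantics above concrete (an illustration of the pseudocode, not the SVP64 implementation): each element does a 64x64->128 multiply-add, writes the low half out, and carries the high half forward in RC.

```python
MASK64 = (1 << 64) - 1

def bigint_mul_scalar(ra_limbs, rb):
    """Multiply a little-endian list of 64-bit limbs by a 64-bit
    scalar with the mul-add chain: each step computes RA*RB + RC,
    emits the low 64 bits (RT), and chains the high 64 bits (RC)
    into the next element.
    """
    rc = 0
    out = []
    for ra in ra_limbs:          # vectorised over RA
        s = ra * rb + rc         # prod[0:127] + EXTZ(RC), <= 128 bits
        out.append(s & MASK64)   # RT <- low half of sum
        rc = s >> 64             # RC <- high half, carried onward
    out.append(rc)               # final carry-out limb
    return out
```

Repeated once per limb of the other operand, this is schoolbook big-integer multiplication, and the back end is free to execute the chain 64 bits at a time over several cycles or 1024 bits at once.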

thus interestingly whilst the ISA may remain at least
familiarly-scalar (or familiarly-arbitrarily-Vector-Scalable)
the implementor is free and clear to choose between
either small ALUs and allow chaining across multiple
clock cycles, or spam a massive-wide back-end ALU
instead.

(in other words it is the Cray "Vector Chaining" thing,
all over again)

point being you *don't* need to actually design a
1024-bit ISA, you can design a Scalable one instead.

l.

Josh Vanderhoof

Oct 21, 2022
In graphics, drawing instanced meshes (i.e. drawing the same object
multiple times in different positions) can do a lot of work while
staying in the cache.

Josh Vanderhoof

Oct 21, 2022
There are some really cool deep mandelbrot zoom videos on Youtube. It's
pretty amazing if you remember watching mandelbrot renders trickling out
pixel by pixel on an Amiga back in the 80's.

Thomas Koenig

Oct 22, 2022
Josh Vanderhoof <x@y.z> schrieb:

> There are some really cool deep mandelbrot zoom videos on Youtube. It's
> pretty amazing if you remember watching mandelbrot renders trickling out
> pixel by pixel on an Amiga back in the 80's.

Or on a Commodore 64...

At the time, I wrote a Mandelbrot program which invoked the
C-64 BASIC floating point routines directly. It gave a
10x performance boost over the BASIC version.

A friend with an Amiga was _much_ faster with a simple BASIC
version. I gave up on Mandelbrot then, only coming back to it
later on a vector computer, with a laser printer as output.