Is Von Neumann Doomed ???


Mark Thorson

Jul 9, 1994, 12:57:41 AM
By the end of the decade, advances in FPGA density will make
the von Neumann architecture totally obsolete. People won't
build hardwired CPU's when they can build soft CPU's that can
transform themselves into the most appropriate architecture for
the applications to which they are applied.

For example, a computer animation program might need a hardware
raster-drawing engine. A multimedia system could have a DSP-like
architecture. A text editor might have hardware for fast drawing
of fonts or window scrolling.

When will this transformation of our industry take place? Very
soon. The critical mass will be achieved when FPGA's get large
enough to implement a 386-class processor. After that, it won't
make sense to build dedicated CPU's.

These FPGA's will have a number of differences from the FPGA's
we have today. Here are some features they are likely to have:

* FPGA Program Loader -- today's FPGA's have slow and clumsy
mechanisms for loading the SRAM cells that define their
circuitry. Future FPGA's will have fast mechanisms so that
the cost of reloading part of the circuitry will be low.
These mechanisms will resemble the instruction prefetch logic
of today's high-end microprocessors. Loading can be controlled
by the FPGA circuitry itself, and can be selective so that
it isn't required to reload the whole chip when you only
want to change a small part.

* Alternate FPGA Program Cells -- duplicate sets of the SRAM
cells which control FPGA functionality will allow hardware
subroutine-like context switches, so that frequently-used
functions can be preloaded and switched in temporarily without
any run-time loading overhead. Again, this can be applied
selectively. For example, on a cache miss you can switch
out your cache controller and switch in your DRAM controller
on the next cycle.

* Translation Lookaside Buffer -- FPGA-based computers will
still need to talk to memory. To accelerate this process,
they certainly will have translation lookaside buffers, and
they might have segment registers. :-)

* RISC CPU -- this will be an absolutely dirt-simple CPU,
probably with just one register like the PDP-8. It will be
used for booting up the FPGA after power-on reset and
dispatching interrupts to the circuits that service interrupts.
It will be designed for minimum die area, rather than
performance.
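To make the "Alternate FPGA Program Cells" and selective-loading ideas above concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: the class name `MultiContextFabric`, its methods, and the cycle costs (one cycle per loaded cell, one cycle per context switch) are hypothetical choices for illustration, not a description of any real device.

```python
class MultiContextFabric:
    """Toy model of an FPGA with duplicate sets of configuration cells.

    Each context is one complete set of configuration bits. Only the
    active context defines the circuit; the others sit preloaded.
    All costs below are illustrative assumptions.
    """

    def __init__(self, num_contexts, num_cells):
        self.num_cells = num_cells
        self.contexts = [[0] * num_cells for _ in range(num_contexts)]
        self.active = 0
        self.cycles = 0  # running total of cycles spent on (re)configuration

    def load(self, context, start, bits, cells_per_cycle=1):
        """Selective reload: rewrite only the cells in [start, start+len(bits))."""
        if start < 0 or start + len(bits) > self.num_cells:
            raise ValueError("load would fall outside the configuration memory")
        self.contexts[context][start:start + len(bits)] = bits
        self.cycles += -(-len(bits) // cells_per_cycle)  # ceiling division

    def switch(self, context):
        """Subroutine-like context switch: assumed to cost a single cycle."""
        self.active = context
        self.cycles += 1

    def active_bits(self):
        return self.contexts[self.active]


if __name__ == "__main__":
    fabric = MultiContextFabric(num_contexts=2, num_cells=8)
    fabric.load(0, 0, [1, 1, 0, 0])   # e.g. a "cache controller" in context 0
    fabric.load(1, 4, [1, 0, 1, 1])   # e.g. a "DRAM controller" in context 1
    fabric.switch(1)                  # on a cache miss, swap controllers
    print(fabric.active_bits(), fabric.cycles)
```

The point of the model is the asymmetry: loading costs grow with the number of cells rewritten, while switching to a preloaded context is a constant cost regardless of how large that context is.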

Here are some features of today's CPU's that FPGA-based CPU's
are not likely to have, because they can be efficiently
implemented in FPGA cells or they simply aren't needed:

* Addressing Mode Logic -- compilers directly generate circuitry
which step through arrays, push things on stacks, etc. No
general-purpose address logic is needed.

* Register Files -- only those registers which are needed are
implemented. No space is wasted on unused registers. Putting
registers in arbitrary places in the CPU eliminates some buses
(interconnect often eats up half the die on today's CPU's).

* Integer ALU's -- you don't need all the circuitry to implement
logical operations or even subtraction if all you're planning
to do with an ALU is add. This allows more parallel ALU's to
be crammed into the FPGA chip, even though its gate density
is lower than a hardwired CPU made from equivalent technology.

* Floating-point unit -- barrel shifters and multipliers eat up
lots of chip real estate, which can be more profitably used
by other structures if you're doing integer work, such as
text editing.

Why will FPGA-based CPU's be faster than hardwired CPU's?
They will have greater parallelism. At best, superscalar
techniques seem to result in a real-world performance of
about two instructions per clock. An FPGA-based CPU,
however, has no restrictions on the simultaneous operation of
its parts. It resembles a circuit board more than an
instruction-list processor.

How will this change the programmer's interface? Compilers
will need to deal with the machine at a much lower level of
interface. These compilers will evolve from present-day hardware
description languages like VHDL. They probably will be visual
programming languages, making the floorplan of the machine
accessible to the programmer.

So now that I've explained it, does everyone agree this is the
correct vision of the future? Criticism is invited.

[Note: Formerly m...@cup.portal.com, but now e...@netcom.com.]

Ulrich Graef

Jul 9, 1994, 2:44:08 PM
In article <eeeCsn...@netcom.com>, e...@netcom.com (Mark Thorson) writes:
> By the end of the decade, advances in FPGA density will make
> the von Neumann architecture totally obsolete. People won't
> build hardwired CPU's when they can build soft CPU's that can
> transform themselves into the most appropriate architecture for
> the applications to which they are applied.

... detailed description deleted

OK, I agree that from a technical viewpoint this concept
has advantages in dollars per operation.

But:

FPGA-layout, object format, standards:

Which standard FPGA architecture do you want to use?
As the technology advances, you will have a bigger
FPGA which either cannot run the old `programs' or will
leave large parts of the chip area unused.
You must define a common object format. If you directly use
the raw data for the SRAM cells, you cannot move to other
(newer, bigger) FPGAs. If you use a VHDL description,
you must compile every time you start a program.


Time-sharing, multi-processing

As in every real computer, you will have
limited computing resources, especially a limited number
of FPGAs (possibly only one).
At a process switch, you must save not only your registers
but also all the internal state of your FPGA
(and reload that state when you return to the process).
Imagine you have built a special flip-flop (programmed in
your FPGA) to compute something very clever and fast.
You must implement a lot of additional logic so that you can
set the state of this register correctly on re-entering the process.


Consider VLIW architectures

VLIW computers (Very Long Instruction Word) are similar
to your concept, or a step toward it. They fetch several operations at a
time and perform them in parallel.
There is no major computer line which is based on VLIW.


Memory Bandwidth

You will have the same problem as VLIW architectures,
some superscalar processors (DEC Alpha, ...), and other very fast ones.
The fast processor spends a lot of time waiting to get data out
of memory.
Do you want to implement all memory in your FPGA?
Most programs used today (e.g. a text editor, as you mentioned)
need dynamic data structures, which cannot be implemented
statically in your FPGA.


Therefore I can only see advantages in using your concept in embedded applications
or special-purpose hardware (also used in peripherals of custom computers).

... and that's the place where FPGAs are very successful today!

Sincerely,

Uli

--
Ulrich Graef | home: 06155 62493 int: +49 6155 62493
Lichtenbergweg 11 | office: 06103 752 364 int: +49 6103 752 364
D-64347 Griesheim +-------------------------------------------------------
Germany | e-mail: gr...@iti.informatik.th-darmstadt.de

Zaid Kadhim Salman

Jul 9, 1994, 3:08:36 PM
In article <eeeCsn...@netcom.com>, e...@netcom.com (Mark Thorson) writes:
|
|>
|> When will this transformation of our industry take place? Very
|> soon. The critical mass will be achieved when FPGA's get large
|> enough to implement a 386-class processor. After that, it won't
|> make sense to build dedicated CPU's.

Yeah right. I'd like to know what the basis is of your prognostication.
Do you work for an FPGA company that is developing this cutting
edge technology? Are you with the government, and are going
to be declassifying information?


|>
|> Why will FPGA-based CPU's be faster than hardwired CPU's?
|> They will have greater parallelism. At best, superscalar
|> techniques seem to result in a real-world performance of
|> about two instructions per clock. An FPGA-based CPU,
|> however, has no restrictions on the simultaneous operation of
|> its parts. It resembles a circuit board more than an
|> instruction-list processor.

Okay, but what about scheduling? That's still an issue. If you can't
find the parallelism in the code, how can you run it?

|>
|> How will this change the programmer's interface? Compilers
|> will need to deal with the machine at a much lower level of
|> interface. These compilers will evolve from present-day hardware
|> description languages like VHDL. They probably will be visual
|> programming languages, making the floorplan of the machine
|> accessible to the programmer.
|>
|> So now that I've explained it, does everyone agree this is the
|> correct vision of the future? Criticism is invited.

Not at all. I think your predictions are based entirely on speculation
and ideals. That's fine, those are usually what keeps innovations coming.
But don't start making broad, sweeping statements about what's dead and
what's not without any hard evidence. Listen to this:

"In theory, there's no difference between theory and practice,
but in practice, there is."

Ever since the birth of technology, people who invent new technologies
say that those will make everything else obsolete. At Michigan, there is a
group that's working on a superscalar MIPS R3000 implemented in GaAs. They
believe that GaAs is the next step beyond silicon. Also at Michigan, there are
guys working on multi-valued devices, and THEY believe that those are the
next step beyond binary devices. People who do optical computers
say optical computing is the next step beyond electrical computing. People
who do superconductors say superconductors are the next step in
computing. I personally believe that 3D fab. is the next step in
computing. Or maybe neural nets.

The point is that that kind of prognostication has been going on
for decades, and the only people who are right are the ones who have
made their proposed technology faster and cheaper than what currently
exists. Otherwise there is no reason to use it just because it
looks good in theory.

I'll believe Von Neumann is dead when it actually dies.

--
Zaid, not romantically/musically/architecturally (choose one) speaking for SGI

Preston Briggs

Jul 9, 1994, 2:36:48 PM
e...@netcom.com (Mark Thorson) writes:
>By the end of the decade, advances in FPGA density will make
>the von Neumann architecture totally obsolete. People won't
>build hardwired CPU's when they can build soft CPU's that can
>transform themselves into the most appropriate architecture for
>the applications to which they are applied.

[...]

>So now that I've explained it, does everyone agree this is the
>correct vision of the future?

I don't agree. Things may change, and FPGAs may become amazingly dense,
but you'll still be waiting on compilers at the end of the decade.

Currently, we obtain great (and I believe necessary) simplifications
by fixing many pieces of the overall problem. We assume fixed-sized
integers and pointers, certain FP formats, registers, cache, etc.
These assumptions give us a standard environment to work with. If you
throw all these assumptions away and attempt to create optimized
hardware for each new program, you'll only make the (already
intractable) compilation problems much more difficult.

You gave a time frame of this decade. I'm hoping, perhaps
optimistically, that we see some of the current research coming out in
production form by the end of the decade. Things just don't happen
quickly enough for a grand revolution in compilation and architecture,
of the sort you imply, to be accomplished in 5 or 6 years.

Preston Briggs

John Lazzaro

Jul 9, 1994, 3:20:47 PM
In article <eeeCsn...@netcom.com>, Mark Thorson <e...@netcom.com> wrote:
> [... arguments for FPGA's replacing microprocessors in comp.arch ...]

>
>So now that I've explained it, does everyone agree this is the
>correct vision of the future? Criticism is invited.
>

Another pass needs to be made from the VLSI perspective, as opposed to
the computer architecture perspective, looking at both the wires and
the transistors. There are electrical and structural prices to be paid
for the flexibility of dynamic reprogrammability that need to be
factored into the equation; these may end up being the dominant terms.


Scott Pakin

Jul 9, 1994, 3:35:53 PM
In article <eeeCsn...@netcom.com> e...@netcom.com (Mark Thorson) writes:

[DELETED: Explanation of why denser FPGAs will make dedicated microprocessors
obsolete.]

> So now that I've explained it, does everyone agree this is the
> correct vision of the future?


I disagree.

First, density is not the only issue. Speed is also important. I'd guess that
an FPGA-based CPU would be about an order of magnitude slower than the
equivalent in custom VLSI. You claim that the gain would be in parallelism.
That's fine when your application has a lot of parallelism. But what about
sequential sections of your code? Those will run about 10 times slower than on
a general-purpose CPU. And don't forget about the time it takes to download a
new design. Let's say, optimistically, that it takes 1/4 second to download a
microprocessor to an FPGA. On a 200MHz processor with CPI=1.0, that's the
equivalent of 50 million instructions. So if your code looks like this:

    begin loop
        sequential section          <-- 10x slower than VLSI
        download parallel section   <-- +50M instructions on VLSI
        parallel section            <-- ???x faster than VLSI
        download sequential section <-- +50M instructions on VLSI
    end loop

there's an amortized cost issue. Your parallel section had better be *really*
big and almost completely parallelizable in order to see an improvement over
the hardwired microprocessor. Remember, if an FPGA-based microprocessor is ten
times slower than a four-way superscalar hardwired machine, it had better
sustain a minimum of 40-fold parallelism. In addition, if the parallel section
performs a fairly common operation (e.g. FFT), it can probably be done pretty
well already with a DSP chip.
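The arithmetic in the argument above can be checked with a short Python sketch. The function names are hypothetical, and the loop model is a deliberate simplification: it measures everything in "hardwired instruction times" on a scalar CPI = 1 machine, charges one download before and one after the parallel section, and assumes the FPGA runs every instruction a fixed factor slower.

```python
def download_cost_in_instructions(download_seconds, clock_hz, cpi=1.0):
    """Instructions a hardwired CPU could retire during one FPGA download."""
    return download_seconds * clock_hz / cpi


def required_parallelism(fpga_slowdown, issue_width):
    """Parallelism the FPGA must sustain just to tie a superscalar CPU."""
    return fpga_slowdown * issue_width


def fpga_loop_cost(seq, par, slowdown, parallelism, download):
    """Cost of one loop iteration on the FPGA, in hardwired-instruction times.

    seq and par are the instruction counts of the sequential and parallel
    sections as they would run on a hardwired CPU with CPI = 1.
    """
    sequential = seq * slowdown               # sequential code just runs slower
    parallel = par * slowdown / parallelism   # slower per op, but many at once
    return sequential + download + parallel + download


if __name__ == "__main__":
    download = download_cost_in_instructions(0.25, 200e6)
    print("one download, in instructions:", download)
    print("parallelism needed to tie:", required_parallelism(10, 4))

    # Example workload sizes are made up purely for illustration.
    seq, par = 1e6, 1e9
    ratio = fpga_loop_cost(seq, par, 10, 40, download) / (seq + par)
    print("FPGA time / hardwired time per iteration:", ratio)
```

Varying `par` in the last example shows the amortization effect: the two downloads are a fixed cost per iteration, so the FPGA only comes out ahead when the parallel section is large compared with the download cost.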

But let's assume that you write a program that's completely parallelizable,
unique in operation, and can be handled by a single custom-designed
microprocessor (i.e. you need only one FPGA download). If there's anyone else
on the system who also has a completely parallelizable program that can be
handled by a single, but *different*, custom-designed microprocessor, you'll
both be killed on context switches. Speaking of which, how do you write an
operating system for a microprocessor that keeps changing? You'd have to
download a program to the FPGA for the OS. There goes another 100M
instructions you could have executed on a general-purpose hardwired
microprocessor.

So let's say that there will be only one program ever running on the machine,
just like in the old batch-processing days. Or maybe that the FPGA is so
incredibly massive that you can fit the hardware for every section of every
user's code. (In the latter case, a hardwired microchip could probably use the
same quantity of chip real estate for a bigger cache or another processor.)
Basically, let's pretend that the download time is negligible. Oh, yeah: We
also have to pretend that we can fit a big chunk of each program's memory on
the FPGA, too, or we'd be killed by memory bandwidth limitations. Can we
realistically compile most of the programs we need to run such that they'd run
faster on an FPGA-based system than on a hardwired system? Can we debug them?
Do we have to train hordes of single-language software grunts in hardware
design? (Ok, so that last question does not raise a technical issue, but it's
still worth considering.)

Don't get me wrong; I think FPGAs are really neat and, as yet, underutilized.
I just see the flexibility vs. performance issue as too overwhelming for FPGAs
to make less flexible designs obsolete. While by no means an original idea,
one alternative to replacing VLSI with FPGAs is to implement the main processor
in custom VLSI but add an FPGA co-processor. That way, sequential and
marginally-parallel code can execute on one and massively-parallel code can
execute on the other. I believe there's a group in France that's even writing
compilers for that sort of machine.

-- Scott

Mark Thorson

Jul 9, 1994, 6:27:21 PM
This arrived by e-mail. I'm posting it because he can't.

----------------------------- begin quoted text -------------------------

Date: Sat, 9 Jul 94 13:51:45 CDT
From: kop...@gate.ee.lsu.edu (David Koppelman)
Message-Id: <940709185...@gate.ee.lsu.edu>
Received: by omega.ee.lsu.edu.ee.lsu.edu (4.1/SMI-4.1)
id AA01239; Sat, 9 Jul 94 13:51:32 CDT
To: e...@netcom.com
Subject: Von Neumann & Fate
Status: R

I tried to post the following, but my newsreader would not take
it. (If you like you could post it for me.)

In-reply-to: e...@netcom.com's message of Sat, 9 Jul 1994 04:57:41 GMT
Newsgroups: comp.arch
Subject: Re: Is Von Neumann Doomed ???
References: <eeeCsn...@netcom.com>
Distribution:
--text follows this line--


In article <eeeCsn...@netcom.com> e...@netcom.com (Mark Thorson) writes:

>By the end of the decade, advances in FPGA density will make
>the von Neumann architecture totally obsolete. People won't
>build hardwired CPU's when they can build soft CPU's that can
>transform themselves into the most appropriate architecture for
>the applications to which they are applied.
>
>For example, a computer animation program might need a hardware
>raster-drawing engine. A multimedia system could have a DSP-like
>architecture. A text editor might have hardware for fast drawing
>of fonts or window scrolling.

...



>So now that I've explained it, does everyone agree this is the
>correct vision of the future? Criticism is invited.
>

Several problems would have to be overcome:

1) What do you do about context switching? Such a system
would either have to spend a considerable (wrt a time slice) time
reconfiguring, or else the FPGA would be divided between processes
(like a cache) with only a small portion being active at any time.

2) Could a compiler fully exploit it? Such a system would be great
for certain applications, such as signal processing, graphics, and
other applications having systolic computations or frequently
repeated operations. Could a conventional compiler discover and
exploit these? If it were up to the programmer then debugging
would be a nightmare. I would guess that in most cases the compiler
would specify something like a conventional CPU; perhaps several
in parallel. Would a programmable gate array offer any advantages
under those circumstances?

Of course for certain applications, nothing could touch it.



/\ /\ /\
<> <> David M. Koppelman <>
<> kop...@gate.ee.lsu.edu <> 102 EE Building <>
<> (504)-388-5482 <> Louisiana State University <>
<> <> Baton Rouge, La 70808 <>
~~ ~~ ~~


Steve Heller

Jul 10, 1994, 12:04:57 AM

e...@netcom.com (Mark Thorson) wrote:


>By the end of the decade, advances in FPGA density will
>make the von Neumann architecture totally obsolete.
>People won't build hardwired CPU's when they can build
>soft CPU's that can transform themselves into the most
>appropriate architecture for the applications to which
>they are applied.
>
>For example, a computer animation program might need a
>hardware raster-drawing engine. A multimedia system
>could have a DSP-like architecture. A text editor
>might have hardware for fast drawing of fonts or
>window scrolling.
>

[text omitted for brevity]

What a great idea! I certainly hope you're right, because
this approach will make it possible to design one's own
language for a specific application and get phenomenal
performance as a result. Since I love designing languages,
this is a wonderful prospect for me.


Steve Heller (sthe...@pipeline.com)

Henry J. Cobb

Jul 10, 1994, 12:43:07 PM
In article <2vmsjk$3...@miranda.mti.sgi.com> za...@layout3.mti.sgi.com (Zaid
Kadhim Salman) writes:
>
>Okay, but what about scheduling? That's still an issue. If you can't
>find the parallelism in the code, how can you run it?
>--
>Zaid, not romantically/musically/architecturally (choose one) speaking for SGI

Exactly, the high performance desktops to come will not be
programmed by procedural languages that attempt to squeeze everything
through a single thread of execution.

Instead a visual, object oriented language, with explicit scheduling
at the object boundary will be used.

The true Von Neumann bottleneck is a single thread of instruction
fetch. The compiler must guess at which program objects will be
concurrently active, and if it fails, the many execution units of a
super scalar or VLIW machine starve.

There is another, much easier path that does not wait for FPGAs to
"catch up". Simply allow the super scalar machine to fetch instructions
along several different threads of execution. (i.e.: Microthreading).
--
Henry J. Cobb hc...@fly2.berkeley.edu
All items Copyright (c) 1994, by their respective authors, permission is
granted to redistribute as long as proper credit is given.

Rajesh Gupta

Jul 10, 1994, 1:32:59 PM
In article <eeeCsn...@netcom.com>, e...@netcom.com (Mark Thorson) writes:
|> By the end of the decade, advances in FPGA density will make
|> the von Neumann architecture totally obsolete. People won't
|> build hardwired CPU's when they can build soft CPU's that can
|> transform themselves into the most appropriate architecture for
|> the applications to which they are applied.
|>
...


You gloss over several fundamental problems in making this prediction
-- specification (programming), synthesis (compilation), and
verification (debug) -- all of which will have to undergo a revolution
before the "soft CPU" can hit the mainstream. The problem of
generating efficient FPGA hardware from "simple" (hardware description)
language descriptions is far from solved. Even an infinitely dense
FPGA cannot help with this, since the existence of a solution for compiling
arbitrary language descriptions into (arbitrary) hardware
architectures is as yet unproven. Then there are the perils of the true
parallel programming that is needed to bring out the inherent
parallelism; it is just not common practice.

Existing data formats make it possible to do some of the optimizations
necessary for generating an efficient implementation. If anything, the
prevalence of FPGAs will perhaps force even more standardization *for
general-purpose computing* environments.

Yes, FPGAs will have a role in specific application areas - embedded
systems, for example.

Of course, the present model of computation cannot
continue forever, but a total revolution in programming environments,
compilation, and machine architecture by the end of this decade does
not look likely.


Rajesh

David Jones

Jul 10, 1994, 3:34:33 PM
In article <eeeCsn...@netcom.com> e...@netcom.com (Mark Thorson) writes:
>By the end of the decade, advances in FPGA density will make
>the von Neumann architecture totally obsolete. People won't
>build hardwired CPU's when they can build soft CPU's that can
>transform themselves into the most appropriate architecture for
>the applications to which they are applied.

Although research is being undertaken in this area, I doubt that FPGAs will
obsolete the von Neumann architecture.

Today, you can get systems that incorporate programmable logic onto a CPU.
I believe Motorola is selling 68000,68020 and 68040 CPU cores alongside a
gate array (mask programmable). You submit your design to Motorola, and you
end up with a custom embedded controller. It is only a matter of time before
someone puts an FPGA on a microprocessor.

However, I believe that the reverse will also happen: FPGAs are nice for
small circuits, but the flexibility of current architectures leaves much to
be desired for large circuits. The largest Xilinx has about 1024 CLBs, where
each CLB has 2 4-input lookup tables (LUTs), a 3-input LUT, and 2 flip-flops.
Many FPGA-based systems today require more logic than that. They end up
putting more than one FPGA on a board, and the interconnect gets messy.

One way to achieve greater density with reconfigurable logic is to
time-multiplex the logic blocks. That is, a circuit can be levelized, and
the levels can be evaluated sequentially. Some graduate students working
under Dr. Kuh at the University of California, Berkeley developed such a
device, and my Master's thesis concerns itself with the design of a
high-capacity time-multiplexed FPGA. Since time-multiplexed devices replace
routing channels (high capacitance) with a memory, the devices can often
run at higher clock speeds, which mitigates somewhat the performance impediment
of sequential evaluation.

Something that I do not intend to investigate for my thesis (and which is
being investigated by other researchers at the University of Toronto) is
how best to incorporate high-density memory into an FPGA-based system.
Xilinx gives you a few words of 4-bit memory on an FPGA, but serious
applications require large amounts (megabytes or more) of memory, which
is easily provided in the form of off-the-shelf DRAM chips. The problem
is how best to have them addressed by the FPGA system.

A time-multiplexed FPGA resembles a microprocessor to a great degree. My
design will be programmed with a number of instructions, of a VERY simple
format. There are no conditional branches, and there is only one "ALU"
operation: a 4-bit LUT. However, the principles of time-multiplexed FPGAs
can be generalized to include macrocells such as 32-bit adders, shift
networks, and the like. It will soon get to a point where such an FPGA
will achieve better-than-microprocessor performance for certain specialized
tasks.
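As a rough illustration of the time-multiplexed scheme described above (a straight-line program whose only operation is a 4-input LUT), here is a small Python sketch. The encoding is an assumption made for this example, not the thesis design: each instruction is a pair of a 16-bit truth table and four source-signal indices, and every instruction appends its 1-bit result to a growing signal list. The helper names `make_lut` and `run_lut_program` are hypothetical.

```python
def make_lut(fn):
    """Build a 16-bit truth table from a 4-argument boolean function.

    Bit i of the table is the output when the four inputs, read as a
    binary number with input 0 as the least significant bit, equal i.
    """
    table = 0
    for index in range(16):
        bits = [(index >> k) & 1 for k in range(4)]
        if fn(*bits):
            table |= 1 << index
    return table


def run_lut_program(program, inputs):
    """Evaluate a levelized circuit one LUT at a time, with no branches.

    program: list of (truth_table, (src0, src1, src2, src3)) tuples, in
             an order where every source is computed before it is used.
    inputs:  list of primary input bits; results are appended after them.
    """
    signals = list(inputs)
    for truth_table, sources in program:
        index = 0
        for position, src in enumerate(sources):
            index |= (signals[src] & 1) << position
        signals.append((truth_table >> index) & 1)
    return signals


# A 1-bit full adder as a two-instruction program. Signal 3 is a spare
# input tied to 0 because both functions need only three inputs.
SUM = make_lut(lambda a, b, c, d: a ^ b ^ c)
CARRY = make_lut(lambda a, b, c, d: (a & b) | (a & c) | (b & c))
FULL_ADDER = [(SUM, (0, 1, 2, 3)), (CARRY, (0, 1, 2, 3))]

if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            for cin in (0, 1):
                out = run_lut_program(FULL_ADDER, [a, b, cin, 0])
                print(a, b, cin, "-> sum", out[4], "carry", out[5])
```

Levelizing the circuit is what guarantees the ordering requirement: if all LUTs of level n are listed before those of level n+1, every source index refers to a signal that already exists when its instruction runs.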

The problem right now with FPGA-based processing is twofold: lack of memory
(which is being addressed to a certain degree) and the overhead required to
program the device (if the FPGA is to replace a processor, then we will be
running a multitasking OS on it, right?) Since the time-multiplexed FPGAs
are already running programs fetched from wide memories, these devices will
make good prototypes for the investigation of FPGA-based computers.

Until then, RISC rules.

Christopher J. Vogt

Jul 10, 1994, 7:05:35 PM
In article <eeeCsn...@netcom.com>, Mark Thorson <e...@netcom.com> wrote:
>By the end of the decade, advances in FPGA density will make
>the von Neumann architecture totally obsolete. People won't
>build hardwired CPU's when they can build soft CPU's that can
>transform themselves into the most appropriate architecture for
>the applications to which they are applied.

[...]

>
>So now that I've explained it, does everyone agree this is the
>correct vision of the future? Criticism is invited.
>

Hmm. I'll bet you never heard of the company Nanodata, which
sold a soft-architecture machine called the QM-1 in the
late 70's (I used to program that beastie for McDonnell Douglas).
I don't think the company was very successful. Your observation
seems logical, but historically it didn't work out that way, although
I won't venture a guess as to why.

--
Christopher J. Vogt vo...@netcom.com
From: El Eh, CA

Herman Rubin

Jul 11, 1994, 9:00:35 AM
In article <HCOBB.94J...@fly2.berkeley.edu> hc...@fly2.berkeley.edu (Henry J. Cobb) writes:
>In article <2vmsjk$3...@miranda.mti.sgi.com> za...@layout3.mti.sgi.com (Zaid
> Kadhim Salman) writes:

.............................

> The true Von Neumann bottleneck is a single thread of instruction
>fetch. The compiler must guess at which program objects will be
>concurrently active, and if it fails, the many execution units of a
>super scalar or VLIW machine starve.

I doubt that there is a true "Von Neumann" machine around now. Those
machines had single instruction threads (a few did have IO in parallel
with computing), and none of the present type registers. The only
addressing mode was immediate, which meant that programs had to be
self-modifying. A few had indirect addressing. The registers were
specifically involved in computation, such as the accumulator, which
was where addition was actually carried out, and the multiplier/quotient
register, which held the multiplier, and where the quotient was formed.
Unlike the present machines, these registers were directly accessible
to the program.

Except for drum and delay-line machines, where memory was not accessible
in constant time, the only things the programmer or compiler had to
keep track of were what was in the few registers, and what was where
in memory. In addition, most arithmetic operations altered at least
one of those few registers.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hru...@stat.purdue.edu (Internet, bitnet)
{purdue,pur-ee}!snap.stat!hrubin(UUCP)

David Chase

Jul 11, 1994, 12:18:51 PM
pre...@noel.cs.rice.edu (Preston Briggs) writes:
|> You gave a time frame of this decade. I'm hoping, perhaps
|> optimistically, that we see some of the current research coming out in
|> production form by the end of the decade. Things just don't happen
|> quickly enough for a grand revolution in compilation and architecture,
|> of the sort you imply, to be accomplished in 5 or 6 years.

Ah, but this is only true because the people with cash to fund
such efforts are too chicken to risk it. You could probably name
10 sharp compiler people you want to have working with you
if you planned to do this, and you can probably imagine that each
of those people could be convinced with 5 years of employment at
a particular salary -- it's hard, sure, but it's possible. At the
outside, you're talking $5 million per year (500,000 per person,
per year), for 5 years -- to develop a new technology. Compare
that with what people spend on chip design for existing architectures,
and that isn't too bad.

Of course, the hard part is convincing the people with money that
you've got the right 10 people, and that this is the best place
to put their money (best return on investment), and that
something will actually come of it. They're not interested
in paying 10 people to play for 5 years with nothing to show.

David Chase (speaking for myself, as if it weren't obvious)

Timothy Jehl~

Jul 11, 1994, 12:21:57 PM

> >By the end of the decade, advances in FPGA density will make
> >the von Neumann architecture totally obsolete. People won't
> >build hardwired CPU's when they can build soft CPU's that can
> >transform themselves into the most appropriate architecture for
> >the applications to which they are applied.
> >
> >For example, a computer animation program might need a hardware
> >raster-drawing engine. A multimedia system could have a DSP-like
> >architecture. A text editor might have hardware for fast drawing
> >of fonts or window scrolling.
>
> ...
>
> >So now that I've explained it, does everyone agree this is the
> >correct vision of the future? Criticism is invited.
> >

The problem is performance. There is absolutely no way to get the sort
of speed you want out of a high performance processor with a device meant to
implement general purpose logic. The functionality may or may not be doable,
but the performance won't be there.

TJ
--

Tony Lee

Jul 11, 1994, 12:35:31 PM

Great topic for comp.arch.

If I were designing a new computer based on FPGAs, I would take a different
(more evolutionary) approach.

Step 1. Design a high-density FPGA into a popular computer (PC) as a
coprocessor with access to all the address, data, and other hardware lines,
such as instruction fetch.

Step 2. Design the software (VHDL) for the FPGA to analyze the performance
bottlenecks of CPU-bound software such as 3D rendering, ray tracing,
SPECint92, SPECfp92, popular FPGA/PCB routing software, etc., by monitoring
the address, data, and instruction access patterns of these programs. Most
research indicates that 90% of CPU time is spent in 10% of the code.
We will optimize that 10% of the code once we have identified it.

Step 3. After analyzing the data from step 2, design new VHDL to
optimize only the critical regions. This VHDL code will be designed to
speed up the very small amount of object code in which the app spends 90%
of its time. We can insert a BREAK instruction into the corresponding
routines in the original software to make the change transparent.

Step 4. Rerun the software with the FPGA and see how fast the optimized
version runs. I believe that for most CPU-bound applications the speedup
will be so amazing that there will be a great deal of commercial potential
in this design.
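
The profiling idea in Steps 2 and 3 can be sketched in software. This is
only an illustration, not anything from the post: the routine names and
timings are invented, and hot_routines and amdahl_speedup are hypothetical
helper names. The sketch picks the hottest routines until they cover 90% of
the run time, then uses Amdahl's Law to bound the overall gain from
accelerating only those routines.

```python
def hot_routines(profile, coverage=0.90):
    """Pick the hottest routines until they cover `coverage` of total time.

    `profile` maps routine name -> measured time (any consistent unit).
    Returns (list of chosen names, fraction of total time they cover).
    """
    total = sum(profile.values())
    chosen, covered = [], 0.0
    for name, t in sorted(profile.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= coverage * total:
            break
        chosen.append(name)
        covered += t
    return chosen, covered / total


def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the run time is made `factor`x faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Invented profile of a ray tracer, in percent of run time (assumption).
profile = {"trace_ray": 70, "shade": 22, "parse_scene": 5, "write_image": 3}

names, fraction = hot_routines(profile)
print(names, fraction)
print(amdahl_speedup(fraction, factor=20))
```

Note the ceiling this implies: if the accelerated code accounts for 90% of
the run time, the overall speedup can never exceed 10x, however fast the
FPGA version of that code is.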

Now imagine, with this design you can

1. Reroute your FPGA or PCB designs in minutes or maybe seconds instead
of hours with the current fastest software/hardware.

2. Get a true hard-realtime ray-tracing virtual reality engine rendering
extremely smoothly on your computer. Just think about the game potential!

3. Redraw the most complicated AutoCAD design in a 10th or 100th of a second!

If anyone is interested in this and has access to the resources/experience to
commercialize it, I'd love to hear from him/her. I already have a business
plan outline for this project.

Tim Callahan

Jul 11, 1994, 3:06:48 PM
In article <2vmu6p$f...@vixen.cso.uiuc.edu>,

Scott Pakin <pa...@indigo-rilla.cs.uiuc.edu> wrote:
>
>That's fine when your application has a lot of parallelism. But what about
>sequential sections of your code? Those will run about 10 times slower than on
>a general-purpose CPU. And don't forget about the time it takes to download a
>new design. Let's say, optimistically, that it takes 1/4 second to download a
>microprocessor to an FPGA. On a 200MHz processor with CPI=1.0, that's the
>equivalent of 50 million instructions. So if your code looks like this:
>
> begin loop
> sequential section <-- 10x slower than VLSI
> download parallel section <-- +50M instructions on VLSI
> parallel section <-- ???x faster than VLSI
> download sequential section <-- +50M instructions on VLSI
> end loop
>
>there's an amortized cost issue. Your parallel section had better be *really*
>big and almost completely parallelizable in order to see an improvement over
>the hardwired microprocessor.

Yes, Amdahl's Law can be brutal in this case. However, there may
be some ways of reducing the costs for the not-so-parallel sections of
code. It seems that an *incrementally* reconfigurable gate array
would be desirable -- perhaps the existing datapath is general enough that
you can run the new code section by just reconfiguring the control section.
The obvious solution (stated in other posts and later in this article)
is to have a very simple hardwired RISC sitting off to the side for running
those sections of code that don't have enough parallelism to make
the reconfiguration cost worthwhile. The RISC is also useful for running
kernel calls -- less state to preserve.
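
The amortized-cost argument in the quoted loop can be put into numbers.
The sketch below is a rough model under stated assumptions, not a
measurement: the 200 MHz clock, CPI of 1.0, 1/4-second download, and 10x
sequential slowdown come from the quoted post, while the workload sizes,
the 100x parallel speedup, and all function names are made up for
illustration.

```python
def download_cost_instructions(download_s, clock_hz, cpi=1.0):
    """Instructions a hardwired CPU could retire during one FPGA download."""
    return download_s * clock_hz / cpi


def hardwired_loop_time(seq_instr, par_instr, clock_hz, cpi=1.0):
    """Seconds per loop iteration on the hardwired CPU."""
    return (seq_instr + par_instr) * cpi / clock_hz


def fpga_loop_time(seq_instr, par_instr, clock_hz, seq_slowdown,
                   par_speedup, download_s, downloads=2, cpi=1.0):
    """Seconds per loop iteration on the reconfigurable machine.

    Each iteration pays for `downloads` reconfigurations (one for the
    parallel section, one to get back to the sequential section).
    """
    seq_s = seq_instr * cpi / clock_hz * seq_slowdown
    par_s = par_instr * cpi / clock_hz / par_speedup
    return seq_s + par_s + downloads * download_s


CLOCK = 200e6     # 200 MHz at CPI = 1.0, from the quoted post
DOWNLOAD = 0.25   # seconds per download, from the quoted post

print(download_cost_instructions(DOWNLOAD, CLOCK))  # instructions per download

workloads = {
    "small": dict(seq_instr=10e6, par_instr=100e6),  # assumed sizes
    "large": dict(seq_instr=10e6, par_instr=1e9),    # assumed sizes
}
for name, w in workloads.items():
    hw = hardwired_loop_time(clock_hz=CLOCK, **w)
    fp = fpga_loop_time(clock_hz=CLOCK, seq_slowdown=10, par_speedup=100,
                        download_s=DOWNLOAD, **w)
    print(f"{name}: hardwired {hw:.3f} s, FPGA {fp:.3f} s")
```

With these assumed sizes the small workload loses on the FPGA no matter
how large the parallel speedup is, because the two downloads alone take
0.5 s against 0.55 s for the whole hardwired iteration plus the slowed
sequential section. Only the large workload comes out ahead.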

Tim Callahan
UC Berkeley / ICSI


alan.j.greenberger

Jul 11, 1994, 1:21:18 PM
In article <eeeCsn...@netcom.com> e...@netcom.com (Mark Thorson) writes:
>By the end of the decade, advances in FPGA density will make
>the von Neumann architecture totally obsolete. People won't
>
STUFF DELETED

>So now that I've explained it, does everyone agree this is the
>correct vision of the future? Criticism is invited.
>
>[Note: Formerly m...@cup.portal.com, but now e...@netcom.com.]

There is a publication on this subject:
P. Athanas and H. Silverman, "Processor Reconfiguration Through Instruction-
Set Metamorphosis", IEEE Computer, p. 11, March 1993

Tim Callahan

Jul 11, 1994, 3:50:01 PM
In article <eeeCsn...@netcom.com>, Mark Thorson <e...@netcom.com> wrote:
>By the end of the decade, advances in FPGA density will make
>the von Neumann architecture totally obsolete. People won't
>build hardwired CPU's when they can build soft CPU's that can
>transform themselves into the most appropriate architecture for
>the applications to which they are applied.
>
>For example, a computer animation program might need a hardware
>raster-drawing engine. A multimedia system could have a DSP-like
>architecture. A text editor might have hardware for fast drawing
>of fonts or window scrolling.

This first step (using an FPGA to implement an
application-specific processor which is then stable
for the life of the program that is running) was studied in
"Flexible Processors: A promising application-
specific processor design approach", Andrew Wolfe and John P. Shen,
MICRO '21: Proceedings of the 21st Annual Workshop on Microprogramming
and Microarchitecture, IEEE Computer Society, 1988.
The system was built using Xilinx parts.
Each application-specific processor design is done
by hand rather than automatically generated.
(I admit it would be pretty tough to have a compiler look at
100,000 lines of C code and try to deduce what the best ISA
for executing it is -- I think it's easier to look towards the next
step, where you can dynamically reconfigure the gate array
for each section of code).

Hmm, actually there are many paths of evolution. This is one --
using an FPGA to implement an application-specific processor,
and then decreasing the reconfiguration costs so that it becomes
practical to reconfigure at the subroutine and, who knows, possibly
basic-block level. An alternate path is to integrate an FPGA as a
coprocessor with a current RISC microprocessor: at first, only
highly parallel / pipelined inner loops execute on the FPGA,
but as evolution proceeds, more and more of the operations get
moved to the FPGA. This path is being taken by many,
including those working on the PRISM project,
and Andre DeHon who's at the MIT AI Lab...
sorry I don't have the exact references, but my stuff is at work.
You can probably find your way to Andre DeHon's stuff through WWW
though.

Yet another point of view is to come at it from the hardware side
-- think of your C program as an HDL, and directly compile
it to hardware (assuming you can figure out how to do recursive subroutine
calls and the like). You will end up with a huge circuit -- now just
figure some way of efficiently "paging" it onto your physical FPGA...

Thomas Charles CONWAY

Jul 11, 1994, 10:23:44 PM
Would anyone like to comment on the experience of the Lisp machines
and Prolog machines with respect to this discussion?

I don't know the details, but it seems that in the time it took to
implement a novel processor, conventional processors had gained
enough speed to more than compensate....

Thomas
--
Thomas Conway con...@cs.mu.oz.au
Decidability is like a balloon: one little hole and it vanishes.

Jeff Cunningham

Jul 12, 1994, 1:40:32 AM
In article <eeeCsn...@netcom.com>
e...@netcom.com (Mark Thorson) writes:

> By the end of the decade, advances in FPGA density will make
> the von Neumann architecture totally obsolete. People won't
> build hardwired CPU's when they can build soft CPU's that can
> transform themselves into the most appropriate architecture for
> the applications to which they are applied.
>

[...]


>
> So now that I've explained it, does everyone agree this is the
> correct vision of the future? Criticism is invited.

As someone who has designed zillions of lines of von Neumann code and
designed zillions of Xilinx-based hardware designs, it seems
intuitively obvious to me that FPGAs hold all the promise of what you
(and many others) are suggesting. As far as getting to where it is
practical, it is as they say "simply a matter of programming".
Personally, I predict this will be the biggest thing since wafer scale
integration.

-Jeff.

Michael Brady

Jul 12, 1994, 5:29:49 AM
In article <2vs7p9$i...@agate.berkeley.edu>, timo...@ICSI.Berkeley.EDU (Tim
Callahan) wrote:

> (I admit it would be pretty tough to have a compiler look at
> 100,000 lines of C code and try to deduce what the best ISA
> for executing it is -- I think it's easier to look towards the next
> step, where you can dynamically reconfigure the gate array
> for each section of code).
>
> Hmm, actually there are many paths of evolution. This is one --
> using an FPGA to implement an application-specific processor,
> and then decreasing the reconfiguration costs so that it becomes
> practical to reconfigure at the subroutine and, who knows, possibly
> basic-block level. An alternate path is to integrate an FPGA as a
> coprocessor with a current RISC microprocessor: at first, only
> highly parallel / pipelined inner loops execute on the FPGA,
> but as evolution proceeds, more and more of the operations get
> moved to the FPGA. This path is being taken by many,
> including those working on the PRISM project,
> and Andre DeHon who's at the MIT AI Lab...

I agree completely with this (though I'd appreciate a few of those
references you spoke about :-)).

It seems to me that we should not try to solve the general case - instead
we should try out a few special problems. After all, a floating point
coprocessor is a very useful very special purpose device.

If we could automagically design some very fast _simple_ special purpose
processors, that would be something. It would be as if we started with a
general purpose processor and specialised it so much that (a) it didn't
need an instruction stream and (b) was only useful for one [category of]
job.

My guess is that some short programs in Prolog or some of the functional
languages would be good places to start. One can envisage language parsers
and similar hardware machines being generated from grammars.

I really don't know how such processors could be integrated into general
computer systems though...

--
Mike Brady
Computer Science Department, Trinity College Dublin, Ireland

John McClain

Jul 12, 1994, 8:31:14 AM

In article <brady-120...@brady.cs.tcd.ie> br...@cs.tcd.ie

(Michael Brady) writes:
In article <2vs7p9$i...@agate.berkeley.edu>, timo...@ICSI.Berkeley.EDU (Tim
Callahan) wrote:

> ...This path is being taken by many, including those working on the
> PRISM project, and Andre DeHon who's at the MIT AI Lab...

I agree completely with this (though I'd appreciate a few of those
references you spoke about :-)).


For the work being done here check out:

http://www.ai.mit.edu/projects/transit/transit_home_page.html

or if you don't do WWW [you should if you can]

ftp://transit.ai.mit.edu/
--
"This sort of reasoning is the long-delayed revenge of people who
could not go to Woodstock because they had too much trig homework."
-- Stewart A. Baker, Chief Counsel for the NSA, on crypto anarchy,
Wired, June 1994

Kevin Simonson

Jul 12, 1994, 5:04:15 PM
Some time ago I sent out a posting on fast hardware sorting algorithms
and then just recently I posted on one algorithm in particular. Thanks to
everybody who responded. I got exactly the things I needed.

The first time around somebody either posted an article or sent me
mail to the effect that industry wasn't all that interested in fast
hardware sorters because all real sorting applications involved reading data
from disk and writing it sorted back to the disk, and because of that the
real bottleneck in the process was the time spent on I/O.

However, one of the responses I got the first time around seemed to
indicate that the I/O problem can be overcome. Sam Paik sent me an article
titled "AlphaSort: A RISC Machine Sort" by Nyberg et al that described a
process they called striping.

Page seven says, "It should now be clear how we solved the IO
bottleneck problem--we striped the data across many discs to get sufficient
bandwidth."

I may have this concept wrong, but I think that what the authors are
saying is that with x-wide striping one would really have (at least?) x
disks and the address space is divided up so that the first cache-sized
block's location is on the first disk, the second is on the second, the
third is on the third, and so on until the xth block's location is on the
xth disk. Then the (x+1)th block's position is back on the first disk
again, the (x+2)th's is on the second, and so on.

If x is a power of two, then which disk to go to for a read or write can
be determined by looking at the low-order bits of the block number, so it
doesn't look like this would be enormously hard to work out. And a big
plus would be that disk accesses could be done largely in parallel.
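
The round-robin layout described above can be sketched in a few lines.
This is only my reading of the scheme, not code from the AlphaSort paper:
the function names are hypothetical, and disks and blocks are numbered
from 0, so the "first disk" in the text is disk 0 here.

```python
def stripe_location(block, num_disks):
    """Round-robin striping: return (disk index, block offset on that disk)."""
    return block % num_disks, block // num_disks


def stripe_location_pow2(block, num_disks):
    """Same mapping using only bit operations; num_disks must be a power of two."""
    if num_disks <= 0 or (num_disks & (num_disks - 1)) != 0:
        raise ValueError("num_disks must be a power of two")
    shift = num_disks.bit_length() - 1       # log2(num_disks)
    disk = block & (num_disks - 1)           # low-order bits select the disk
    offset = block >> shift                  # remaining bits select the offset
    return disk, offset


# With 8-wide striping, consecutive blocks land on consecutive disks and
# wrap around after the eighth.
for block in range(10):
    print(block, stripe_location_pow2(block, 8))
```

Because any x consecutive blocks sit on x different disks, a sequential
read of the file can keep all of the disks busy at once, which is where the
bandwidth gain comes from.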

This might be expensive to implement, though Nyberg's group didn't
mention anything extraordinarily expensive about the "8-wide striping" they
used in their project. It's probably naive to assume that 16-wide striping
would give you twice as fast I/O, or that 64-wide would give you eight
times as fast, but wouldn't those figures be in the right ballpark? And,
as far as expense goes, can anyone think of reasons why somebody couldn't
implement a machine with 256 disks connected in parallel, or even a
1024-wide striping?

At some point in here, unless I'm mistaken, we're going to reach a
point where data is arriving at the hardware at speeds fast enough that a
fast hardware sorter can be kept busy. Or do we run out of money first,
trying to build the thing?

On page 1 of the same paper the authors say, "In 1985, a group of 25
database experts from a dozen companies defined three basic benchmarks to
measure the transaction processing performance of computer systems." One
of those is "a disk-to-disk sort of one million, 100-byte records. This
has become the standard test of batch and utility performance in the
database community .... Sort tests the processor's, I/O subsystem's, and
operating system's ability to move data."

On the other hand, Simon Moore pointed me to an algorithm designed by
Chen, Lum, and Tung, and improved upon by Ahn and Murray, that apparently
sorts in optimal time. It reads data in as fast as it can be read in, and
as soon as it's all read in, it outputs it in sorted order.

So I'm having a little trouble reconciling all this information. The
AlphaSort paper refers to a benchmark that everybody's trying to beat, and
yet if we get striping to a high enough degree we can read data from disk
and write it back to disk at hardware speeds, and while we're reading it in
and writing it back out, Ahn and Murray's algorithm is sorting it.

Granted we might not ever exactly reach hardware speeds, since if
we're striping 1024-wide (or more) we'll need some kind of a bus to bring
the data to the sorter, and I think that slows things down. But it still
looks like we have the potential of being precariously close to the optimal
solution.

So I guess I'm curious as to why people haven't tried to use these
concepts to implement Ahn and Murray's algorithm to beat the benchmark.

---Kevin Simonson

Mark Thorson

Jul 12, 1994, 6:53:53 PM
In article <2vmu6p$f...@vixen.cso.uiuc.edu>,
Scott Pakin <pa...@indigo-rilla.cs.uiuc.edu> wrote:
>First, density is not the only issue. Speed is also important. I'd guess that
>an FPGA-based CPU would be about an order of magnitude slower than the
>equivalent in custom VLSI. You claim that the gain would be in parallelism.

One of the features of FPGA's of the future is that they will include
optimizations for implementing integer architectures. We see that today,
in FPGA's that can support two counter bits per logic block, and carry
generate-propagate networks for accelerating addition. Future FPGA's
will have both Boolean and integer sections. The speed of carry within an
integer section will be about the same as for a full custom implementation.

I would guess that the integer sections will be bands, 8 bits wide,
interspersed with Boolean sections 5 or 6 bits wide. Probably, the
first FPGA soft microprocessors will have 8 of these integer/Boolean byte
lanes, so that 64-bit processors can be implemented at competitive clock
rates. The same architecture could, of course, be used to implement
parallel 32-bit functions, for example for a DSP architecture.

>That's fine when your application has a lot of parallelism. But what about
>sequential sections of your code? Those will run about 10 times slower than on
>a general-purpose CPU. And don't forget about the time it takes to download a
>new design. Let's say, optimistically, that it takes 1/4 second to download a
>microprocessor to an FPGA. On a 200MHz processor with CPI=1.0, that's the
>equivalent of 50 million instructions. So if your code looks like this:
>
> begin loop
> sequential section <-- 10x slower than VLSI
> download parallel section <-- +50M instructions on VLSI
> parallel section <-- ???x faster than VLSI
> download sequential section <-- +50M instructions on VLSI
> end loop
>
>there's an amortized cost issue. Your parallel section had better be *really*
>big and almost completely parallelizable in order to see an improvement over
>the hardwired microprocessor. Remember, if an FPGA-based microprocessor is ten
>times slower than a four-way superscalar hardwired machine, it had better
>sustain a minimum of 40-fold parallelism. In addition, if the parallel section
>performs a fairly common operation (e.g. FFT), it can probably be done pretty
>well already with a DSP chip.

You're assuming 10X performance degradation for a soft architecture. I would
guess that using hybrid Boolean/integer FPGA's, the performance degradation
would be about 1.2X to 1.5X.
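
The two slowdown estimates lead to very different requirements. Using the
quoted rule of thumb (a soft CPU that is N times slower than a W-way
superscalar machine must sustain N x W parallelism just to break even),
here is a small worked sketch. It assumes, as the quoted post does, that
the hardwired machine keeps all of its issue slots busy; the function name
is hypothetical.

```python
def required_parallelism(slowdown, issue_width):
    """Parallelism the FPGA design must sustain to match the hardwired CPU."""
    return slowdown * issue_width


# 10x is the quoted estimate; 1.5x and 1.2x are the estimates in this reply.
for slowdown in (10.0, 1.5, 1.2):
    needed = required_parallelism(slowdown, issue_width=4)
    print(f"{slowdown}x slower -> {needed:.1f}-fold parallelism to break even")
```

So the disagreement is between needing 40-fold parallelism and needing
roughly 5- to 6-fold parallelism against a four-way superscalar machine.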


>But let's assume that you write a program that's completely parallelizable,
>unique in operation, and can be handled by a single custom-designed
>microprocessor (i.e. you need only one FPGA download). If there's anyone else
>on the system who also has a completely parallelizable program that can be
>handled by a single, but *different*, custom-designed microprocessor, you'll
>both be killed on context switches. Speaking of which, how do you write an
>operating system for a microprocessor that keeps changing? You'd have to
>download a program to the FPGA for the OS. There goes another 100M
>instructions you could have executed on a general-purpose hardwired
>microprocessor.

Why would anyone else be trying to run a program on the computer you're
using? I don't understand that. Do you think 21st-century computers
will be doing multi-user time-sharing?

WRT context switches, I expect future FPGA's to have multiple sets of
the SRAM bits which control the interconnect and the functionality of the
logic blocks. There would be control bits or signals controlled by
hardware which would select which set of programming is used. For example,
let's say a modern Xilinx FPGA has 20 control bits per logic block.
A future FPGA might have an 8 x 20 SRAM array for holding these bits.
This allows you to have eight separate hardware contexts.

Each set of bits is a separate machine model. Each is a hardware context.
One context would probably be reserved for a 'standard' machine model,
used to run the OS, execute device drivers, etc.
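
As a rough software model of that idea (purely illustrative, not a
description of any real FPGA): each logic block holds eight 20-bit
configuration words, and a context-select signal picks which one is
active. The class and method names, and the configuration values, are
invented for the sketch.

```python
class LogicBlock:
    """One logic block with several preloaded configurations (contexts)."""

    def __init__(self, contexts=8, bits=20):
        self.bits = bits
        self.store = [0] * contexts   # models the 8 x 20 SRAM array
        self.active = 0               # models the context-select signal

    def _check(self, context):
        if not 0 <= context < len(self.store):
            raise IndexError("no such context")

    def load(self, context, word):
        """Write a configuration word into a context (the slow path)."""
        self._check(context)
        if not 0 <= word < (1 << self.bits):
            raise ValueError("configuration word does not fit")
        self.store[context] = word

    def select(self, context):
        """Switch the active context (the fast path: no reloading)."""
        self._check(context)
        self.active = context

    def config(self):
        """Configuration bits currently driving the block."""
        return self.store[self.active]


block = LogicBlock()
block.load(0, 0x12345)   # context 0: the 'standard' machine model for the OS
block.load(1, 0xABCDE)   # context 1: an application-specific model
block.select(1)          # run the application
print(hex(block.config()))
block.select(0)          # interrupt: switch back to the OS model
print(hex(block.config()))
```

The point of the model is that select() touches only the index, so a
switch between preloaded contexts costs nothing like a full download;
load() is the expensive operation and can happen in the background.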

>So let's say that there will be only one program ever running on the machine,
>just like in the old batch-processing days. Or maybe that the FPGA is so
>incredibly massive that you can fit the hardware for every section of every
>user's code. (In the latter case, a hardwired microchip could probably use the
>same quantity of chip real estate for a bigger cache or another processor.)
>Basically, let's pretend that the download time is negligible. Oh, yeah: We
>also have to pretend that we can fit a big chunk of each program's memory on
>the FPGA, too, or we'd be killed by memory bandwidth limitations. Can we
>realistically compile most of the programs we need to run such that they'd run
>faster on an FPGA-based system than on a hardwired system? Can we debug them?
>Do we have to train hordes of single-language software grunts in hardware
>design? (Ok, so that last question does not raise a technical issue, but it's
>still worth considering.)

Have you ever seen a manual for programming the patch panel of an old
punch-card based electromechanical data processor? The programming of
these systems was _nothing_ like programming a digital computer. Actually,
it's not that dissimilar from what programming future FPGA-based machines
will be like.

Enormous transformational paradigm shifts like this come along from time
to time, and they represent great opportunity for the innovators in our
industry. Great fortunes will be made, and great fortunes will be lost.
The losers don't even realize what's coming, so they won't be able to
mobilize to oppose it. The innovators see the future, and will be
motivated by the enormous profit to those who get there first.

The idea of throwing away a generation of computer programmers who've
spent years studying superscalar, superpipelined, branch prediction,
etc. is very appealing. That's how we weed out the boneheads, the
workaday schmucks who went into computers just because they thought it'd
be a good living. The smart guys, the hackers, will take to the new
technology like a duck to water.

>Don't get me wrong; I think FPGAs are really neat and, as yet, underutilized.
>I just see the flexibility vs. performance issue as too overwhelming for FPGAs
>to make less flexible designs obsolete. While by no means an original idea,
>one alternative to replacing VLSI with FPGAs is to implement the main processor
>in custom VLSI but add an FPGA co-processor. That way, sequential and
>marginally-parallel code can execute on one and massively-parallel code can
>execute on the other. I believe there's a group in France that's even writing
>compilers for that sort of machine.

I anticipate there will be an on-chip processor of some sort, probably an
enhanced 8086 architecture. But it won't have any performance-critical
responsibilities. It'll boot up the FPGA, run the power-on self-test,
and handle the interface to the 8259 interrupt controller. It might also
handle power-down modes, like shutting down sections of the FPGA array.

Henry G. Baker

Jul 13, 1994, 12:03:11 AM
In article <2vrrdc...@early-bird.think.com> ch...@Think.COM (David Chase) writes:
>pre...@noel.cs.rice.edu (Preston Briggs) writes:
>|> You gave a time frame of this decade. I'm hoping, perhaps
>|> optimistically, that we see some of the current research coming out in
>|> production form by the end of the decade. Things just don't happen
>|> quickly enough for a grand revolution in compilation and architecture,
>|> of the sort you imply, to be accomplished in 5 or 6 years.
>
>Ah, but this is only true because the people with cash to fund
>such efforts are too chicken to risk it. You could probably name
>10 sharp compiler people you want to have working with you
>if you planned to do this, and you can probably imagine that each
>of those people could be convinced with 5 years of employment at
>a particular salary -- it's hard, sure, but it's possible. At the
>outside, you're talking $5 million per year (500,000 per person,
>per year), for 5 years -- to develop a new technology. Compare
                ^^^^^^^

>that with what people spend on chip design for existing architectures,
>and that isn't too bad.
>
>Of course, the hard part is convincing the people with money that
>you've got the right 10 people, and that this is the best place
>to put their money (best return on investment), and that
>something will actually come of it. They're not interested
>in paying 10 people to play for 5 years with nothing to show.
                                 ^^^^^^^

Actually, it's much worse than that. Most VC types aren't interested
in development projects that take more than 2 years. One could
be charitable, and say that they don't want to take so much risk. Or
one could be cynical and say that they don't want to go back to the
B-School 5-year reunion w/o a big killing (IPO) under their belt.

To be bluntly honest, though, most 2 year development projects take 5
years. (A CS person's reach should exceed his grasp, or what's a
cram-down financing for?)

Anders Hedberg

Jul 13, 1994, 9:11:52 AM

>-Jeff.

I thought von Neumann computers had shared data and program memory?
What has this to do with programmable hardware?

(Right now playing with a processor hosted in an ORCA FPGA... If I had the time
I could build a processor that uses just as much hardware as the compiler thinks
a program would need... The compiler produces assembler language like MOVE, ADD & AND.
Then I just collect MOVE, ADD & AND predefined blocks of logic and let the ORCA tools
melt them together. The assembler produces the required machine code for my new processor.
Voilà! )

/Anders , who is dreaming of nice things to do if time was forever...
--
/Anders Hedberg d89...@ludd.luth.se Lulea, Sweden
I'm just another student, please correct my english !

Hugh LaMaster

Jul 13, 1994, 1:44:32 PM
In article <eeeCsu...@netcom.com>,

e...@netcom.com (Mark Thorson) writes:
|> In article <2vmu6p$f...@vixen.cso.uiuc.edu>,
|> Scott Pakin <pa...@indigo-rilla.cs.uiuc.edu> wrote:
|> >First, density is not the only issue. Speed is also important. I'd guess that
|> >an FPGA-based CPU would be about an order of magnitude slower than the
|> >equivalent in custom VLSI. You claim that the gain would be in parallelism.

This factor, 10:1, seems reasonable to me overall. The
big payoff would be doing things in hardware, with high
utilization of the hardware, which otherwise would have
to be done in software at low efficiency. e.g. DSP-like
operations. I think of an FPGA device as being a
generalized DSP.

[Some material by Mark Thorson deleted. Summary:
Future FPGA-based CPUs will be able to run serial code
at decent speed due to certain features.]

I hope so, because this will be necessary for widespread use.

[Scott Pakin's "Amdahl's Law"-type explanation deleted.]

Agreed. FPGA-based CPUs will have to be able to run serial code
reasonably fast, or they will be stuck doing DSP-like things.
Everyone seems to be agreed on that. Mark Thorson is more
optimistic about how good serial performance will be:

Mark Thorson sums up target for performance:


|> You're assuming 10X performance degradation for a soft architecture. I would
|> guess that using hybrid Boolean/integer FPGA's, the performance degradation
|> would be about 1.2X to 1.5X.

On to the next issue: Time-sharing and context switches.


|> Scott Pakin <pa...@indigo-rilla.cs.uiuc.edu> wrote:
|> >But let's assume that you write a program that's completely parallelizable,
|> >unique in operation, and can be handled by a single custom-designed
|> >microprocessor (i.e. you need only one FPGA download). If there's anyone else
|> >on the system who also has a completely parallelizable program that can be
|> >handled by a single, but *different*, custom-designed microprocessor, you'll
|> >both be killed on context switches. Speaking of which, how do you write an
|> >operating system for a microprocessor that keeps changing? You'd have to
|> >download a program to the FPGA for the OS. There goes another 100M
|> >instructions you could have executed on a general-purpose hardwired
|> >microprocessor.

*Obviously* ["said the professor" :-) ], you will have at least
two CPUs, one general-purpose [e.g. a "traditional" RISC
microprocessor-based] CPU, and one specialized FPGA-based CPU.
The operating system will run on the general purpose system.

|> Why would anyone else be trying to run a program on the computer you're
|> using? I don't understand that. Do you think 21st-century computers
|> will be doing multi-user time-sharing?

*Of course they will.* Lots of machines will be shared, because
that will serve people's needs for sharing data. And, even on
dedicated "personal computers", all OS's by the 21st century will
be multiprogramming. Even Macs. [Of course, DOS will still be
around. But it won't be what people use for, e.g. multimedia].
I'm amazed that over 10-20 years [depending on how you are counting]
of PC's, anyone could still suggest monoprogrammed systems as
viable even for single users. The evidence and tide of history
is completely against this. Just look at the gyrations that
people have gone through to simulate multiprogramming on DOS
and the Mac.

Even on totally isolated single user machines, don't you imagine
that the user might have two different programs going at the
same time in two different windows?

And, then there is that pesky problem of how to get enough
of the devices out in a marketplace full of working
applications.

Hence, the FPGA-based CPU initially will probably look like
an I/O device, e.g., like a tape drive, which only one user
can access at a time. Probably, initially, it will be linked
to the console, so that the console user has access to it,
the way certain DSPs are handled. Then, it will become
an allocated device. But, eventually, it will have to become
time-shared. However, the average time slice could
be expected to be fairly long, probably 10-40 ms minimum,
and 100-300ms would be OK in a lot of cases. If you could
context-switch/reprogram in less than 100 ms you will
probably do OK at first, and 10 ms would be fine for
most uses.

The key here is to remember the most demanding case may
be on a single-user multimedia system. Imagine realtime
video in several windows simultaneously. Imagine several
processes needing to share the FPGA-CPU, and imagine you
want smooth motion. You could easily end up needing 70-100
context switches/second.
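
A quick back-of-the-envelope check on those numbers: the fraction of each
second spent reprogramming is simply the switch rate times the
reconfiguration time. The sketch below makes the pessimistic assumption
that every context switch needs a full reprogram (no preloaded contexts);
the function names are hypothetical.

```python
def reconfig_overhead(switches_per_s, reconfig_s):
    """Fraction of wall-clock time spent reconfiguring (can exceed 1.0)."""
    return switches_per_s * reconfig_s


def max_reconfig_time(switches_per_s, budget):
    """Longest reconfiguration time that keeps overhead within `budget`."""
    return budget / switches_per_s


for rate in (10, 70, 100):
    for reconfig_ms in (100, 10, 1):
        frac = reconfig_overhead(rate, reconfig_ms / 1000.0)
        print(f"{rate:3d} switches/s at {reconfig_ms:3d} ms each: "
              f"{frac:.0%} of the time spent reconfiguring")

# Reconfiguration time needed to stay under 10% overhead at 100 switches/s.
print(max_reconfig_time(100, budget=0.10), "seconds")
```

Under that assumption, 10 ms reconfiguration is comfortable at around 10
switches per second but consumes 70-100% of the machine at 70-100 switches
per second, so the multimedia case would need reconfiguration on the order
of a millisecond or less, or some way to avoid reloading on every switch.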

|> Enormous transformational paradigm shifts like this come along from time
|> to time, and they represent great opportunity for the innovators in our
|> industry. Great fortunes will be made, and great fortunes will be lost.
|> The losers don't even realize what's coming, so they won't be able to
|> mobilize to oppose it. The innovators see the future, and will be
|> motivated by the enormous profit to those who get there first.

|> The idea of throwing away a generation of computer programmers who've
|> spent years studying superscalar, superpipelined, branch prediction,
|> etc. is very appealing. That's how we weed out the boneheads, the
|> workaday schmucks who went into computers just because they thought it'd
|> be a good living. The smart guys, the hackers, will take to the new
|> technology like a duck to water.

This approach will probably be an impediment to progress.
You need to be thinking about what the minimum steps
necessary are to get one of these things off the ground,
and generating self-sustaining revenue:

1) You need OS support. Unix, Windows/and/or/NT, Mac.
2) You need hardware support. SCSI first, because one
size fits all. Then ISA, NuBus, PCI, VLB, EISA, Sbus,
Turbochannel, SCI, Futurebus+, ... .
3) You need a programming interface. Since this is supposed
to be running whole processes, you should develop interfaces
for the more prevalent RPC interfaces out there -
ONC [Sun RPC], NT/whatever, DCE, whatever the Mac uses ...
4) You need host compilers, running on the same Unix, PC,
and Mac boxes.
5) You need a set of good demonstration applications
showing how the device can speed up the user's applications.
Everybody loves spreadsheets. How about a "client-server"
spreadsheet with a GUI front-end and a really fast back-end
running on the FPGA? And a multimedia application of some
kind - FPGAs could potentially be good at doing JPEG/etc.
decompression. And, the applications need to be *a lot*
faster on the FPGA-based CPU.
6) Then, you need to look at the performance of RISC
CPUs in the pipeline, and move fast to beat their
performance as soon as they become available.

|> >Don't get me wrong; I think FPGAs are really neat and, as yet, underutilized.
|> >I just see the flexibility vs. performance issue as too overwhelming for FPGAs
|> >to make less flexible designs obsolete. While by no means an original idea,
|> >one alternative to replacing VLSI with FPGAs is to implement the main processor
|> >in custom VLSI but add an FPGA co-processor. That way, sequential and
|> >marginally-parallel code can execute on one and massively-parallel code can
|> >execute on the other. I believe there's a group in France that's even writing
|> >compilers for that sort of machine.

I agree. I think that the FPGA-based CPUs will first be used by
people needing massive speedups on DSP-like codes. But why
haven't DSPs been more popular? Too specialized. You will
have to find a significant class of codes which are ~4X faster
on the FPGA than they would be on the latest 6-8-way-issue
superscalar 550 MHz CMOS microprocessor, and which are not
suitable for existing DSPs.

|> I anticipate there will be an on-chip processor of some sort, probably an
|> enhanced 8086 architecture. But it won't have any performance-critical
|> responsibilities. It'll boot up the FPGA, run the power-on self-test,
|> and handle the interface to the 8259 interrupt controller. It might also
|> handle power-down modes, like shutting down sections of the FPGA array.

Enhanced 8086?!?! I disagree completely, as I have stated
above. Users require high performance multiprogramming OS's,
and they need connections to high-speed networks, I/O, and graphics.
It would be a total waste to put a high-performance FPGA in a
system with no way to feed it. I want to see the FPGA be
integrated with a 1000 SPECint, 2000 SPECfp, 2000 MFLOPS,
superscalar/vector-CPU general purpose workstation, with a
1 Gbits/sec ATM network interface. Then I want it to run
some interesting applications at least 4X faster than it will
run locally or over-the-network somewhere. That will be the
competition in 3-5 years.

--
Hugh LaMaster, M/S 233-9, UUCP: ames!lamaster
NASA Ames Research Center Internet: lama...@ames.arc.nasa.gov
Moffett Field, CA 94035-1000 Or: lama...@george.arc.nasa.gov
Phone: 415/604-1056 #include <std_disclaimer.h>

Jeff Cunningham

Jul 13, 1994, 7:53:41 PM
In article <300p6o$o...@mother.ludd.luth.se>
d89...@ludd.luth.se (Anders Hedberg) writes:

> I thought von Neumann computers had shared data and program memory?
> What has this to do with programmable hardware?

I think what people here mean by non-von Neumann is a more data
flow or neural net-like implementation on an FPGA. Perhaps strictly
speaking a Harvard architecture is a non-von thing, but for all
practical purposes it has the same limitations. It has the same
processor - memory sequential execution bottleneck, only the bottleneck
is twice as wide. There are approaches that do away entirely with this
bottleneck, but the trick is mapping well known algorithms onto this
architecture, or coming up with a new programming paradigm that
exploits the architecture's capabilities and doesn't take 10 times as
long to program. The new type of computers being discussed here would
not have a program memory as such; rather the "program" is defined by
the interconnections and programming of the FPGA. This is sort of like
it is with neural nets - the programming is defined by the
interconnections between a lot of simple processing elements.
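
To make "the program is the interconnections" concrete, here is a minimal sketch. It is an illustration only, not any vendor's configuration format: it assumes each logic element is a small lookup table, and the names `LUT`, `evaluate`, and the full-adder netlist are hypothetical.

```python
class LUT:
    """A k-input lookup table; its truth table is its 'configuration'."""
    def __init__(self, inputs, truth_table):
        assert len(truth_table) == 2 ** len(inputs)
        self.inputs = inputs            # names of the nets feeding this LUT
        self.truth_table = truth_table  # one output bit per input combination

    def output(self, nets):
        index = 0
        for name in self.inputs:        # first input is the most significant bit
            index = (index << 1) | nets[name]
        return self.truth_table[index]

def evaluate(netlist, primary_inputs):
    """Evaluate LUTs in listed order (assumed to be topologically sorted)."""
    nets = dict(primary_inputs)
    for out_name, lut in netlist:
        nets[out_name] = lut.output(nets)
    return nets

# Hypothetical "program": a one-bit full adder made of two 3-input LUTs.
XOR3 = [0, 1, 1, 0, 1, 0, 0, 1]   # parity of (a, b, cin)
MAJ3 = [0, 0, 0, 1, 0, 1, 1, 1]   # majority of (a, b, cin)
full_adder = [
    ("sum",   LUT(["a", "b", "cin"], XOR3)),
    ("carry", LUT(["a", "b", "cin"], MAJ3)),
]

result = evaluate(full_adder, {"a": 1, "b": 1, "cin": 0})
```

Nothing here is fetched or decoded: changing the truth tables or the net names is the only way to change what the circuit computes.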

-Jeff.

Jeff Cunningham

Jul 13, 1994, 7:57:35 PM
In article <eeeCsu...@netcom.com>
e...@netcom.com (Mark Thorson) writes:

[...]

> I anticipate there will be an on-chip processor of some sort, probably an

> enhanced 8086 architecture....
^^^^

Yikes, Lord help us! Where'd I put that wooden stake and silver
cross... :-)

-Jeff.

Todd P. Whitesel

Jul 14, 1994, 5:34:36 AM
e...@netcom.com (Mark Thorson) writes:

>I anticipate there will be an on-chip processor of some sort, probably an
>enhanced 8086 architecture. But it won't have any performance-critical
>responsibilities. It'll boot up the FPGA, run the power-on self-test,
>and handle the interface to the 8259 interrupt controller. It might also
>handle power-down modes, like shutting down sections of the FPGA array.

Is this a revolutionary machine or just a VESA card or something?

Todd Whitesel
toddpw @ ugcs.caltech.edu

David Chase

Jul 14, 1994, 1:18:13 PM
Pardon my naivete, but what's the typical interface to an FPGA? Aren't they just
memory-mapped devices that you write to, wait a bit, and an answer pops out in
another piece of memory? (That is, after you have programmed them.)

(Assuming I'm correct) In the case of multiple processes, couldn't it be the case that
FPGAs are allocated and mapped (and protected) much like virtual memory is today?
What's the cost of a usefully-sized FPGA, compared to the cost of a usefully sized
piece of ordinary memory? You could even have (perish the thought) something like
shared FPGA libraries. If there's two processes that need access to video, why not
devise mechanisms that allow them to share a single chunk of an FPGA (obviously,
they cannot share it at exactly the same time, but they should be able to time-slice).

David Chase, speaking for myself
Thinking Machines Corp.

Donald Lindsay

Jul 14, 1994, 5:02:49 PM

In article <941931...@mulga.cs.mu.oz.au>,

Thomas Charles CONWAY <con...@munta.cs.mu.OZ.AU> wrote:
>Would anyone like to comment on the experience of the Lisp machines
>and Prolog machines with respect to this discussion?
>
>I don't know the details, but it seems that in the time it took to
>implement a novel processor, conventional processors had gained
>enough speed to more than compensate....

A Lisp machine doesn't know what program you will run: it only knows
that you wrote it in Lisp. This gives a Lisp machine a certain amount
of leverage, but not (say) 10x.

A configured FPGA machine is specific to a single program. Given
sufficient cleverness, portions of the program need not be
represented by conventional instructions: they can be spread into the
hardware structure. The programmable hardware can in theory become
state machines, data pipelines, content-addressable RAM, and so on.
Always permute your data? Hey, the data path can have a permutation
network: why not?

Under ideal circumstances, this advantage can be 10,000x. So, the
interesting question is: should the mass-market chips pick up on
this? The answer depends partly on what the mass-market system will
need done - for example, encryption or video compression. And it also
depends on the system architecture: if the CPU is memory-limited when
doing software encryption, then an accelerator will run no faster.
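
The memory-limited case can be sketched with a simple model. This is a rough illustration, assuming memory traffic and computation overlap completely so that run time is whichever is larger; the function names and all numbers are hypothetical.

```python
def run_time(bytes_moved, ops, mem_bandwidth, compute_rate):
    """Seconds for a kernel when memory traffic and computation overlap fully."""
    memory_time = bytes_moved / mem_bandwidth
    compute_time = ops / compute_rate
    return max(memory_time, compute_time)

def speedup(bytes_moved, ops, mem_bandwidth, cpu_rate, accel_factor):
    """Speedup from making only the compute side accel_factor times faster."""
    base = run_time(bytes_moved, ops, mem_bandwidth, cpu_rate)
    fast = run_time(bytes_moved, ops, mem_bandwidth, cpu_rate * accel_factor)
    return base / fast

# Hypothetical workload: 100 MB moved over a 100 MB/s memory system.
memory_bound = speedup(100e6, 50e6, 100e6, 100e6, 10_000)   # CPU already waits on memory
compute_bound = speedup(100e6, 1e9, 100e6, 100e6, 10_000)   # CPU was the limit
```

In this model the memory-bound case gains nothing from a 10,000x faster engine, and the compute-bound case gains only until it hits the same memory ceiling.
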
--
Don D.C.Lindsay Carnegie Mellon Computer Science

Jeff Cunningham

Jul 14, 1994, 8:44:55 PM
In article <303s0l...@early-bird.think.com>
ch...@Think.COM (David Chase) writes:

Consider Xilinx FPGAs: They can be loaded in several modes, the most
appropriate in this case would be peripheral mode, which is a single
memory mapped location that the CPU writes bytes to. It takes several
microseconds for the Xilinx to swallow and digest each configuration
byte, so downloading programs to them is orders of magnitude slower
than doing DMA from a disk, for instance. Since reconfigurable
computers have multiple Xilinxes, you could play clever games with
interleaving their configuration streams to get the bandwidth from the
CPU up, but approaching I/O rates of swapping in today's virtual memory
OS's would be quite difficult. Today's FPGAs are sold as logic devices
(a huge and incredibly fast growing market by the way) where the speed
of loading them is not all that important. I imagine that it would be
straightforward to design FPGAs with downloading time as fast as any
memory (with the usual tradeoffs in other areas such as power
consumption, size (i.e. cost) of the chip, etc.).
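
A back-of-the-envelope sketch of the interleaving idea follows. Only "several microseconds per byte" comes from the text above (taken here as 5 us); the chip count, bitstream size, and CPU write time are hypothetical, as are the function names.

```python
def serial_load_seconds(chips, bytes_per_chip, us_per_byte):
    """Load the chips one after another through a byte-wide port."""
    return chips * bytes_per_chip * us_per_byte / 1_000_000

def interleaved_load_seconds(chips, bytes_per_chip, us_per_byte, cpu_write_us):
    """Round-robin one byte to each chip; a chip digests its byte while
    the CPU is busy writing to the others."""
    round_us = max(us_per_byte, chips * cpu_write_us)
    return bytes_per_chip * round_us / 1_000_000

# Hypothetical board: 8 chips, 50,000 configuration bytes each,
# 5 us for a chip to digest a byte, 0.5 us per CPU write.
serial = serial_load_seconds(8, 50_000, 5)
interleaved = interleaved_load_seconds(8, 50_000, 5, 0.5)
```

Under these assumptions interleaving brings a full reload from about 2 seconds down to about a quarter of a second, which is still far slower than paging memory in from disk.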

As far as the cost of FPGAs, they range from $10 to many $100's. A
Xilinx 4025 (their latest) probably costs well over $600 and has 2560
single bit data storage elements (not counting the configuration or
program storage), a thousand or two simple (4 or 5 input) boolean logic
functions, and 256 IO pins for wiring up to dedicated memories, a host
processor or other Xilinxes. A low end 3020 costs $10 and has 256 bits
of storage, around 100 simple boolean logic functions and 64 IOs. I
have seen configurable computers made from lots of 3064s (I think) that
cost around $40 each and have almost 4 times as much logic and storage as a
3020. Each Xilinx in this computer I recall was connected to its own
32K x 8 static RAM memory. Bear in mind that it is hard to get over
80-90% usage of logic elements on a Xilinx because of problems
optimizing the programmable routing that connects all the elements
together.

As far as the cost of memories, a 16 Mbit DRAM chip costs around $80
and has a cycle time of around 120ns. A fast static RAM, say a 256 Kbit
35ns chip can be had for $5-10. Larger and faster ones (e.g. 4 Mbit
35ns, 256 Kbit 7ns) are available for significantly more money (like
$100 and up). So, cheap memory = .0005 cents/bit, fast memory = .05
cents/bit is a good approximation.
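
The per-bit figures can be checked with a few lines of arithmetic. This sketch uses only the prices quoted above and assumes binary megabits and kilobits; `cents_per_bit` is a hypothetical helper.

```python
KBIT = 1024
MBIT = 1024 * 1024

def cents_per_bit(dollars, bits):
    """Price per bit in cents."""
    return dollars * 100 / bits

dram = cents_per_bit(80, 16 * MBIT)          # 16 Mbit DRAM at $80
sram_35ns = cents_per_bit(10, 256 * KBIT)    # 256 Kbit 35ns SRAM at the $10 end
sram_7ns = cents_per_bit(100, 256 * KBIT)    # 256 Kbit 7ns SRAM at $100

print(f"DRAM      {dram:.5f} cents/bit")
print(f"SRAM 35ns {sram_35ns:.5f} cents/bit")
print(f"SRAM 7ns  {sram_7ns:.5f} cents/bit")
```

By this arithmetic the DRAM figure of .0005 cents/bit holds up. The $5-10, 35ns SRAM works out to a few thousandths of a cent per bit, so the .05 cents/bit figure is closer to the $100, 7ns class of part.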

Jeff.

HALLAM-BAKER Phillip

Jul 15, 1994, 8:04:25 AM

Last I heard he had been dead for years.....

|>Yikes, Lord help us! Where'd I put that wooden stake and silver
|>cross... :-)

Hmm. one must always seek to cover all eventualities I suppose.

--
Phillip M. Hallam-Baker

Not Speaking for anyone else.

Chuck Narad

Jul 15, 1994, 2:56:37 PM

Burroughs also sold one in that timeframe, I think it was
called the B1000 or B1200? anyway, it was a nanocoded engine,
as far as I know it was only used as a research vehicle.

cheers,
chuck/

-----------------------------------------------------------
| Chuck Narad -- diver/adventurer/engineer |
| |
| "The universe is full of magical things, patiently |
| waiting for our wits to grow sharper." |
| |
| -- Eden Phillpotts |
| |
-----------------------------------------------------------

Robert Herndon

Jul 15, 1994, 5:38:52 PM

I think Mark Thorson's prediction of the coming extinction of
general-purpose microprocessors is very premature.

At the moment, we don't even see anything vaguely resembling what
he predicts will become the dominant species in the computing ecosystem.
True, one can get various commodity micros customized with a few
instructions, at reasonable-to-expensive prices. There are even an
awful lot of those micros out there. There are not that many _types_,
however.

I think the real problem here is simply complexity. If we had
anything like the suggested architectures, we could simply add a few
instructions -- David DesJardins would add in a set of low-precision
floating point operators, Herman Rubin would add in a set of integer
arithmetics, I'd add in leading-zero-count and population-count, and
I can imagine other folks adding bias-N arithmetic, sign-magnitude
math, DES ECB/CBC encryption/decryption, etc., and everybody might be
happier.
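
For a sense of what two of these wished-for instructions replace, here is a minimal software sketch of population-count and leading-zero-count. The loop-based versions and the 32-bit default width are assumptions chosen for illustration; each loop would collapse into a single operation if the hardware provided it.

```python
def population_count(x, width=32):
    """Count set bits by clearing the lowest set bit on each pass."""
    x &= (1 << width) - 1
    count = 0
    while x:
        x &= x - 1
        count += 1
    return count

def leading_zero_count(x, width=32):
    """Count zero bits above the highest set bit, scanning from the top."""
    x &= (1 << width) - 1
    count = 0
    bit = 1 << (width - 1)
    while bit and not (x & bit):
        count += 1
        bit >>= 1
    return count
```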

Unfortunately, actually formally describing the operations desired,
and doing it right, takes hard work, and so does realizing that formal
description in (virtual) silicon. I don't imagine that compilers are
going to be taking the initiative for that sort of thing anytime soon
-- at least not any compiler that looks much like compilers do now.
(Maybe PL/I: if it sees enough funny accuracy types declared, it could
add in appropriate FP/integer ops?)

Automating the extraction of operations (other than the obvious ones
for bit sets, floating point numbers, integers, etc.,) that would make
computations go faster seems, to me, to be beyond the realm of reasonable.
I can imagine that if a peep-hole optimizer saw lots of specific operation
sequences, it might try to combine operations, but that would be about
the limit. Even then, if it can't find sequences longer than three or
four instructions, I think there is little hope for improved speed.
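
The kind of search a peep-hole pass might do can be sketched as a count of repeated opcode sequences. This is only an illustration: the trace is made up, `frequent_sequences` is a hypothetical name, and overlapping windows are counted naively.

```python
from collections import Counter

def frequent_sequences(trace, length, min_count):
    """Return (sequence, count) pairs for opcode runs seen at least min_count times."""
    windows = (tuple(trace[i:i + length]) for i in range(len(trace) - length + 1))
    counts = Counter(windows)
    return [(seq, n) for seq, n in counts.most_common() if n >= min_count]

# Hypothetical instruction trace.
trace = ["load", "mul", "add", "store",
         "load", "mul", "add", "store",
         "load", "add", "store"]
candidates = frequent_sequences(trace, length=3, min_count=2)
```

Finding the candidates is the easy part; deciding which of them is worth a fused operation in hardware is the hard work described above.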

I can readily imagine such processors being built, and I would even
like to see them. But the idea reminds me much of the writable control
stores of some '70s and '80s CPUs -- a great idea -- but only a few
people used it. (How many people or sites do you know of that used
anything other than stock VAX microcode in their VAXen? For anything
other than educational purposes?) And the performance gains in general-
purpose micros soon obsoleted any real performance improvements.

Don't get me wrong -- I think there will likely be a niche for this
type of configurable microprocessor. It already exists, and is reasonably
populated, although I'm not aware of any members that can be dynamically
configured. But I do not think it will become a dominant species, much
less push general-purpose micros into extinction.

Robert Herndon
(My opinions only.)

Mark Rosenbaum

Jul 16, 1994, 1:18:25 PM
In article <1994Jul12.2...@beaver.cs.washington.edu>,

Kevin Simonson <simo...@cs.washington.edu> wrote:
> Some time ago I sent out a posting on fast hardware sorting algorithms
>and then just recently I posted on one algorithm in particular. Thanks to
>everybody who responded. I got exactly the things I needed.
>
> The first time around somebody either posted an article or sent me
>mail to the effect that industry wasn't all that interested in fast hard-
>ware sorters because all real sorting applications involved reading data
>from disk and writing it sorted back to the disk, and because of that the
>real bottleneck in the process was the time spent on I/O.
>
> However, one of the responses I got the first time around seemed to
>indicate that the I/O problem can be overcome. Sam Paik sent me an article
>titled "AlphaSort: A RISC Machine Sort" by Nyberg et al that described a
>process they called striping.

Real world sorting has many more requirements than what you have
mentioned. Some of these include checkpoint restart, inplace vs
copied, and stability. If you are sorting your data for several
days and the system goes down, as we all know they will, you
do not want to start over from the beginning. If your data is
several terabytes you may not want to have to copy it in order
to sort it so in place sorting would be nice. Finally if two
records have identical keys but are different records you may or
may not want them to be in the same order relative to each other
after sorting. Additionally the sort may not be disk based but
may in fact be tape based.
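
The stability requirement is easy to show in a few lines. This sketch relies on Python's built-in sort being stable; the records are made up.

```python
# Each record is (key, arrival order); two records share the key "smith".
records = [("smith", 1), ("jones", 2), ("smith", 3)]

# Sort on the key only. A stable sort keeps ("smith", 1) ahead of ("smith", 3).
by_key = sorted(records, key=lambda record: record[0])
```

A hardware sorter that compares keys only gives no such guarantee unless it is designed for it.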

>
> Page seven says, "It should now be clear how we solved the IO bottle-
>neck problem--we striped the data across many discs to get sufficient band-
>width."
>
> I may have this concept wrong, but I think that what the authors are
>saying is that with x-wide striping one would really have (at least?) x
>disks and the address space is divided up so that the first cache-sized
>block's location is on the first disk, the second is on the second, the
>third is on the third, and so on until the xth block's location is on the
>xth disk. Then the (x+1)th block's position is back on the first disk
>again, the (x+2)th's is on the second, and so on.
>
> If x is a power of two, then which disk to go to to read or write can
>be determined by looking at the first few bits of the address, so it
>doesn't look like this would be enormously hard to work out. And a big
>plus would be that disk accesses could be done largely in parallel.
>
> This might be expensive to implement, though Nyberg's group didn't
>mention anything extraordinarily expensive about the "8-wide striping" they
>used in their project. It's probably naive to assume that 16-wide striping
>would give you twice as fast I/O, or that 64-wide would give you eight
>times as fast, but wouldn't those figures be in the right ballpark? And,
>as far as expense goes, can anyone think of reasons why somebody couldn't
>implement a machine with 256 disks connected in parallel, or even a
>1024-wide striping?

I believe what you are describing here is RAID 0. I never can remember
the RAID numbers, but disk striping has been around for many years now.
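
The round-robin layout in the quoted text can be sketched as follows. The function names are hypothetical; the second version assumes the number of disks is a power of two, in which case the low-order bits of the block number select the disk.

```python
def locate_block(block_number, num_disks):
    """Round-robin striping: return (disk index, block offset on that disk)."""
    return block_number % num_disks, block_number // num_disks

def locate_block_pow2(block_number, log2_disks):
    """Same mapping for 2**log2_disks disks, using a mask and a shift."""
    mask = (1 << log2_disks) - 1
    return block_number & mask, block_number >> log2_disks
```

With 8 disks, blocks 0 through 7 land on disks 0 through 7, and block 8 wraps back to disk 0 at the next offset.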

In the real world things like memory and disk caching (sp) can have
more impact on run times than slight improvements in algorithms.
Also while sorting is important there probably is not enough of
it to justify the investment that would be needed to keep up with
standard micros. Look at the LISP machine market as an example.

>
> ---Kevin Simonson

Hope this helps

mjr

Vernon Schryver

Jul 16, 1994, 10:12:44 AM
In article <eeeCsu...@netcom.com> e...@netcom.com (Mark Thorson) writes:
> ...
>Have you ever seen a manual for programming the patch panel of an old
>punch-card based electromechanical data processor? The programming of
>these systems was _nothing_ like programming a digital computer. Actually,
>it's not that unsimilar from what programming future FPGA-based machines
>will be like.

I think that is entirely wrong. I think programming those things was
very much like assembly language or perhaps microcode programming, and
rather like hand coding for modern machines where all of the guts are
exposed. (I saw and touched patch panels, not manuals for them, but
not much, and not for about 22 years.)

>Enormous transformational paradigm shifts like this come along from time
>to time, and they represent great opportunity for the innovators in our
>industry. Great fortunes will be made, and great fortunes will be lost.
>The losers don't even realize what's coming, so they won't be able to
>mobilize to oppose it. The innovators see the future, and will be
>motivated by the enormous profit to those who get there first.
>
>The idea of throwing away a generation of computer programmers who've
>spent years studying superscalar, superpipelined, branch prediction,
>etc. is very appealing. That's how we weed out the boneheads, the
>workaday schmucks who went into computers just because they thought it'd
>be a good living. The smart guys, the hackers, will take to the new
>technology like a duck to water.

> ....

I'm sorry, but the last chance that could have happened in the computer
industry was at least 10 years, and probably 15 or 20 years ago. The
industry is far too big and "mature" now. There are too many marketing
mavens of nonsense and career Information Services executives to allow
genuine changes to happen in less than a generation (30 years).

This thread is an example of that sad fact. A straight text substitution
of "functional programming" for "FPGA" in this thread makes it identical
to threads that appeared in comp.arch and the trade press long ago.
None are complete nonsense, but they are the same as The Advances That
Will Soon Revolutionize Technology that appear in "Popular Science" and
"Scientific American". Remember inertially confined fusion, where very
large lasers were to explode the equivalent of 15 kg of TNT every few
seconds, generating a neutron burst in the middle of a 20 foot deep
vortex of molten lithium, about 30 feet from the gadget that was freezing
and dropping the pellets of frozen hydrogen down the middle of the
vortex? Without any induced radioactivity in anything except the lithium
and where everything would just keep on working for decades without
cracking or otherwise being worn out by 10,000,000's of artillery shell
equivalents going off in the middle of things?

Functional programming, general purpose FPGA computers, and that kind
of fusion are similar. They work fine on paper, in the popular press,
grant requests, and on seminar overhead projectors, provided you make a
few simplifying assumptions about minor implementation difficulties.

90% of all programmers have never heard of "superscalar, superpipelined,
branch prediction, etc." and 99% could not define 3. Practically all
programmers are work-a-day people who use COBOL, Fortran, C, C++, MUMPS,
assembly language, and RPG, and neither know nor care how things work.
That has been true for more than 30 years. It is not going to change
significantly in the next 60 years. Even if the machines become
intelligent, someone (something) will still have to describe what needs
to be done to compute our (and their) payroll taxes and Von Neumann
programming languages are the best things we'll have for specifying and
communicating the specifications of algorithms and processes for a long
time.


Vernon Schryver v...@rhyolite.com

Othman bin Hj. Ahmad

Jul 17, 1994, 6:39:11 AM
r...@craycos.com (Robert Herndon) writes:

>
> I think Mark Thorson's prediction of the coming extinction of
> general-purpose microprocessors is very premature.
>
> At the moment, we don't even see anything vaguely resembling what
> he predicts will become the dominant species in the computing ecosystem.
> True, one can get various commodity micros customized with a few
> instructions, at reasonable-to-expensive prices. There are even an
> awful lot of those micros out there. There are not that many _types_,
> however.

What do we mean by general purpose here? I'd predict that there will
be disagreements as to what counts as general purpose.

My definition, no doubt biased towards certain applications, is that
general purpose equates to easy programming using high-level languages,
which means that conditional and branching instructions are the minimum
requirements.
Therefore general purpose micros are more suitable to solving
problems requiring intelligence.

>
> I think the real problem here is simply complexity. If we had
> anything like the suggested architectures, we could simply add a few
> instructions -- David DesJardins would add in a set of low-precision
> floating point operators, Herman Rubin would add in a set of integer
> arithmetics, I'd add in leading-zero-count and population-count, and
> I can imagine other folks adding bias-N arithmetic, sign-magnitude
> math, DES ECB/CBC encryption/decryption, etc., and everybody might be
> happier.

These problems are best solved using specific micros with little need
for branches. If we have these dedicated instructions, who needs
additional programming, but we usually do need simple programming.

> Don't get me wrong -- I think there will likely be a niche for this
> type of configurable microprocessor. It already exists, and is reasonably

What do we mean by reconfigurable?


> populated, although I'm not aware of any members that can be dynamically
> configured. But I do not think it will become a dominant species, much
> less push general-purpose micros into extinction.

It may survive as the core in configurable microprocessors. After all,
we would still need branches even if it were configurable.
Incorporating conditional branches into random logic may be too
expensive IMHO.


Disclaimer: I only speak for myself.

SABAH is Heaven.

David M. Koppelman

Jul 18, 1994, 1:26:52 PM
An FPGA processor might be useful for digital TV sets. The television
signal could periodically contain configuration information for the
processor. The FPGA could be changed frequently (with each scene,
commercial, etc.) or infrequently (every few months). This would be
useful because an FPGA would be faster than a general purpose
processor for video signal processing. Of course, a digital TV would
use a specialized made-for-TV signal processor. But because
compression technology might advance fast, such a signal processor
might quickly become obsolete. An FPGA, in contrast, could be
routinely reprogrammed. This might make up for the performance difference
between an FPGA and a fully custom design. Only one problem: imagine
the viruses that a pirate station could implant in your TV. :-)

Christian Iseli

Jul 19, 1994, 9:57:10 AM
I have followed this thread with much interest, as my PhD thesis project
involves the design of a reconfigurable processor. These are my $0.02
for this discussion...

> From e...@netcom.com (Mark Thorson)


> By the end of the decade, advances in FPGA density will make
> the von Neumann architecture totally obsolete. People won't
> build hardwired CPU's when they can build soft CPU's that can
> transform themselves into the most appropriate architecture for
> the applications to which they are applied.

I'm not so sure about this: a hardwired CPU will still run faster than
the same CPU built in an FPGA. I think a conventional CPU will still
have its uses: to run the OS, to configure the FPGA coprocessor ( :-) ), etc.

I think that the FPGA processor will be used as a coprocessor to
speed up the actual computation of the user applications.

> From e...@netcom.com (Mark Thorson)
> Why will FPGA-based CPU's be faster than hardwired CPU's?
> They will have greater parallelism. At best, superscalar
> techniques seem to result in a real-world performance of
> about two instructions per clock. An FPGA-based CPU,
> however, has no restrictions on the simultaneous operation of
> its parts. It resembles a circuit board more than an
> instruction-list processor.

The problem these days is not so much making the processor faster
as feeding it at the required speed. That's why we see 64-bit
architectures hitting the market now. However, a conventional
processor has a lot of trouble handling a 64-bit word of data
in a useful way (adding two 64-bit numbers is not often useful
except for address computation). By contrast, putting several data
elements in a 128-bit word and having an FPGA handle them in parallel
can be a major win.
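
As a small illustration of packing several data elements into one wide word, here is a sketch that adds eight 16-bit lanes held in a single 128-bit integer without letting carries cross lane boundaries. The lane width and the helper names are assumptions; an FPGA would do the same job with eight independent adders.

```python
def pack(values, lane_bits=16):
    """Pack a list of small integers into one wide word, lane 0 lowest."""
    mask = (1 << lane_bits) - 1
    word = 0
    for i, v in enumerate(values):
        word |= (v & mask) << (i * lane_bits)
    return word

def unpack(word, lanes, lane_bits=16):
    mask = (1 << lane_bits) - 1
    return [(word >> (i * lane_bits)) & mask for i in range(lanes)]

def packed_add(a, b, lanes, lane_bits=16):
    """Lane-wise add modulo 2**lane_bits using one wide addition."""
    high = pack([1 << (lane_bits - 1)] * lanes, lane_bits)        # top bit of each lane
    low = pack([(1 << (lane_bits - 1)) - 1] * lanes, lane_bits)   # all other bits
    # Add the low bits (carries stay inside each lane), then fix up the top bits.
    return ((a & low) + (b & low)) ^ ((a ^ b) & high)

a = pack([1, 0xFFFF, 300, 40000, 5, 6, 7, 8])
b = pack([2, 1, 500, 40000, 1, 1, 1, 1])
sums = unpack(packed_add(a, b, lanes=8), lanes=8)
```

Each lane wraps around on overflow independently, so 0xFFFF + 1 gives 0 in its own lane and leaves the neighbouring lane untouched.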

> From e...@netcom.com (Mark Thorson)
> How will this change the programmer's interface? Compilers
> will need to deal with the machine at a much lower level of
> interface. These compilers will evolve from present-day hardware
> description languages like VHDL. They probably will be visual
> programming languages, making the floorplan of the machine
> accessible to the programmer.

Programming those beasts is the big problem, I think. It would take
an incredibly intelligent compiler to generate a usable and fast
FPGA configuration from a plain FORTRAN or C program, and I'm afraid
we are stuck with these languages for a long time if we want these
FPGA processors in wide use. The secret weapon will probably be
called "extension" though. There could probably be a FORTRAN-99
including extensions for FPGA computers :-)

> From pre...@noel.cs.rice.edu (Preston Briggs)
> I don't agree. Things may change, and FPGAs may become amazing dense,
> but you'll still be waiting on compilers at the end of the decade.
>
> From pre...@noel.cs.rice.edu (Preston Briggs)
> Currently, we obtain great (and I believe necessary) simplifications
> by fixing many pieces of the overall problem. We assume fixed-sized
> integers and pointers, certain FP formats, registers, cache, etc.
> These assumptions give us a standard environment to work with. If you
> throw all these assumptions away and attempt to create optimized
> hardware for each new program, you'll only make the (already
> intractible) compilation problems much more difficult.

I think that's correct.

The approach I have taken to tame the complexity of FPGA
processors is to give them some structure. It's always easier to
reason about something which has a structure. A structure I know
fairly well is the structure of a processor. I think I can reasonably
expect that, at least for some applications, the structure given
to an FPGA processor would not be too much different from the
structure of a conventional CPU. So instead of having a completely
configurable processor, I decided that only the execution units of
the processor should be reconfigurable. The processor is based on
VLIW architecture and microprogrammed processors. There is a
control unit that has a program counter and issues operations through
a microcode memory. There are a (very) large number of registers,
a memory controller and a bunch of execution units. The execution
units are implemented each with an FPGA, and operate only on the
registers. The task of programming this beast is thus divided
into three parts: program operators to put in the execution units,
write the microcode to feed the execution units with data from the
main memory, and write the controlling program running on the
host processor (to feed data to the coprocessor, display the
results, etc.) All these parts can be programmed in a high-level
language with restrictions and/or extensions. I'm currently writing
compilers that use C as the base language. The advantage is that
you go back to known problems of compilation: transforming a subset
of C (to describe combinational or sequential operators) to hardware
is not that difficult and I have a working prototype; compiling
C code to a VLIW architecture is also feasible and compilers to
compile the program for the host processor already exist.
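
The division of labour described above can be sketched as a toy model. It is a loose illustration only, not the actual design: the class name, the microcode word format, and the two-unit configuration are all hypothetical, and branching, the memory controller, and the host program are left out.

```python
import operator

class Coprocessor:
    def __init__(self, num_registers, units):
        self.regs = [0] * num_registers
        self.units = units   # one function per execution unit: its "configuration"

    def run(self, microcode):
        """Each microcode word has one slot per unit: None for idle,
        or (dest, src_a, src_b) register indices."""
        for word in microcode:               # implicit program counter
            results = []
            for unit, slot in zip(self.units, word):
                if slot is not None:
                    dest, a, b = slot
                    results.append((dest, unit(self.regs[a], self.regs[b])))
            # Write back after all units have read, so units in one word
            # behave as if they ran in parallel.
            for dest, value in results:
                self.regs[dest] = value

# Hypothetical configuration: one adder unit and one multiplier unit.
cop = Coprocessor(num_registers=8, units=[operator.add, operator.mul])
cop.regs[0:3] = [3, 4, 5]
cop.run([
    [(3, 0, 1), (4, 1, 2)],   # r3 = r0 + r1 and r4 = r1 * r2 in the same word
    [(5, 3, 4), None],        # r5 = r3 + r4; multiplier idle
])
```

Swapping the functions in `units` stands in for reconfiguring the FPGAs, while the microcode and the register file stay fixed.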

> From con...@munta.cs.mu.OZ.AU (Thomas Charles CONWAY)


> Would anyone like to comment on the experience of the Lisp machines
> and Prolog machines with respect to this discussion?
>
> I don't know the details, but it seems that in the time it took to
> implement a novel processor, conventional processors had gained
> enough speed to more than compensate....

The new thing is that as the conventional processors evolve, so do
the FPGAs because it's the same kind of companies that build them.
If this is done smartly, you get new, faster, bigger, pin compatible
FPGAs which you use to replace your old parts, recompile your
program and it runs (hopefully :-) faster...


If you'd like further information on my research project, here are a
couple references:

Christian Iseli & Eduardo Sanchez, "Beyond Superscalar Using FPGAs",
IEEE International Conference on Computer Design, Cambridge Mass.,
October 1993.

Christian Iseli & Eduardo Sanchez, "Spyder: A SURE (SUperscalar and
REconfigurable) Processor", The Journal of Supercomputing, to be
published in autumn 1994.

Please feel free to contact me if you have trouble locating them.
Any comment welcome.

--

Christian Iseli
LSL-DI-EPFL
Lausanne, Switzerland

Tom Biggs

Jul 19, 1994, 12:25:42 PM

In article 6...@rox.craycos.com, r...@craycos.com (Robert Herndon) writes:
>
> I think Mark Thorson's prediction of the coming extinction of
>general-purpose microprocessors is very premature.

> I can readily imagine such processors being built, and I would even


>like to see them. But the idea reminds me much of the writable control
>stores of some '70s and '80s CPUs -- a great idea -- but only a few
>people used it. (How many people or sites do you know of that used
>anything other than stock VAX microcode in their VAXen? For anything
>other than educational purposes?) And the performance gains in general-
>purpose micros soon obsoleted any real performance improvements.
>
> Don't get me wrong -- I think there will likely be a niche for this
>type of configurable microprocessor. It already exists, and is reasonably
>populated, although I'm not aware of any members that can be dynamically
>configured. But I do not think it will become a dominant species, much
>less push general-purpose micros into extinction.
>
>Robert Herndon
>(My opinions only.)


Doesn't the Alpha have a writable control store? Actually it is called
PALcode. I think it allows the programmer to create custom instructions, with
complete control of the chip's resources. I am not sure of the details, but I
think it was created so that the Alpha could emulate the VAX instruction set.

Does anyone actually use the PALcode for speeding up algorithms, or is this
even possible?


-tom
bi...@mothra.msd.lmsc.lockheed.com

Clive...@armltd.co.uk

Jul 20, 1994, 5:30:13 AM
Hmm. Possibly the only realistic widespread use for FPGA processing
engines is in the games console market? At least there you commit the
entire system to a single program!

Being able to have an MPEG engine for one game, and a polygon engine
for the next would be handy.

BUT: It's *still* not clear that the loss in transistor density and
speed caused by going from ASIC to FPGA is less than that caused by
going from ASIC to general-purpose CPU.

--Clive.
(Disclaimer: I wouldn't believe a word of this if I were you...)

Tom Kopec NE1G

Jul 20, 1994, 11:17:03 AM

PALcode really isn't much like writeable control store..

Actually, depending on your definition of RISC, the concept of "writeable
control store" probably doesn't make sense on a RISC..

..tom
+===
Tom Kopec NE1G
Digital Equipment Corporation
Assistive Technology Group
Maynard, MA
+===
The opinions and comments expressed herein are my own and rarely, if ever,
reflect those of my employer.
+===

John G Dobnick

unread,
Jul 20, 1994, 4:44:43 PM7/20/94
to
From article <eeeCsn...@netcom.com>, by e...@netcom.com (Mark Thorson):

> By the end of the decade, advances in FPGA density will make
> the von Neumann architecture totally obsolete. People won't
> build hardwired CPU's when they can build soft CPU's that can
> transform themselves into the most appropriate architecture for
> the applications to which they are applied.
>
> For example, a computer animation program might need a hardware
> raster-drawing engine. A multimedia system could have a DSP-like
> architecture. A text editor might have hardware for fast drawing
> of fonts or window scrolling.

The "soft CPU" has already been done. 20+ years ago, in the commercial
marketplace. Look at the Burroughs 1700 series.

If I understand your thesis, this already did everything you are asking
for.

[Is it really true... "There ain't nothing new under the sun?"]

--
John G Dobnick "Knowing how things work is the basis
Computing Services Division for appreciation, and is thus a
University of Wisconsin - Milwaukee source of civilized delight."
j...@uwm.edu ATTnet: (414) 229-5727 -- William Safire

John G Dobnick

unread,
Jul 20, 1994, 4:53:03 PM7/20/94
to
From article <306m55$b...@fido.asd.sgi.com>, by na...@nudibranch.asd.sgi.com (Chuck Narad):

>
> burroughs also sold one in that timeframe, I think it was
> called the B1000 or B1200? anyway, it was a nanocoded engine,
> far as I know it was only used as a research vehicle.

B1700 series. (There was even a model in a red/white/blue cabinet, the
model B1776 -- for the bi-centennial! :-))

Various models were sold commercially. Never really caught on, though...
maybe it was just too _strange_. ("Strange" seems to be a common adjective
for Burroughs machines.)

Ronald G Minnich

unread,
Jul 20, 1994, 8:44:28 PM7/20/94
to
In article <30k2b...@uwm.edu>, John G Dobnick <j...@csd.uwm.edu> wrote:
>The "soft CPU" has already been done. 20+ years ago, in the commercial
>marketplace. Look at the Burroughs 1700 series.
>If I understand your thesis, this already did everything you are asking
>for.

Well, since I used to work for the company that did the B1700, and even have
one manual left, and now work for the outfit that's built FPGA architectures,
I can tell you that the "soft" FPGA architectures and the B1700 are about
as different as two things can be. For those who believe there is a similarity,
I suggest a careful reading of the B1700 description, followed by a
look at the Jan. 1991 IEEE Computer paper on Splash. The only "similarity"
is that there were aspects of the system that were changeable from app to app.
But it doesn't really make sense to group the two together.

--
rmin...@super.org | Error message of the week:
(301)-805-7451 or 7312 | **Error: vhdlsim,1:
| Array subscript out of bounds.
| (null)((null)): (null)

msan...@delphi.com

unread,
Jul 21, 1994, 7:57:12 PM7/21/94
to
John G Dobnick <j...@alpha1.csd.uwm.edu> writes:

>The "soft CPU" has already been done. 20+ years ago, in the commercial
>marketplace. Look at the Burroughs 1700 series.
>

OK, the soft CPU was done in a large machine 20+ years ago. Many of these
discussions consider a reprogrammable CPU, or at least reconfigurable instructions
(i.e. the core stays the same and an area is set aside for some "special purpose").
This sounds like a great idea, but the first step would be to model a simple
CPU in an FPGA and look to expand its functionality as bigger and faster
FPGA chips/architectures come out. Has anyone actually built a basic CPU in
today's FPGA's? While they may need to advance further for the more advanced
functions, it should be possible to create a slow 8-bit processor now.
Has this, or any more advanced chip, been accomplished?

David M. Koppelman

unread,
Jul 21, 1994, 6:43:40 PM7/21/94
to

>Doesn't the Alpha have a writable control store? Actually it is called
>PALcode. I think it allows the programmer to create custom instructions, with
>complete control of the chip's resources. I am not sure of the details, but I
>think it was created so that the Alpha could emulate the VAX instruction set.
>
>Does anyone actually use the PALcode for speeding up algorithms, or is this
>even possible?
>
>
> -tom
> bi...@mothra.msd.lmsc.lockheed.com

According to an article by Richard Sites in the
February 1993 Communications of the ACM (vol. 36, no. 2, pp.33-)
PALcode is written in instructions which have access to architecturally
invisible parts of the machine state. (E.g., registers that non-system
programmers cannot access.) PALcode would be used for implementing
parts of the OS. I didn't see anything about a writable microcode.

However, another DEC chip, the NVAX, does have a patchable control
store (PCS). An item written in the PCS can replace a word of the
microcode (as though the micro ROM were overwritten). The patchable
control store is supposed to be used to fix minor bugs. The information
on the NVAX can be found in the Summer 1992 issue of the Digital
Technical Journal.
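
A minimal sketch of the patch mechanism as described above, assuming a simple address-to-word lookup. The class name, the microword strings, and the ROM size are all hypothetical and are not taken from the NVAX documentation; the only point is that a patched address returns the PCS entry instead of the ROM word.

```python
class ControlStore:
    """Toy model: fixed micro ROM plus a small patchable overlay (hypothetical)."""

    def __init__(self, rom):
        self.rom = list(rom)   # fixed microcode words, never modified
        self.patches = {}      # address -> replacement microword (the "PCS")

    def patch(self, addr, word):
        # Only addresses that exist in the ROM can be patched in this model.
        if not 0 <= addr < len(self.rom):
            raise IndexError("patch address outside micro ROM")
        self.patches[addr] = word

    def fetch(self, addr):
        # A patched address behaves as though the ROM word had been overwritten.
        return self.patches.get(addr, self.rom[addr])


rom = ["uop_a", "uop_b_buggy", "uop_c"]   # illustrative microwords
cs = ControlStore(rom)
cs.patch(1, "uop_b_fixed")
```

After the patch, `cs.fetch(1)` yields the replacement word while `cs.rom` itself is unchanged, which is the sense in which this is a bug-fix overlay rather than a writable control store.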

Chuck Narad

unread,
Jul 21, 1994, 9:51:51 PM7/21/94
to

In article <Ct74A...@butch.lmsc.lockheed.com>, bi...@mothra.msd.lmsc.lockheed.com writes:
>
> Doesn't the Alpha have a writable control store? Actually it is called
> PALcode. I think it allows the programmer to create custom instructions, with
> complete control of the chip's resources. I am not sure of the details, but I
> think it was created so that the Alpha could emulate the VAX instruction set.
>
> Does anyone actually use the PALcode for speeding up algorithms, or is this
> even possible?

PALcode is just a defined interface to calls that use undocumented
hardware features. It is more locore than microcode; although
lots of the low-level code run on RISC systems might be considered
microcode (but don't tell the OS programmers! :-)

Guy Harris

unread,
Jul 21, 1994, 8:19:33 PM7/21/94
to
>Hmm. I'll bet you never heard of the company Nanodata, who
>sold a soft architecture machine called the QM-1 in the
>late 70's

I *have* heard of them, but, as far as I know, it was just a machine
with writable microcode - i.e., a conventional von Neumann machine in
which the interpreter for the instruction set can be replaced - not
anything like what Thorson was suggesting.

>Your observation
>seems logical, but historically it didn't work out that way,

I don't know that there's any historical data involved - have there ever
been any machines of the sort Thorson was discussing, i.e. something
with reconfigurable *hardware*, not just rewritable microcode?

Guy Harris

unread,
Jul 21, 1994, 8:27:27 PM7/21/94
to
>The "soft CPU" has already been done. 20+ years ago, in the commercial
>marketplace. Look at the Burroughs 1700 series.
>
>If I understand your thesis, this already did everything you are asking
>for.

I'm not so sure of that.

The impression I have is that "soft CPUs" of that sort were
re-microprogrammable machines - i.e., machines in which

1) the interpretation of the instruction set was done in
something that I have the impression is closer to software
than to hardware (i.e., microcode)

and

2) you're not stuck with a hardwired version of that interpreter
software.

The impression I have of what Thorson is discussing is that he's talking
about machines in which the *data paths* of the machine are
reconfigurable; did either the Nanodata QM-1 (as mentioned in another
posting) or the Burroughs 1700 series have that, or was the bottommost
control layer (e.g. bottommost layer of microcode) stuck with the data
paths built into the machine?

Mark Smotherman

unread,
Jul 22, 1994, 11:40:46 AM7/22/94
to
g...@nova.netapp.com (Guy Harris) writes:

>The impression I have of what Thorson is discussing is that he's talking
>about machines in which the *data paths* of the machine are
>reconfigurable; did either the Nanodata QM-1 (as mentioned in another
>posting) or the Burroughs 1700 series have that, or was the bottommost
>control layer (e.g. bottommost layer of microcode) stuck with the data
>paths built into the machine?

The Nanodata was extremely flexible for its day (ca. 1970), but it is
nowhere near the "soft" architecture that's being discussed. Its major
contribution was the combination of a vertical microcode level interpreted
by an underlying horizontal nanocode level.

See pp. 364-368 of John Hayes, Computer Architecture and Organization (2nd
ed.), McGraw-Hill, 1988, for a brief overview. (The 1978 1st edition also
has this section, but I can't cite the page numbers.)

The QM-1 had an 18-bit word size, but could operate in a 16-bit mode. This
allowed ease of emulation for 36-bit architectures as well as for 32-bit
architectures. Its flexibility derived from a set of programmable control
registers ("F store") which governed which micro-arch. registers were
attached to which bus; there were 12 independent busses, several of which
had fixed functions (main memory address, micro control store address, ...).
These control registers (and thus connections within the data path) were
changeable by the nanoinstructions.
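
A rough sketch of the F-store idea, assuming it can be modeled as one selector per bus. Only the bus count (12) comes from the description above; the register names, the class, and the helper function are hypothetical, and the fixed-function busses are not modeled.

```python
NUM_BUSES = 12  # from the description above; everything else is illustrative


class FStore:
    """Toy model: one selector per bus naming the register attached to it."""

    def __init__(self):
        self.select = [None] * NUM_BUSES

    def attach(self, bus, register):
        if not 0 <= bus < NUM_BUSES:
            raise ValueError("no such bus")
        self.select[bus] = register


def read_bus(fstore, registers, bus):
    # The value seen on a bus is whatever register the F store routes to it.
    return registers[fstore.select[bus]]


registers = {"R0": 5, "R1": 9}   # hypothetical micro-architectural registers
f = FStore()
f.attach(3, "R0")
first = read_bus(f, registers, 3)
f.attach(3, "R1")                # stands in for a nanoinstruction rewriting the F store
second = read_bus(f, registers, 3)
```

The same bus delivers a different register after the selector changes, which is the limited sense in which the QM-1 data path was "reconfigurable": the set of registers and busses is fixed, only the routing changes.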

Each nanoinstruction had a 360-bit format with five major fields: a control
field K that was active for the duration of the nanoinstruction and four
T fields that were active in sequence. Bits in the T fields could cause a
repetition of the nanoinstruction, so a repeat loop was easily implemented
within a single nanoinstruction. Normally each T field was active for one
clock cycle (75 ns), but a bit could double the period of the current T
field (to 150 ns) to account for propagation delay through the ALU. The
ALU could do various forms of 16 or 18 bit signed integer arithmetic:
2's compl., 1's compl., unsigned, and decimal.
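
The timing described above can be worked through in a short sketch. The 75 ns cycle, the four T fields, and the period-doubling bit come from the text; the function names are hypothetical, and the assumption that a repeat re-runs all four T fields is mine, not something the description states.

```python
CYCLE_NS = 75      # normal T-field period, from the text
NUM_T_FIELDS = 4   # four T fields per nanoinstruction, from the text


def pass_duration_ns(stretch_bits):
    """Duration of one pass through the T fields; a set bit doubles that field's period."""
    if len(stretch_bits) != NUM_T_FIELDS:
        raise ValueError("expected one stretch bit per T field")
    return sum(CYCLE_NS * (2 if stretched else 1) for stretched in stretch_bits)


def total_duration_ns(stretch_bits, passes=1):
    # Assumption: a repeated nanoinstruction re-executes all four T fields each pass.
    return passes * pass_duration_ns(stretch_bits)


plain = pass_duration_ns([False, False, False, False])       # 4 * 75 ns
with_alu = pass_duration_ns([False, True, False, False])     # 3 * 75 ns + 150 ns
looped = total_duration_ns([False, True, False, False], passes=3)
```

Under these assumptions an unstretched nanoinstruction takes 300 ns, and one with a single stretched T field takes 375 ns.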

--
Mark Smotherman, Computer Science Dept., Clemson University, Clemson, SC

John Ahlstrom

unread,
Jul 25, 1994, 1:09:38 PM7/25/94
to
In <30k2b...@uwm.edu> j...@alpha1.csd.uwm.edu (John G Dobnick) writes:

>The "soft CPU" has already been done. 20+ years ago, in the commercial
>marketplace. Look at the Burroughs 1700 series.

>If I understand your thesis, this already did everything you are asking
>for.

>[Is it really true... "There ain't nothing new under the sun?"]

I don't think so. The B1700 et seq had unchangeable hardware bases.
There were multiple traditional assembler languages (360/20, B300, 1130, 1401)
and multiple HLL-oriented "machine" languages (Fortran, Cobol, RPG and
a McKeeman-like System Definition Language SDL). These languages
were implemented/interpreted by different microprograms that were
loaded into the machine's writeable control store. There was no
underlying hardware reconfiguration.
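
A toy illustration of that distinction, assuming a machine whose primitive operations are fixed and whose writable control store only maps target-language opcodes onto them. The opcode names and the two "microprograms" are made up for the example and do not reflect actual B1700 microcode.

```python
# Fixed "hardware" primitives: loading a new microprogram never changes these.
HARDWARE_OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
}


class Machine:
    def __init__(self):
        self.control_store = {}   # writable: target opcode -> hardware primitive name

    def load_microprogram(self, microprogram):
        # A microprogram can only refer to primitives the hardware already has.
        unknown = set(microprogram.values()) - set(HARDWARE_OPS)
        if unknown:
            raise ValueError(f"no such hardware primitive: {sorted(unknown)}")
        self.control_store = dict(microprogram)

    def execute(self, opcode, a, b):
        return HARDWARE_OPS[self.control_store[opcode]](a, b)


m = Machine()
m.load_microprogram({"ADD": "add"})                     # hypothetical instruction set A
result_a = m.execute("ADD", 2, 3)
m.load_microprogram({"PLUS": "add", "MINUS": "sub"})    # hypothetical instruction set B
result_b = m.execute("MINUS", 7, 4)
```

Swapping the microprogram changes which instruction set the machine interprets, but a microprogram that asks for a primitive outside `HARDWARE_OPS` is rejected, since nothing below the control store can be reconfigured.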

In a world of small memories and high memory prices, this was
good engineering.

John Ahlstrom I can neither confirm nor deny
Boole & Babbage that these memories are or are not
j...@boole.com held by anyone else.


--

David P. Feldman

unread,
Jul 25, 1994, 3:58:07 PM7/25/94
to
In article <KOPPEL.94J...@omega.ee.lsu.edu> kop...@omega.ee.lsu.edu (David M. Koppelman) writes:

According to an article by Richard Sites in the
February 1993 Communications of the ACM (vol. 36, no. 2, pp.33-)
PALcode is written in instructions which have access to architecturally
invisible parts of the machine state. (E.g., registers that non-system
programmers cannot access.) PALcode would be used for implementing
parts of the OS. I didn't see anything about a writable microcode.

This definition of PALcode sounds to me to be a lot like micro-code. To me,
micro-code is what a CISC/RISC instruction expands to in the "decode" phase
of the pipeline. To be pedantic, this is horizontal microcode. Microcode
contains the command bits for the various units of the CPU such as register
port select number, register port read/write bit, ALU function number, etc.

However, another DEC chip, the NVAX, does have a patchable control
store (PCS). An item written in the PCS can replace a word of the
microcode (as though the micro ROM were overwritten). The patchable
control store is supposed to be used to fix minor bugs. The information
on the NVAX can be found in the Summer 1992 issue of the Digital
Technical Journal.

The VAX 750 had this as well. There were two microcode patch registers that
could be set to trap execution at certain places in the microcode. The
patch registers pointed into a downloadable section of the microcode I
think. There was a binary program that ran under Ultrix 2.0 that did the
patching.

Also, the VAX 780 had fully downloadable microcode, downloaded from the
LSI-11 console subsystem.

=-Dave-F->
--
_ /| Dave Feldman
\'o.O' My opinions are my own, and do not USA : 212-449-6655
=(___)= represent those of my employer, DFA : land of wonder
U either implied or expressed. and enchantment

Robin Fairbairns

unread,
Jul 26, 1994, 10:54:40 AM7/26/94
to
In article <DFELDMAN.94Jul25155807@darkwing.0>,

David P. Feldman <dfeldman@darkwing.0> wrote:
>In article <KOPPEL.94J...@omega.ee.lsu.edu>
>kop...@omega.ee.lsu.edu (David M. Koppelman) writes:
>
> According to an article by Richard Sites in the
> February 1993 Communications of the ACM (vol. 36, no. 2, pp.33-)
> PALcode is written in instructions which have access to architecturally
> invisible parts of the machine state. (E.g., registers that non-system
> programmers cannot access.) PALcode would be used for implementing
> parts of the OS. I didn't see anything about a writable microcode.
>
>This definition of PALcode sounds to me to be a lot like micro-code. To me,
>micro-code is what a CISC/RISC instruction expands to in the "decode" phase
>of the pipeline. To be pedantic, this is horizontal microcode. Microcode
>contains the command bits for the various units of the CPU such as register
>port select number, register port read/write bit, ALU function number, etc.

It's much more like system call handlers on machines that have
privileged instructions to handle registers.

On the 21064, there are 5 instructions that are available in PALmode
but not in what DEC call native mode. The registers (2 of) these
instructions give access to are things like TLBs, interrupt masks,
control registers, etc. They're at a different level from the set you
outline above; all of those things are dealt with, hard-wired, by the
instruction decode circuitry.

(In fact, most of the internal registers can be made accessible to
kernel [native] mode programs by the PALcode -- the exception is the
ITB handling registers, which are only accessible if you're not using
the ITB, as in PALmode.)

Contrariwise, PALcode has access to all the instructions that native
code can use (even PAL call instructions, with some care), which isn't
the sort of way I think of coding microcode.

Of course, PALcode also deals with first-level interrupt despatch,
which is something that _is_ done in microcode on machines like the
VAX.
--
Robin (Campaign for Real Radio 3) Fairbairns r...@cl.cam.ac.uk
U of Cambridge Computer Lab, Pembroke St, Cambridge CB2 3QG, UK
Private page: http://www.cl.cam.ac.uk/users/rf/robin.html

Brad Hutchings

unread,
Jul 22, 1994, 4:48:44 PM7/22/94