Minimal ALU instruction set.


Joseph H Allen

May 15, 1998, 3:00:00 AM

How about we change the criterion to make this discussion a little more
meaningful: what's the best possible general purpose processor you could
make given: one XC5202-6PC84C FPGA (64 configurable logic blocks; each with
four flip-flops. equiv. to about ~3000 gates. $7.40 each from DIGIKEY),
one 128Kx8 RAM (85ns), preloaded with the program to be run (total cost:
~$14.35). You must be able to handle interrupts.

Including an EPROM would make this even more realistic, but I don't want
a Harvard architecture machine with the ROM and RAM being used in parallel.
I chose 128K of RAM because you should deal with the fact that 64K is never
enough memory in practice. Keep in mind that you are going to be the one to
program this thing, so annoying banking/segmenting schemes may not be what
you want to have to deal with.

I had originally thought of this problem with the XC3020, but then the
XC5202 came out, which happens to be both cheaper and better (it has twice
as many FFs per CLB, plus dedicated carry logic).

I've thought about this problem for quite some time and it has given me a
greater appreciation of the old 8-bit processors. The basic problem is that
RISC does not work quite as well as a concept for narrow data paths, but
CISCish control logic takes up a huge amount of space.

When limited to the XC3020, the old 6502, sans the X register (since
(z-page,x) "preindex indirect" mode is not as useful as (z-page),y "post
index indirect" mode) and the BCD logic (although I've seen it used for
floating point libraries), plus bank switching for the RAM is close to the
best that you can do, but probably won't fit because of the complex control
logic (there are about 16 instruction sequences). There are also schemes
with no indexing (like the 8080), but as a programmer I want to be able to
access data structures efficiently, so I tried to avoid them. The trick
with the 6502 is that the pointer to your data structure is in z-page, and
the 8-bit Y register contains the offset. This ends up being about twice as
fast, compared to having to do the explicit double-precision add yourself.

What I think might fit is to get rid of the 8-bit Y register and 8-bit stack
pointer and instead reserve the first 16 bytes of RAM as base registers. So
the processor only has an 8-bit accumulator, a 16-bit program counter, the
carry flag, overflow flag and interrupt enable flag. You don't need zero or
negative flags, since the branch instructions can just test the accumulator
contents. You don't have to save the interrupt enable flag during
interrupts, but you do have to save carry and overflow. If you make all
instructions two-bytes long, you can then require them to be aligned and use
the LSB of the PC for one flag (like on the ARM chip). I'm just itching to
get rid of the overflow flag too, but it just sucks too much if you don't
have it. One possibility is to service interrupts only for every other
instruction (when the PC is evenly divisible by 4), and use the two least
significant bits for the flags. This is almost workable- the only problem
is that interrupts are delayed when you have a long sequence of branches to
odd word boundaries, which is unlikely but possible.

Most instructions have the format: |5-bit op-code|3-bit reg|8-bit offset|,
and the standard addressing mode is base-reg plus offset. Base reg 6 always
returns zero, so you also get zero-page direct. When the base reg is 7, the
offset is instead an immediate value for accumulator/immediate instructions.
Conditional branch instructions use the reg field for the condition code.
There is also a format for the jump and link instruction:
|2-bit op-code|14-bit direct address|. The direct address is shifted left
twice before being transferred to the program counter. The old program
counter and flags are stored in base register 6. Interrupts cause the jump
and link instruction to be invoked, but store the program counter and flags
in register 7 instead. There is a jump and link register instruction (with
the normal instruction format), which if used with register 6 or 7 causes a
return from subroutine or interrupt (so the pc to register transfer has to
be a swap). You can use the z-page direct mode to access register 6 or 7,
in case you want to save the values on a stack (perhaps with register 5).

This is pretty nice: only six instruction sequences: read/modify/write (for:
clr, inc, dec, cry (add carry), brw (subtract carry), rol, ror, lsl, lsr,
asr, sta); accumulator/memory and accumulator/immediate (add, adc, sub, sbc,
and, or, xor, lda); branch on condition; jump and link and jump and link
register. There are really only four sequences because the immediate modes
are the same as the memory modes with steps left out. The implementation
requires one temporary 16-bit register, the address register, the
instruction register, and data input/output registers in addition to the
program counter, accumulator and flags. You can use the ALU to increment
the PC, but you have to insert extra cycles when you cross page boundaries
(since the ALU is only 8 bits).

Total size alu+shift left (16 clbs), shift right on b-input of alu (4 clbs),
accumulator (4 clbs), temporary register (8 clbs), program counter (8 clbs),
instruction register (4 clbs), address register (0: use registers in
io-pads), data registers (0: use registers in io-pads), address register mux
(16 clbs): gives 56 clbs. Hmm... might just fit in an XC3020 if I replace
part of the address register mux with tri-state buffers.

The XC5202 changes things: you might be able to have 16 8-bit registers on
chip, and use 24-bit wide program counters and address pointers. Best of
all, you can make it a load-store machine, limit all instructions to two
memory accesses, and use pipelining (so all instructions execute in 2
cycles).

I thought about top of stack machines as well, but it's not workable. I've
found that it's slower than the equivalent register machine, and bigger,
because it requires more complex disjoint instruction sequences.

Anyway it would be cool to finish this project and see if anyone wants to
buy the core, or make a 'basic stamp' like module which contains optional
other logic in the FPGA, such as video, serial ports, IDE interface, video,
etc.

--
/* jha...@world.std.com (192.74.137.5) */ /* Joseph H. Allen */
int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0)
+r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p<1659?79:0:p>158?-79:0,q?!a[p+q*2
]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}

Bernd Paysan

May 15, 1998, 3:00:00 AM

Joseph H Allen wrote:
> I thought about top of stack machines as well, but it's not workable. I've
> found that it's slower than the equivalent register machine, and bigger,
> because it requires more complex disjoint instruction sequences.

I have some difficulty following this statement. If you do a small stack
machine, you would do it like a stripped down F21. Let's assume that you
use a format with 16 instructions (this is relatively comfortable, the
F21 has 32 opcodes, with about 20 of them actually used), so you can
load two instructions per memory cycle. Since memory is the bottleneck,
you can throw as many as 4 stack instructions into the game, where you
could previously execute one ALU instruction (two bytes - two read
accesses - I assume the data memory accesses don't change much). So even
if stack codes need more instructions to achieve the same job, it should
be faster. Well, this is academic, I haven't tried.

The register resources (8*16 bit) would be (bare minimum): PC, TOS, NOS,
A, and registers for deeper stacks as useful, perhaps two for
returnstack, two for data stack. Note that the stack elements are not
meant as memory stack - if it's full, data is lost (as with the
transputer).

--
Bernd Paysan
"Late answers are wrong answers!"
http://www.jwdt.com/~paysan/

David Tweed

May 15, 1998, 3:00:00 AM

Joseph H Allen wrote:

> How about we change the criterion to make this discussion a little more
> meaningful: what's the best possible general purpose processor you could
> make given: one XC5202-6PC84C FPGA (64 configurable logic blocks; each with
> four flip-flops. equiv. to about ~3000 gates. $7.40 each from DIGIKEY),
> one 128Kx8 RAM (85ns), preloaded with the program to be run (total cost:
> ~$14.35). You must be able to handle interrupts.

I think you should take a look at the architecture of the PDP-11. You might
need to substitute compiler-generated sequences for some of the fancier
addressing modes, but in general it sounds like a good fit for what you're
trying to do.

For one thing, limiting yourself to an 8-bit memory with an 85-ns access time
is going to limit your memory bandwidth to about 10 MBps, plus you have 17-bit
addresses to deal with. The FPGA should be capable of clocking at somewhere
around 30-40 MHz, so why not spring for 15 to 20 ns memory and organize it as
64K x 16? Then 16-bit registers are a natural fit for both data and addresses,
and you can make four accesses (e.g., two instruction fetches and two data
accesses) in the time that it takes the 8-bit memory to fetch one byte. This
would make the core a serious contender in many embedded applications.
--
David Tweed

Mike Albaugh

May 15, 1998, 3:00:00 AM

Joseph H Allen (jha...@world.std.com) wrote:
(Making a telling observation about "gate equivalents", albeit apparently
inadvertently)

: How about we change the criterion to make this discussion a little more
: meaningful: what's the best possible general purpose processor you could
: make given: one XC5202-6PC84C FPGA (64 configurable logic blocks; each with
: four flip-flops. equiv. to about ~3000 gates. $7.40 each from DIGIKEY),

: When limited to the XC3020, the old 6502, sans the X register (since
: (z-page,x) "preindex indirect" mode is not as useful as (z-page),y "post
: index indirect" mode) and the BCD logic (although I've seen it used for
: floating point libraries), plus bank switching for the RAM is close to the
: best that you can do, but probably won't fit because of the complex control
: logic (there are about 16 instruction sequences).


I don't know how many "gate equivalents" the smaller chip has, but
I'd note that a _real_ 6502 is "about 3200 transistors". It also cost,
last time I checked, <$2. Western Design Center is reportedly selling
a synthesizable version, although I don't know at what price. As another
sample point, a 32-bit RISC can be implemented in about 28K transistors.
("can be", as in, "I was on a team that did one, and several are currently
running code within a hundred feet of me, although those systems are
getting old", not "my estimates suggest")

Anyway, what is it about gate utilization that makes a ~~3K
transistor CPU too large for a ~~3K gate FPGA? Will this change any
time soon, or do we just get used to "YMMV" attached to "equivalent
gate" counts? BTW: as much as I hate the "Please email because I don't
read this group", note that this is cross-posted. _I_ read comp.arch pretty
regularly, but I don't read ..fpga, primarily because I am now officially
a "software guy", with no access to FPGA tools, and no interest in buying
a Wintel machine, and one of the "cheap" packages just to fool around with
them in "my spare time". I'm interested in the re-invention of "lean"
machines, and hope to follow this in comp.arch, but comments specifically
directed at "the error of my ways" in re FPGAs should be emailed.

Side note 2: I think if you spent much time doing Forth on a
6502, you'd have a better appreciation for indirect-X :-) Most of the
rest looks plausible, but I wonder about:

: You don't have to save the interrupt enable flag during
: interrupts, but you do have to save carry and overflow.

No direct access to the interrupt flag can make programming "safe"
multitasking, er, _interesting_.

: is that interrupts are delayed when you have a long sequence of branches to
: odd word boundaries, which is unlikely but possible.

Such "unlikely but possible" gotchas are the absolute bane of
folks trying to write reliable software. Just say no.

In general, I'd second the other post which mentioned doing
something more like a PDP-11, with the "registers" in a dedicated RAM
area. The 6502 was never particularly friendly to the sort of "Stepchild
of Algol" languages which are still the rage for embedded systems, and
if you really want to sell this beast, difficulty of software support
will be a big issue. OTOH, the most popular smallish machines (8051
and PIC) are not easy targets for such HLLs either. OTOOH, they are
a darn-sight cheaper than anything you could do with an FPGA :-)

Mike
| alb...@agames.com, speaking only for myself

Landon Dyer

May 15, 1998, 3:00:00 AM

In comp.arch David Tweed <dtw...@aristech.com> wrote:
:>For one thing, limiting yourself to an 8-bit memory with an 85-ns access time
:>is going to limit your memory bandwidth to about 10 MBps, plus you have 17-bit
:>addresses to deal with. The FPGA should be capable of clocking at somewhere
:>around 30-40 MHz, so why not spring for 15 to 20 nS memory and organize it as
:>64K x 16? Then 16-bit registers is a natural fit for both data and addresses,
:>and you can make four accesses (e.g., two instruction fetches and two data
:>accesses) in the time that it takes the 8-bit memory to fetch one byte. This
:>would make the core a serious contender in many embedded applications.

hmm... didn't someone do an ECL version of the 6502, back in the
late 80s? for the time, i understand, it was a screamer

some guys in a lab at mostek had had too many beers one friday
evening, and i'm informed that the world's record for a stock 6502
(circa 1982) was something like 24mhz before the enabling smoke leaked
out. tough little fella


--

lan...@best.com

"Sundial with a great hand, sweeping dust across the floor,
Puts a strain on a sane man, 'til he knows what he's looking for."
-- Kim Carnes

Jeffrey S. Dutky

May 15, 1998, 3:00:00 AM

Joseph H Allen wrote:
>
> How about we change the criterion to make this discussion a little more
> meaningful: what's the best possible general purpose processor you could
> make given: one XC5202-6PC84C FPGA (64 configurable logic blocks; each with
> four flip-flops. equiv. to about ~3000 gates. $7.40 each from DIGIKEY),
> one 128Kx8 RAM (85ns), preloaded with the program to be run (total cost:
> ~$14.35).

<BIG snip>

> (a bunch of talk about putting flags into the low order IP bits) One
> possibility is to service interrupts only for every other instruction
> (when the PC is evenly divisible by 4), and use the two least significant
> bits for the flags. This is almost workable- the only problem is that
> interrupts are delayed when you have a long sequence of branches to
> odd word boundaries, which is unlikely but possible.

Just require that all branch destinations be on four-byte boundaries. You
would only waste, on average, one byte per jump in the code, and you could
put your flags into the IP. While you're at it, shift the immediate value
used on branches two bits left; this gives you a longer "short branch" as
well as preventing misaligned branches.

- Jeff Dutky

Rickman

May 15, 1998, 3:00:00 AM

Joseph H Allen wrote:
>
> How about we change the criterion to make this discussion a little more
> meaningful: what's the best possible general purpose processor you could
> make given: one XC5202-6PC84C FPGA (64 configurable logic blocks; each with
> four flip-flops. equiv. to about ~3000 gates. $7.40 each from DIGIKEY),
> one 128Kx8 RAM (85ns), preloaded with the program to be run (total cost:
> ~$14.35). You must be able to handle interrupts.
--
I'm not sure why you want to invent a CPU from a $7 chip when there are
so many out there for $2, but here is my two cents worth.

Your goals are not clear. Are you shooting for minimal price, maximum
performance, some optimal trade off, or is this just a method of
entertainment?

There has always been a trade off between accessing registers vs. memory
in instructions. Registers are fast and allow instruction size to be
kept small, but the number of registers is limited and they slow the
machine when you need to save context.

Your CPU-on-an-FPGA most likely will not run a lot faster than the SRAM
attached, so you don't get a lot of speed improvement by having
registers on chip. One way to optimize this trade off in such a case is
to have registers in memory pointed to by a single internal register. TI
did this in their 9900 series of 16 bit processors. A "Workspace
Pointer" was the only on board register other than the PC and status (I
think) and pointed to R0 with R1 through R15 following in memory. This
was very slow compared to a RISC machine, but you won't be getting near
RISC speeds with a small Xilinx based architecture anyway. The
instructions can still be kept small with register based addressing and
context switches are fast; only the three internal registers need to be
saved.

TI did not have a stack pointer. You had to implement that in software
yourself. They normally used one of the registers as a "link" register.
You save your return address in R11 as you do a call after loading a new
workspace pointer. But each call was two words long, one for the call
address, one for the new workspace pointer. Another way to do this would
be to treat the workspace pointer as a stack pointer. Increment it by N
each time you call a subroutine. The subroutine can increment it by M
for local storage.

A scheme similar to this may be the optimal approach given your
scenario.

Rick Collins

ric...@XYwriteme.com

remove the XY to email me.

Joseph H Allen

May 15, 1998, 3:00:00 AM

In article <6jhp0e$a31$1...@void.agames.com>,

Mike Albaugh <alb...@agames.com> wrote:
>Joseph H Allen (jha...@world.std.com) wrote:
>(Making a telling observation about "gate equivalents", albeit apparently
> inadvertantly)

>: How about we change the criterion to make this discussion a little more
>: meaningful: what's the best possible general purpose processor you could
>: make given: one XC5202-6PC84C FPGA (64 configurable logic blocks; each with
>: four flip-flops. equiv. to about ~3000 gates. $7.40 each from DIGIKEY),

>: When limited to the XC3020, the old 6502, sans the X register (since
>: (z-page,x) "preindex indirect" mode is not as useful as (z-page),y "post
>: index indirect" mode) and the BCD logic (although I've seen it used for
>: floating point libraries), plus bank switching for the RAM is close to the
>: best that you can do, but probably won't fit because of the complex control
>: logic (there are about 16 instruction sequences).

> I don't know how many "gate equivalents" the smaller chip has, but
>I'd note that a _real_ 6502 is "about 3200 transistors". It also cost,
>last time I checked, <$2. Western Design Center is reportedly selling
>a synthesizable version, although I don't know at what price. As another
>sample point, a 32-bit RISC can be implemented in about 28K transistors.
>("can be", as in, "I was on a team that did one, and several are currently
>running code within a hundred feet of me, although those systems are
>getting old", not "my estimates suggest")

Well if you want to buy a $60 XC4006 FPGA you can pretty easily fit a MIPS
clone (2-cycles per instruction, 8-cycle pipeline, no barrel shifter
(1-cycle per bit shift), no cache, 32-cycle multiply and divides). They
claim that it's a 6000 gate device, but architecture differences have a huge
impact. Namely you can use the CLBs as 32x1 rams to implement the register
file. Of course this is hopelessly uncompetitive compared to a real CPU (IDT
R3000 clones cost $15 and include 2KB cache).

> Anyway, what is it about gate utilization that makes a ~~3K
>transistor CPU too large for a ~~3K gate FPGA?

The 3020 is just too small (they claim 2000 gates), but the 2 FFs per clb
and lack of carry logic are a big problem:

binary alu: 16 clbs (it has no carry logic, so it's 8 clbs for the xors and 8
clbs for a ripple carry). The carry logic can do left shift.
bcd corrector: 8 clbs
right shifter: 4 clbs
a, x, y, lo, hi, s, p, pcl, pch, ir : 44 clbs.
input register, output register, address register: 0 clbs (use io-pads).
~16 3-cycle sequences done with one-hot bit encoding: 24 clbs.
Instruction decoding, alu & etc. recoding: 10 clbs.

Total: ~106 clbs. The 3030 has 100 clbs, so maybe. The XC3020A cost $10.35
and the XC3030A cost $10.88. These were the previous bottom end chips, but
now the XC5202 is out ($7.40)- it is much better for CPUs: 4 FFs per CLB
plus dedicated carry logic (but no FFs in the pads). It is still only 3000
gates, but should pretty easily fit the 6502.

>Will this change any time soon, or do we just get used to "YMMV" attached
>to "equivalent gate" counts? BTW: as much as I hate the "Please email
>because I don't read this group", note that this is cross-posted. _I_ read
>comp.arch pretty regularly, but I don't read ..fpga, primarily because I am
>now officially a "software guy", with no access to FPGA tools, and no
>interest in buying a Wintel machine, and one of the "cheap" packages just
>to fool around with them in "my spare time".

Sheesh you're no fun! The "cheap" packages are actually pretty cheap: I
found The Xilinx "student edition" tools for XC4000 series devices for $150
at Quantum Books in Cambridge, MA.

> Side note 2: I think if you spent much time doing Forth on a
>6502, you'd have a better appreciation for indirect-X :-) Most of the
>rest looks plausible, but I wonder about:

I'm sure you can find some use for it, but (z-page,x) mode is still much
less useful than (z-page),y.

>: You don't have to save the interrupt enable flag during
>: interrupts, but you do have to save carry and overflow.

> No direct access to the interrupt flag can make programming "safe"
>multitasking, er, _interesting_.

Well you can still set or clear the flag. It just doesn't get saved during
interrupts. Instead, when you get an interrupt, it is set so that
interrupts are disabled. You got the interrupt, so you know that interrupts
had been previously enabled- so it's ok to reenable interrupts at the end of
the interrupt handler.

> In general, I'd second the other post which mentioned doing
>something more like a PDP-11, with the "registers" in a dedicated RAM
>area. The 6502 was never particularly friendly to the sort of "Stepchild
>of Algol" languages which are still the rage for embedded systems, and
>if you really want to sell this beast, difficulty of software support
>will be a big issue. OTOH, the most popular smallish machines (8051
>and PIC) are not easy targets for such HLLs either. OTOOH, they are
>a darn-sight cheaper than anything you could do with an FPGA :-)

Yeah, but the FPGA can have many more peripheral options. Also the PIC may
be cheaper, but the 8051 probably isn't. Have you seen recent PICs? They
have 8-pin SOIC versions... I'm waiting for the 6-pin SOT version.

There's got to be some story with those PICs. They are not new processors
at all: The General Instrument 1650 is listed in my Osborne 4&8 bit
processor handbook from 1981. I think that 'basic stamp' marketing thing
must have done it. Or maybe it happened when they came out with EPROM
versions. Anyway now they're everywhere.

John Eaton

May 15, 1998, 3:00:00 AM

David Tweed (dtw...@aristech.com) wrote:
: Joseph H Allen wrote:

: For one thing, limiting yourself to an 8-bit memory with an 85-ns access time
: is going to limit your memory bandwidth to about 10 MBps, plus you have 17-bit
: addresses to deal with. The FPGA should be capable of clocking at somewhere
: around 30-40 MHz, so why not spring for 15 to 20 nS memory and organize it as
: 64K x 16? Then 16-bit registers is a natural fit for both data and addresses,
: and you can make four accesses (e.g., two instruction fetches and two data
: accesses) in the time that it takes the 8-bit memory to fetch one byte. This
: would make the core a serious contender in many embedded applications.

: --
: David Tweed

Many FPGAs tout speeds up into the 100 MHz range. It sounds great until you
realize that the only design that will actually run that fast is a shift
register. This makes FPGAs the ideal candidate for a CPU design using serial
logic. With serial logic all registers are serial shift registers, and you can
load any one of them with a single-bit mux. An 8-bit full adder is simply a
one-bit full-adder stage and one carry flip-flop. It's a great speed/logic
trade-off.


John Eaton

Mike Albaugh

May 15, 1998, 3:00:00 AM

Joseph H Allen (jha...@world.std.com) wrote:
: In article <6jhp0e$a31$1...@void.agames.com>,
: Mike Albaugh <alb...@agames.com> wrote:

: >sample point, a 32-bit RISC can be implemented in about 28K transistors.
: >("can be", as in, "I was on a team that did one, and several are currently
: >running code within a hundred feet of me, although those systems are
: >getting old", not "my estimates suggest")

: Well if you want to buy a $60 XC4006 FPGA you can pretty easily fit a MIPS
: clone (2-cycles per instruction, 8-cycle pipeline, no barrel shifter
: (1-cycle per bit shift), no cache, 32-cycle multiply and divides). They
: claim that it's a 6000 gate device, but architecture differences have a huge
: impact. Namely you can use the CLBs as 32x1 rams to implement the register
: file. Of course this hopelessly uncompetitive compared to a real CPU (IDT
: R3000 clones cost $15 and include 2KB cache).

Yes, well the 28K-transistor RISC I mentioned did have a barrel-
shifter, no MUL/DIV but software MUL was 3..4 cycles/bit, div 5..6. (IIRC)
1 CPI other than ld/st, (no cache or separate I-bus, so LD/st took 2, no
load/use penalty) 4-stage pipe, but double-clocked, so it looked like
two... $6.21 in 1991. I know, apples and oranges, but we are still looking
at an order of magnitude in price/performance, and 7 years of progress.

: >now officially a "software guy", with no access to FPGA tools, and no
: >interest in buying a Wintel machine, and one of the "cheap" packages just
: >to fool around with them in "my spare time".

: Sheesh you're no fun! The "cheap" packages are actually pretty cheap: I
: found The Xilinx "student edition" tools for XC4000 series devices for $150
: at Quantum Books in Cambridge, MA.

Yeah, $150 for the software, $600 _minimum_ for a machine that
will run it (unless you think it will run under MacOS on the kid's
2ci, or Irix on my work Indy) plus another couple hundred for a decent
monitor; $1K is a bit past my "fooling around" budget. _And_ I've lately
had to have more contact with Windows than I ever expected or wanted,
and just don't see spending more such time as "relaxing" in any non-
sarcastic sense of the term...

: > No direct access to the interrupt flag can make programming "safe"
: >multitasking, er, _interesting_.

: Well you can still set or clear the flag. It just doesn't get saved during
: interrupts. Instead, when you get an interrupt, it is set so that
: interrupts are disabled. You got the interrupt, so you know that interrupts
: had been previously enabled- so it's ok to reenable interrupts at the end of
: the interrupt handler.

If you can't read it, you end up as so many novice RTOSes do,
with separate versions of routines that manipulate shared data structures:
one to be called if you are in an interrupt routine (where disabling
is not needed and enabling is deadly) and one when you are not (where
disabling around the critical section is deadly, and enabling mandatory).
Of course, any routine which _calls_ those routines must pick the right
one, and if _it_ can't "see" the interrupt flag, it also must come
in two flavors, ad nauseam. You run out of that 128K code space pretty
quickly with all that duplication. OK, if you code as if on a multi-
processor (single-writer rule etc.) you can do a bit better, but it
still sucks.

Rita Madarassy

May 15, 1998, 3:00:00 AM

In article <6ji388$4...@wwproxy.vcd.hp.com>,
John Eaton <jo...@vcd.hp.com> wrote:

>David Tweed (dtw...@aristech.com) wrote:
>: --
>: David Tweed
>
>Many FPGAs tout speeds up into the 100 MHz range. It sounds great until you
>realize that the only design that will actually run that fast is a shift
>register. This makes


I do not know what the latest FPGA you used was, but your statement is not
true, especially if you are using an XC4000XL-09 part. A shift register
design will run at speeds greater than 200 MHz when carefully floorplanned.
You can get it to run at more than 150 MHz just by adding timing constraints.

Any intelligent designer should be able to implement a design running at more
than 120 MHz if pipelining is an option and he has good knowledge of the
architecture he is targeting.

Steve Casselman

May 15, 1998, 3:00:00 AM

Rickman wrote:

> Joseph H Allen wrote:
> >
> > How about we change the criterion to make this discussion a little more
> > meaningful: what's the best possible general purpose processor you could
> > make given: one XC5202-6PC84C FPGA (64 configurable logic blocks; each with
> > four flip-flops. equiv. to about ~3000 gates. $7.40 each from DIGIKEY),
> > one 128Kx8 RAM (85ns), preloaded with the program to be run (total cost:
> > ~$14.35). You must be able to handle interrupts.

> --
> I'm not sure why you want to invent a CPU from a $7 chip when there are
> so many out there for $2, but here is my two cents worth.
>
> Your goals are not clear. Are you shooting for minimal price, maximum
> performance, some optimal trade off, or is this just a method of
> entertainment?

You might want to consider a scenario where you add a little flash
and a small PAL. You would then have the PAL boot the CPU design
into the FPGA from the flash, and as you went along you might have some
compute-intensive task to perform, so the CPU/FPGA tells the flash
to reconfigure into, say, a DSP to grind away, then switch back again
when you're done with that task. Now your $7 FPGA + $4 flash + $1
PAL = $2 CPU + $20 DSP. Since the XC5202 takes only 5302 bytes for a
configuration, you could fit quite a few "chips" in a 64K flash. In fact,
why not spend $2 for the CPU and have it do the configuration?

--
Steve Casselman, President
Virtual Computer Corporation
http://www.vcc.com

Brad Rodriguez

May 16, 1998, 3:00:00 AM

David Tweed wrote:
> I think you should take a look at the architecture of the PDP-11. You might
> need to substitute compiler-generated sequences for some of the fancier
> addressing modes, but in general it sounds like a good fit for what you're
> trying to do.

For what it's worth, those following this thread may be amused by my
"pathetic instruction set computer" made from discrete TTL. See
<http://www.zetetics.com/bj/papers/piscedu2.htm>. I've occasionally
described it as "PDP-11 meets 1802".

Of course, discrete TTL and FPGAs have different constraints. John
Rible did an interesting minimal CPU, the QS2 (see the citations in my
paper -- I don't know if it's on the web anywhere).

--
Brad Rodriguez, T-Recursive Technology bj...@aazetetics.com
Email address spoofed to block spam. Remove "aa" from name & domain.
Embedded software & hardware design... http://www.zetetics.com/recursiv
See the new CamelForth page at........ http://www.zetetics.com/camel

John Rible

May 16, 1998, 3:00:00 AM

Brad Rodriguez wrote:
> <snip>
>
> Of course, discrete TTL and FPGAs have different constraints. John
> Rible did an interesting minimal CPU, the QS2 (see the citations in my
> paper -- I don't know if it's on the web anywhere).

It's not, but I'll send an MSword6 description of it to email requests. I'm
using it for a course at UCSC Extension, where students add features in a
10-week design project. It's a 16-bit 'baby-RISC' with 8 registers in about 4k
QuickLogic gates.

John Rible \ SandPiper Technology voice: 408-458-0399
\ hardware, software and so forth... fax: 408-471-0175
jri...@sandpipers.com \ 317 California Street, Santa Cruz, CA 95060-4215

Mike Butts

May 17, 1998

In fact, a fine poster paper about a small FPGA-based processor
was given at FCCM last month. I'm so inspired by it, and by the
new availability of cheap FPGAs and tools, that I'm developing
a little minimum-cost FPGA+SRAM board and a similar little
processor design for it. I'm hoping to make it available by
the end of the year. (This is strictly a personal hobby effort
on my part, for fun, experimental and educational purposes.)

The poster paper is "A FPGA based Forth microprocessor", by
P.H.W. Leong, P.K. Tsang and T.K. Lee, of the Chinese U. of
Hong Kong. They developed a small 16-bit stack processor,
in VHDL, which synthesized into 150 CLBs of a Xilinx XC4013E-3.
The data and return stacks, 16 words deep each, are in CLB
RAM. It uses 4-bit (zero-operand) instructions, packed four to
a word, except for a 16-bit call instruction. It runs at 10 MHz
= 10 MIPS. They targeted a public domain Forth metacompiler
called SOD32 to it. Most of the 4013 is left over for other
hardware. Neat!
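Purely as an illustration of the packing idea (the CUHK processor's actual opcode encodings and nibble order are not given here, so both are invented), four 4-bit instructions per 16-bit word can be sketched like this:

```python
# Sketch of packing four 4-bit stack-machine opcodes into one 16-bit
# word, as in the Forth processor described above. The opcode values
# and MSB-first nibble order are illustrative assumptions.
def pack(ops):
    """Pack up to four 4-bit opcodes, first-executed in the top nibble."""
    assert len(ops) <= 4 and all(0 <= op < 16 for op in ops)
    word = 0
    for op in ops:
        word = (word << 4) | op
    return word << (4 * (4 - len(ops)))  # pad the tail (e.g. with NOP=0)

def unpack(word):
    """Return the four opcodes in execution order."""
    return [(word >> shift) & 0xF for shift in (12, 8, 4, 0)]
```

The 16-bit call instruction mentioned above would take a whole word instead of sharing one.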

Three years ago at FCCM '95, Philip Frieden showed us his
16-bit minicomputer (in a real box with lights and switches!)
that fit into part of an XC4005. So a small GP processor
certainly can be programmed into an FPGA with great success.
But with the FPGA we can extend the processor's functions
in application-specific ways, even dynamically on demand,
to get higher performance than hard-wired CPUs and ASICs, or
to get the same performance at lower cost. That's
reconfigurable computing.

Both Altera and Xilinx have recently made substantial FPGAs
and tools available cheaply. Altera's FLEX 6016 is $20 in
single quantities. It has 1320 LEs (LE = 4-LUT+FF), good for 8K
gates or more, and 117 I/Os in the cheap 144-pin TQFP package.
The full set of tools is free for that part from their web
site. No RAMs in the FLEX 6K, though, sigh. Xilinx's
Spartan series, which is similar to the XC4000 family,
is also low priced, and they are selling tools for $95,
or less included in that textbook. These are real, mainstream
FPGA architectures with real tools, that are finally available
at prices that individual experimenters and students can afford.
Finally, after all these years!

So I'm working up a little Altera 6016 + SRAM board. Now there
are fast registered/synch 1 Mbit SRAMs, at $9 in singles, thanks
to PC motherboard caches. Synchronous SRAMs are far more sensible
to deal with in an FPGA system, since there is no need to shape
a WE pulse mid-cycle that meets all the specs. It's damned hard
to do that and still go fast while observing worst-case specs, as
was recently discussed in comp.arch.fpga. With the synch SRAM,
your address, data in and WE just need to setup and hold around
a clock edge. So I'm hitching a 64K by 16 synch SRAM to the 6016.
Plus a parallel port connection for programming the 6016 and for
fast PC I/O, a serial line driver/rcvr, and a clock can, some
headers connected to uncommitted I/Os, and some buttons and blinky
lights. It will also plug onto a little hex keyboard and 64x2 LCD
module, a motor driver module, etc., that my partner in all this
already makes and sells for his 68HC11 robotics controller boards.
<http://www.teleport.com/~raybutts> This little board will be
a minimum stand-alone reconfigurable computer. I'm calling
it KIM-RC for 'Keep It Minimum' Reconfigurable Computer.

I'm developing a small 16-bit stack CPU design, using the
SRAM for code, data and stacks, and leaving most of the
FPGA for custom-configured instructions, and reconfigurable
computing facilities in general. And then port a little compiler
or two to the instruction set. This is easier than it sounds
in Forth; many Forths are designed for extreme portability
by being all written in Forth except for a few primitive words.
Some small C compiler could also be targeted to the board.
Or of course it could just be used directly as an FPGA+SRAM
module for whatever, too.

I would include the source design for the stack CPU with the
board, under the GPL, to encourage its use as a base for
reconfigurable computing. Others could add their own custom
stuff, I hope, and likewise put it out there under the GPL.

If all this works out, I'd like to do a second board, with
a flash RAM and a PLD, so many configurations can be stored
in the flash and loaded on demand by the FPGA, which would
save its state first in the SRAM. (Yes, Steve, I already
thought of that too... ;-) This would also provide some
non-volatility, so the little board could stand alone in
embedded apps.

But for starters, it will be plenty if we can just get the
little FPGA+SRAM board together and running by the end of
the year. The real trick will be hand-soldering the 20-mil
TQFPs to the board! We'd like to make KIM-RCs available to
students and experimenters for a low price, if it all
works out. I'll post progress reports, if people are
interested.

--Mike

Fitz

May 17, 1998

Mike,

Please keep us informed on your progress.

Thanks,
Fitz

msi...@tefbbs.com

May 17, 1998

Read a book called "Stack Computers" by Phil Koopman available free on
the net.

Some of the ideas in this book are being used in a Java Engine by
Patriot Scientific.

Simon
--

Opinions expressed herein are solely my own and may or may not reflect my opinion at this particular time or any other.

David R Brooks

May 17, 1998

Brad Rodriguez <bj...@aazetetics.com> wrote:

:David Tweed wrote:
...
:For what it's worth, those following this thread may be amused by my
:"pathetic instruction set computer" made from discrete TTL. See
:<http://www.zetetics.com/bj/papers/piscedu2.htm>. I've occasionally
:described it as "PDP-11 meets 1802".

Also my "Simplex-III". It's on my website (URL below).

Joseph H Allen

May 18, 1998

In article <355C7C91...@yahoo.com>,
Rickman <spamgo...@yahoo.com> wrote:

>Your goals are not clear. Are you shooting for minimal price, maximum
>performance, some optimal trade off, or is this just a method of
>entertainment?

Well... the priority is entertainment (of course), followed by price and
then performance.

This scheme is pretty nice because the workspace pointer can be relatively
small. A trick, if you want to allow 1/2 overlapping workspaces without
requiring an additional adder, is to precompute the workspace base address
and the base address plus 1/2 the number of registers. The most significant
register select bit could then select between the two base addresses.
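The precomputed-base trick can be sketched as follows; the 16-register workspace size and addresses are illustrative assumptions, not from the original post:

```python
# Sketch of the half-overlapping workspace trick described above:
# precompute the workspace base and base + half the register count,
# so decoding a register number needs no additional adder.
NREGS = 16

def make_workspace(base):
    """Return the two precomputed base addresses for a workspace."""
    return (base, base + NREGS // 2)

def reg_address(bases, regnum):
    """The most significant register-select bit picks a base; the low
    bits supply the offset. With 8-word-aligned bases this final add
    is just a concatenation in hardware."""
    low, high = bases
    msb = (regnum >> 3) & 1    # top bit of the 4-bit register select
    offset = regnum & 0x7      # low 3 bits
    return (high if msb else low) + offset
```

Overlapping the next workspace at `base + NREGS // 2` then gives the TMS9900-style fast context switch the scheme is after.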

Joseph H Allen

May 18, 1998

>Read a book called "Stack Computers" by Phil Koopman available free on
>the net.

>Some of the ideas in this book are being used in a Java Engine by
>Patriot Scientific.

I've been thinking about these stack processors, but I'm still not
convinced. They don't do particularly well with structure accesses or even
with simple block copies, and they tend to require that a lot of the stack
be cached on the FPGA (so they're bigger).

Someone mentioned trying to have two 4-bit instructions per fetch, but I
don't think it's really workable. You would not have space for swap, over,
or multiple word sizes for ! and @ -- which makes the block copy loop even
worse.

If I were using the XC4000 series they might be workable because of the
relatively large stack cache. But if I'm going to assume an XC4000, I might
as well assume wider memory paths, implement a MIPS or ARM clone and then
have GCC available for it.

As for Java engines... well I have just two words: "code generator". It's
difficult to see how a Java engine could possibly compete with a modern RISC
processor and Java with a real code generator tacked on to it.

David Tweed

May 18, 1998

John Eaton wrote:
> Many FPGA's tout speeds up into the 100 MHz range. It sounds great until you
> realize that the only design that will actually run that fast is a shift register.

That's why I suggested 30-40 MHz for a realizable CPU core
of the type we're talking about.

> This makes FPGA's the ideal candidate for a CPU design using serial logic.

I really should have thought of this myself. I once did an
FPGA design that had to do about 4-5 add/subtract operations
on integers that were around 15-20 bits wide, 386000 times
per second. Since my oscillator ran at 6.176 MHz, I had 16
clocks available per calculation whether I used them or not.
I designed a 2-bit wide data path that turned out to be an
excellent fit architecturally for the Actel FPGA that I was
using at the time.
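A bit-serial data path like that reduces to one full-adder stage plus a carry flip-flop, clocked once per bit. This sketch models a 16-bit serial add (the width is an illustrative choice, not David's actual design):

```python
# Sketch of bit-serial addition: one full-adder stage and a single
# carry flip-flop, fed LSB first, one clock per bit.
def serial_add(a, b, width=16):
    carry = 0                    # the carry flip-flop
    result = 0
    for i in range(width):       # one clock per bit
        abit = (a >> i) & 1
        bbit = (b >> i) & 1
        s = abit ^ bbit ^ carry                          # full-adder sum
        carry = (abit & bbit) | (carry & (abit ^ bbit))  # full-adder carry
        result |= s << i
    return result & ((1 << width) - 1)
```

At 16 clocks per operation there is exactly one calculation slot per sample in the 6.176 MHz / 386000-per-second budget described above.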

--
David Tweed

Keith Wootten

May 18, 1998

In article <Et568...@world.std.com>, Joseph H Allen
<jha...@world.std.com> writes
>>Read a book called "Stack Computers" by Phil Koopman available free on
>>the net.
>
>>Some of the ideas in this book are being used in a Java Engine by
>>Patriot Scientific.
>
>I've been thinking about these stack processors, but I'm still not
>convinced. They don't do particularly well with structure accesses or even
>with simple block copies, and they tend to require that a lot of the stack
>to be cached on the FPGA (so they're bigger).

I'm not sure what you mean about simple block copies - surely this is
just something which a chip either does well or doesn't do well
depending on the hardware available, irrespective of whether it's a
stack machine or not?

Load source address to register A
Load destination address to register B
Load count to register C

Read data at A, post incrementing A
Write data at B, post incrementing B
Decrement C and repeat till zero.

FWIW the PSC1000 does the above loop with three 8bit opcodes for up to
2^32 iterations, these three making 3/4 of a 32bit instruction group.

IMO you're right about the stacks - these ideally need to be on-chip for
efficiency and this can use (for an FPGA) a fair amount of resources.
An alternative is to have an external RAM for each of the two stacks
like the Novix chip had. This approach, of course, uses a lot of I/O as
it needs three data busses, one address bus for main memory, and two
small address busses for stack memory plus associated control lines.

[rest snipped]

Cheers
--
Keith Wootten

David R Brooks

May 18, 1998

da...@iinet.net.au (David R Brooks) wrote:

My last post left off the URL: here it is :)


-- Dave Brooks <http://www.iinet.net.au/~daveb>
PGP public key via <http://www.iinet.net.au/~daveb/crypto.html>, or servers

Rickman

May 18, 1998

Keith Wootten wrote:
...snip...

> IMO you're right about the stacks - these ideally need to be on-chip for
> efficiency and this can use (for an FPGA) a fair amount of resources.
> An alternative is to have an external RAM for each of the two stacks
> like the Novix chip had. This approach, of course, uses a lot of I/O as
> it needs three data busses, one address bus for main memory, and two
> small address busses for stack memory plus associated control lines.
--
The term "a lot of chip resources" is subjective, but I would implement
the data stack using the dual port RAM in a Xilinx part. I have always
questioned the way Xilinx designed the dual port RAM since you only get
a single bit of dual port in a CLB which can otherwise support two bits
of single port RAM.

I usually want to implement a FIFO which does a read and a write
simultaneously. That could be done while providing two RAM bits from a
CLB. But when you want to perform two READs from the RAM, as in a
register file or stack, this dual port implementation shines. With the
appropriate control logic, you can implement the top 16 words of stack
and perform almost any basic operation to the top of stack in one clock
cycle. When your stack gets full, you can write a couple of words from
the bottom of stack to memory. This could be thought of as a stack
"cache".

The Xilinx dual port RAM is also very good for straight register files.
You can read any two registers and write back to one of them in a single
clock cycle.

Joseph H Allen

May 19, 1998

In article <X0F+fHAU...@wootten.demon.co.uk>,

Keith Wootten <Ke...@wootten.demon.co.uk> wrote:
>In article <Et568...@world.std.com>, Joseph H Allen
><jha...@world.std.com> writes
>>In article <355f5eaf...@news.megsinet.net>, <msi...@tefbbs.com> wrote:
>>
>>>Read a book called "Stack Computers" by Phil Koopman available free on
>>>the net.
>>
>>>Some of the ideas in this book are being used in a Java Engine by
>>>Patriot Scientific.
>>
>>I've been thinking about these stack processors, but I'm still not
>>convinced. They don't do particularly well with structure accesses or even
>>with simple block copies, and they tend to require that a lot of the stack
>>to be cached on the FPGA (so they're bigger).
>
>I'm not sure what you mean about simple block copies - surely this is
>just something which a chip either does well or doesn't do well
>depending on the hardware available, irrespective of whether it's a
>stack machine or not?
>
>Load source address to register A
>Load destination address to register B
>Load count to register C
>
>Read data at A, post incrementing A
>Write data at B, post incrementing B
>Decrement C and repeat till zero.

>FWIW the PSC1000 does the above loop with three 8bit opcodes for up to
>2^32 iterations, these three making 3/4 of a 32bit instruction group.

Huh? Wouldn't you need something like:

1000 ; push destination address
2000 ; push source address
loop:
over ; dup destination
over ; dup source
@ ; replace source with contents
! ; write contents to destination
2 ; increment source
add
swap ; increment dest
2
add
swap
...

I.E., 10 instructions to move each word (unless I'm really missing something
about these 0-address top of stack machines). Maybe 8 if you have an
increment by 2 instruction. The load and store each require at least 2
cycles and the rest require at least one each, so that's 10 cycles per word.

A one-address machine with indexing (and a 16-bit accumulator) needs:

loop:
lda 0,r1
sta 0,r2
lda 1,r1
sta 1,r2
...

I.E., two instructions per move.
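The 10-versus-2 comparison can be checked with a toy model; the addresses, data, and per-opcode cost accounting here are illustrative, not a real simulator:

```python
# Toy count of the two inner loops compared above: the naive
# 0-address stack sequence costs 10 operations per word moved,
# the one-address machine with indexing costs 2.
mem = {0x2000 + 2 * i: 100 + i for i in range(4)}   # four source words

def stack_copy_word(dst, src):
    """One trip through the stack-machine loop:
    over over @ ! 2 add swap 2 add swap -- ten operations."""
    mem[dst] = mem[src]           # over over @ !        (4 ops)
    return dst + 2, src + 2, 10   # 2 add swap 2 add swap (6 more)

dst, src, total = 0x1000, 0x2000, 0
for _ in range(4):
    dst, src, ops = stack_copy_word(dst, src)
    total += ops

one_address_total = 2 * 4         # lda 0,r1 / sta 0,r2 per word
```

Both loops move the same four words; the stack version just executes five times as many instructions doing it.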

>IMO you're right about the stacks - these ideally need to be on-chip for
>efficiency and this can use (for an FPGA) a fair amount of resources.
>An alternative is to have an external RAM for each of the two stacks
>like the Novix chip had. This approach, of course, uses a lot of I/O as
>it needs three data busses, one address bus for main memory, and two
>small address busses for stack memory plus associated control lines.

>Keith Wootten

Bernd Paysan

May 19, 1998

Joseph H Allen wrote:
> >FWIW the PSC1000 does the above loop with three 8bit opcodes for up to
> >2^32 iterations, these three making 3/4 of a 32bit instruction group.
>
> Huh? Wouldn't you need something like:
>
> 1000 ; push destination address
> 2000 ; push source address
> loop:
> over ; dup destination
> over ; dup source
> @ ; replace source with contents
> ! ; write contents to destination
> 2 ; increment source
> add
> swap ; increment dest
> 2
> add
> swap
> ...
>
> I.E., 10 instructions to move each word (unless I'm really missing something
> about these 0-address top of stack machines).

No, you are just missing a few special case operations of the PSC1000.
There are two address registers (A and top of returnstack) which allow
addressing with postincrement mode. And there is a counter and a
decrement and branch if not zero instruction. The PSC1000 is not really
a MISC processor.

Keith Wootten

May 19, 1998

In article <Et6Lo...@world.std.com>, Joseph H Allen

<jha...@world.std.com> writes
>In article <X0F+fHAU...@wootten.demon.co.uk>,
>Keith Wootten <Ke...@wootten.demon.co.uk> wrote:
>>In article <Et568...@world.std.com>, Joseph H Allen
>><jha...@world.std.com> writes
>>>In article <355f5eaf...@news.megsinet.net>, <msi...@tefbbs.com> wrote:
>>>
>>>>Read a book called "Stack Computers" by Phil Koopman available free on
>>>>the net.
>>>
>>>>Some of the ideas in this book are being used in a Java Engine by
>>>>Patriot Scientific.
>>>
>>>I've been thinking about these stack processors, but I'm still not
>>>convinced. They don't do particularly well with structure accesses or even
>>>with simple block copies, and they tend to require that a lot of the stack
>>>to be cached on the FPGA (so they're bigger).
>>
>>I'm not sure what you mean about simple block copies - surely this is
>>just something which a chip either does well or doesn't do well
>>depending on the hardware available, irrespective of whether it's a
>>stack machine or not?
>>
>>Load source address to register A
>>Load destination address to register B
>>Load count to register C
>>
>>Read data at A, post incrementing A
>>Write data at B, post incrementing B
>>Decrement C and repeat till zero.
>
>>FWIW the PSC1000 does the above loop with three 8bit opcodes for up to
>>2^32 iterations, these three making 3/4 of a 32bit instruction group.
>
>Huh? Wouldn't you need something like:
>
> 1000 ; push destination address
> 2000 ; push source address
>loop:
> over ; dup destination
> over ; dup source
> @ ; replace source with contents
> ! ; write contents to destination
> 2 ; increment source
> add
> swap ; increment dest
> 2
> add
> swap
> ...
>
>I.E., 10 instructions to move each word (unless I'm really missing something
>about these 0-address top of stack machines).

[snipped]

Yes, if your stack machine were to be *only* a hardware implementation
of the 'standard' Forth virtual machine. Stack machines (actual Silicon
ones) never implement exactly and only this or any other virtual
machine, but always add extra useful instructions.

eg on the PSC1000 (using Patriot's syntax and after loading the three
registers)

align to 4 byte boundary

copying
ld[x++] \ push TOS with (x) and increment x by 4
st[r0++] \ pop TOS to (r0) and increment r0 by 4
mloop copying \ decrement ct and goto copying if non-zero

which is three 8bit opcodes. The mloop instruction works with up to
three preceding opcodes in the same 32bit (four instruction) memory
group. Other stack machines do differently, just as register based
machines do (eg Z80 block move), but I don't know of any which would be
limited to the Forth virtual machine. This is really my point - IMO
block moving ability is pretty much unconnected to the *fundamental*
chip architecture. In the above example, the stack isn't really used as
a stack, just a convenient transfer register.
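Modeling those three opcodes with explicit registers makes the point concrete; the 4-byte word size matches the 32-bit part, but the dict-based memory and function interface are illustrative:

```python
# Sketch of the PSC1000 inner loop above -- ld[x++], st[r0++], mloop --
# with x, r0 and the count ct as explicit registers.
def psc1000_block_copy(mem, x, r0, ct):
    """Copy ct words from address x to address r0, three opcodes/word."""
    while True:
        tos = mem[x]; x += 4      # ld[x++]: push with postincrement
        mem[r0] = tos; r0 += 4    # st[r0++]: pop with postincrement
        ct -= 1                   # mloop: decrement ct, branch if nonzero
        if ct == 0:
            break
```

As Keith says, the stack here is just a transfer register; the postincrementing address registers and the counted branch do all the work.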


Cheers
--
Keith Wootten

Mike Butts

May 19, 1998

There are a number of good reasons for doing a small CPU in
an FPGA:

* Cost and size: With a single $20 chip, you get a CPU, which is
as fast or faster than common 8-bit or 16-bit embedded control
CPUs, *plus* you get all your peripheral logic for ports,
timers, UART, etc. For example, the 68HC11 is still very
commonly used in small robotics apps, and it only runs at
a few MHz. I expect at least 10 MHz from a carefully designed
FPGA-based CPU; 30 MHz or better with good pipelining.

* Performance: You can use reconfigurable computing techniques
to customize CPU instructions, and/or include "accelerator"
hardware to increase performance on your specific app.

* Education: The guts of the CPU and its peripherals are
completely open for inspection and modification. It's
an ideal teaching tool for undergraduates to get hands-on
exposure to hardware architecture and systems design.

* Satisfaction: Building a small computer from the gates up
through the instruction set architecture up through the
OS and compiler up to apps is a very satisfying exercise,
akin to building one's own small sailboat. It can be a
potent antidote to the shrink-wrapped proprietary HW and
SW we generally must live with.

--Mike

Peter wrote:
>
> I may have missed the original post, but may I ask why anyone wants to
> do this project? Is it just an exercise?
>
> Many people have thought about doing a CPU in an FPGA, but AFAIK it is
> always a futile exercise because one can buy a CPU with a given
> capability for far less than the cost of the FPGA.
>
> Certainly one can build a simple microsequenced device in an FPGA, but
> by the time one has added support for stacks, interrupts, and such,
> the logic grows too much.
>
> If someone wants to do an exercise, what *would* be useful is a public
> domain design for something good and common, e.g. a Z80, in either
> schematics or VHDL. Then one could incorporate this into an ASIC -
> that is where a CPU core really comes in handy.
>
> Peter.
>
> Return address is invalid to help stop junk mail.
> E-mail replies to zX...@digiYserve.com but
> remove the X and the Y.

Jan Gray

May 19, 1998

Peter wrote in message <3561443d....@news.netcomuk.co.uk>...

>I may have missed the original post, but may I ask why anyone wants to
>do this project? Is it just an exercise?
>
>Many people have thought about doing a CPU in an FPGA, but AFAIK it is
>always a futile exercise because one can buy a CPU with a given
>capability for far less than the cost of the FPGA.

You're right. But what fun! I used to envy processor designers in industry
and academia. Now I can do my own processors, on-chip peripherals, cache,
etc. In fact, I have it far better. I can design the entire system, the
ISA, the microarchitecture, and get working hardware in a few days. In
contrast, the typical big company CPU designer works for months at a stretch
on a small piece of a huge and complex system. And there is a certain
pleasure in minimalism and self-sufficiency.

It is one thing to read about simple microarchitectures in H&P, it is
another to go build and debug and boot them. You can "squish the CLBs
between your toes" -- you become familiar with the same pipe stages, clock
speed, area, IPC tradeoffs, although your units are CLBs and ns rather than
rbes and ps.

The resulting designs are only as fast as seven year old commodity
processors, but that's OK. Maybe 20X a VAX is fast enough for your
application -- you don't need 200X a VAX. And whether you have a StrongARM,
an R4640, or a custom FPGA CPU, you are using the same external memory, more
or less -- cache misses still cost 100 ns.

True, commodity processors are cheaper on an absolute basis, especially if
you don't take into account total system cost. But FPGA prices are coming
down. By end of 1998, the Xilinx XCS20 will be $6.50 Q100K (ref:
http://www.xilinx.com/prs_rls/spartan.htm). This part, equivalent to the
XC4010 that hosts the J32 (1995), can implement a 33 MHz conventional
pipelined 32-bit RISC processor leaving 5,000 gates of logic for
system-on-chip peripherals. You will soon be able to build highly
integrated and customized glueless systems with just FPGA+SDRAM for ~$10.
And there is the soon-to-be-$3 XCS05, adequate for a nice little 10 MIPS
16-bit processor with logic to spare.


Implications/Predictions
(some from other folks)

* falling FPGA prices will eventually clamp an upper bound on the price of
many custom parts, including embedded CPUs

* RISC CPU design is no longer rocket science -- HDLs, tools, and the FPGA's
abstraction of all the hard EE, means that undergrads will increasingly
design their own processors. Of course, these designs will never compete
with commodity microprocessors for specmarks.

* a number of these designs will be published under GPL or put in the public
domain. There will be communities of users of certain free CPU designs,
similar to the open software movement. There will be GCC tools chains,
lunatic fringe Linux ports, etc.

* there will be free implementations of legacy ISAs. Or perhaps free
implementations of cross-assemblers/cross-loaders from legacy ISAs to
simplified minimalist FPGA CPU ISAs.

* embedded CPU vendors will start to ship with some FPGA on chip (Motorola
and Atmel have announced this).

Jan Gray
(J32 described at http://www3.sympatico.ca/jsgray/homebrew.htm)


Rickman

May 19, 1998

Mike Butts wrote:
>
> There are a number of good reasons for doing a small CPU in
> an FPGA:
>
> * Cost and size: With a single $20 chip, you get a CPU, which is
> as fast or faster than common 8-bit or 16-bit embedded control
> CPUs, *plus* you get all your peripheral logic for ports,
> timers, UART, etc. For example, the 68HC11 is still very
> commonly used in small robotics apps, and it only runs at
> a few MHz. I expect at least 10 MHz from a carefully designed
> FPGA-based CPU; 30 MHz or better with good pipelining.

What about the 50 MHz versions of the 8051 I see advertised? Surely this
is as fast as the custom CPU and would allow you to use a smaller FPGA.
I suspect that the total cost would be lower for a standard CPU/small
FPGA than for a bigger FPGA with custom CPU.


> * Performance: You can use reconfigurable computing techniques
> to customize CPU instructions, and/or include "accelerator"
> hardware to increase performance on your specific app.

Reconfigurable computing certainly has a lot of possibilities. But I
haven't seen very much to support it. I do know that it can be very
difficult to debug. Everyone knows how tricky it can be to debug
self-modifying code. Think about how hard it would be to debug a
self-modifying processor!!
8-O


> * Education: The guts of the CPU and its peripherals are
> completely open for inspection and modification. It's
> an ideal teaching tool for undergraduates to get hands-on
> exposure to hardware architecture and systems design.

I think that is an excellent point. When I was in grad school, I wanted
to implement a simulation of a microcoded teaching CPU that they used
for a couple of classes. Until then it had only existed on paper. But
the simulation would not have allowed the students to modify or extend
the design easily. This would allow an EE student to play to his hearts
content, and he might actually learn something. ;-)


> * Satisfaction: Building a small computer from the gates up
> through the instruction set architecture up through the
> OS and compiler up to apps is a very satisfying exercise,
> akin to building one's own small sailboat. It can be a
> potent antidote to the shrink-wrapped proprietary HW and
> SW we generally must live with.
>
> --Mike

This is certainly a good reason to do a project in your spare time. The
education you would receive is not trivial either. It appears that a lot
of people have done this with a Forth implementation in mind. Do you
think they are doing this because of the personal satisfaction
motivation or do you think they have commercial motivations in mind?


--

Ray Andraka

May 19, 1998

John Eaton wrote:

> Many FPGA's tout speeds up into the 100 MHz range. It sounds great until you
> realize that the only design that will actually run that fast is a shift
> register. This makes FPGA's the ideal candidate for a CPU design using serial
> logic. With serial logic all registers are serial shift registers and you can
> load any one with a single-bit mux. An 8-bit full adder is simply a one-bit
> full-adder stage and one carry flip-flop. It's a great speed/logic trade-off.
>
> John Eaton

Bit serial processing also translates to the routing structure nicely.
Most connections are very local, so the delays associated with long runs
and high fanouts don't exist, which means you can often get close to the
word rates of a fully parallel processor. The big thing to watch for if
you pack a lot of serial logic into one part is the power dissipation.
The serial logic can often be clocked at or near the advertised toggle
frequency, and it is possible to present data that will cause nearly
every flop in the design to toggle on every clock.

--
-Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email rand...@ids.net
http://users.ids.net/~randraka

The Andraka Consulting Group is a digital hardware design firm
specializing in high performance FPGA designs for digital signal
processing, computing and control applications.

Mike Butts

May 20, 1998

Rickman wrote:

> What about the 50 MHz versions of the 8051 I see advertised? Surely this
> is as fast as the custom CPU and would allow you to use a smaller FPGA.
> I suspect that the total cost would be lower for a standard CPU/small
> FPGA than for a bigger FPGA with custom CPU.

No question. Mostly the point of this little exercise is to do CPU+logic
with the smallest FPGA. Total cost of FPGA + SW = $20. Plus a hardwired
CPU would miss out on the reconfigurable processor angle, and would
place lower on the education and satisfaction scales. In any serious
commercial effort I don't think there's any question that silicon
is best spent on a fast hard-wired CPU, tightly coupled to FPGA.

> Reconfigurable computing certainly has a lot of possiblities. But I
> haven't seen very much to support it. I do know that it can be very
> difficult to debug. Everyone knows how tricky it can be to debug self
> modifing code. Think about how hard it would be to debug a self modifing
> processor!!

Oh, no, that's almost never the idea. Nearly all reconfigurable
computing applications are compiled, or should I say cross-compiled,
on conventional computers. One or more pre-compiled FPGA binaries are
downloaded at run time to get the computing done. Even that's hard
enough. Someday we hope to have FCCMs doing their own place and route
jobs at high speed, but even that is still a research topic.

> It appears that a lot
> of people have done this with a Forth implementation in mind. Do you
> think they are doing this because of the personal satisfaction
> motivation or do you think they have commercial motivations in mind?

Oh, gee, you'd have to ask on comp.lang.forth, but I'm pretty sure it's
99% personal satisfaction. The author of SOD32, the Forth system that
the folks in Hong-Kong used for their FPGA-based Forth processor
<http://www.xs4all.nl/~lennartb/forth.html> says he "wrote it just for
the hack value". (Not forgetting that most good things in computing got
started for just their hack value.)

--Mike

am...@nsof.co.il-n0spam

May 20, 1998

Mike Butts <mbu...@realizer.com> writes:

>There are a number of good reasons for doing a small CPU in
>an FPGA:

>* Cost and size: (...)
>* Performance: (...)
>* Education: (...)
>* Satisfaction: (...)

You've left out: *Versatility: Once the design is made, it's rather easy
to retarget it to the next version of chips, debug, add features, etc.

--
Amos Shapir
Paper: nSOF Parallel Software, Ltd.
Givat-Hashlosha 48800, Israel
Tel: +972 3 9388551 Fax: +972 3 9388552 GEO: 34 55 15 E / 32 05 52 N

Steven K. Knapp

May 20, 1998

Those interested in a CPU+FPGA for fun or education should take a look at
the relatively inexpensive board offered by XESS Corp.
(http://www.xess.com/FPGA/ho05000.html). It has an 8031, a 32K SRAM, and
either an FPGA or CPLD (your choice). Prices range from US$160 to US$350,
depending on the FPGA and whether you qualify for an educational discount.

The XESS board is also supported by the Xilinx Student Edition book, which
contains the Xilinx design software on CD-ROM. The book retails for $87 and
is available for sale on-line (http://www.optimagic.com/books.html#Xilinx).

-----------------------------------------------------------
Steven K. Knapp
OptiMagic, Inc. -- "Great Designs Happen 'OptiMagic'-ally"
E-mail: skn...@optimagic.com
Web: http://www.optimagic.com
-----------------------------------------------------------

Mike Butts wrote in message <356262AB...@realizer.com>...

Arnim Littek

May 20, 1998

In article <Et568...@world.std.com>,

Joseph H Allen <jha...@world.std.com> wrote:
>If I were using the XC4000 series they might be workable because of the
>relatively large stack cache. But if I'm going to assume an XC4000, I might
>as well assume wider memory paths, implement a MIPS or ARM clone and then
>have GCC available for it.


Since your original target was the 5202, i.e. the bottom end, and the
question of enough RAM for a stack cache has now come up, the newer baby
Spartan parts might fit the bill: they're in the same league
price/size-wise, but also have on-board distributed RAM. Dunno if the
toy tools available in book/CD formats include the Spartan parts as
targets yet, though.

FWIW,

Arnim.
--
Arnim Littek ar...@actrix.gen.nz
Actrix Networks Ltd. fax +64-4-801-5335
uucp/PPP/SLIP/BBS accounts tel +64-4-801-5225

lar...@ibm.net

May 20, 1998

CPUs in FPGAs are already being done at some schools. I
did a project as an undergrad at the University of Wisconsin
that used Xilinx FPGAs to implement a full student-designed
ISA, from start to finish. There are a lot of limitations with using
FPGAs (as we found... and this was 3 years ago...). Perhaps
the biggest was bus size...

Our implementation used only 8-bit registers (it was a dual-threaded
processor). The whole project (6 FPGAs, external SRAM,
external UART) was wirewrapped (a wiring extravaganza!) and
ended up running at about 8MHz. Quite a technical achievement.

We were also responsible for developing a compiler/assembler
and working applications (the wonder and amazement generated
by Tetris running on a processor built by a group of college students
is probably a whole lot more than any non-EE could realize:).

I'm sure the class is still being taught since it was one of the most
popular in the EE department.

Regards,
Shane
--


>I think that is an excellent point. When I was in grad school, I wanted
>to implement a simulation of a microcoded teaching CPU that they used
>for a couple of classes. Until then it had only existed on paper. But
>the simulation would not have allowed the students to modify or extend
>the design easily. This would allow an EE student to play to his hearts
>content, and he might actually learn something. ;-)
>
>
>> * Satisfaction: Building a small computer from the gates up
>> through the instruction set architecture up through the
>> OS and compiler up to apps is a very satisfying exercise,
>> akin to building one's own small sailboat. It can be a
>> potent antidote to the shrink-wrapped proprietary HW and
>> SW we generally must live with.
>>
>> --Mike
>
>This is certainly a good reason to do a project in your spare time. The

>education you would receive is not trivial either. It appears that a lot

Hamish Moffatt

May 21, 1998

In comp.arch.fpga Lar...@ibm.net <lar...@ibm.net> wrote:
> CPUs in FPGAs are already being done at some schools. I
> did a project as an undergrad at the University of Winsconsin,
> that used Xilinx FPGAs to implement a full student-designed
> ISA, from start to finish. There are a lot of limitation with using
> FPGAs (as we found... and this was 3 years ago...). Perhaps
> the biggest was bus size...

Here at RMIT (Melbourne, Australia) we did something similar. We implemented
the 6805 at the code-compatible level (not timing compatible), using
XC3090A parts, including wirewrapping the thing up. This was the data path
only, though; control logic was done by an external Am2901-based sequencer,
with custom software on PCs to control it all. This was the core third
year digital systems subject, and it was a lot of work at the time.

Just FYI,
Hamish
--
Hamish Moffatt, StudIEAust ham...@debian.org, ham...@moffatt.nu
Student, computer science & computer systems engineering. 4th year, RMIT.
http://hamish.home.ml.org/ (PGP key here) CPOM: [****** ] 66%
Nondeterminism means never having to say you are wrong.

Steven Groom

May 22, 1998


Fitz wrote:

> Mike,
>
> Please keep us informed on your progress.
>
> Thanks,
> Fitz
>
> Mike Butts wrote:
>
> > In fact, a fine poster paper about a small FPGA-based processor
> > was given at FCCM last month. I'm so inspired by it, and by the
> > new availability of cheap FPGAs and tools, that I'm developing
> > a little minimum-cost FPGA+SRAM board and a similar little
> > processor design for it. I'm hoping to make it available by
> > the end of the year. (This is strictly a personal hobby effort
> > on my part, for fun, experimental and educational purposes.)

I have a board already constructed with an Altera part (10K10) for
USD$200 PC104 bus, 512KEPROM 32KSRAM, RS232 LCD Display and manual.

There will be ADC/DAC board options as well, in the next rev.

I have written a micro (actually several) for this. All public domain.
Currently writing CAN controller, UART and FIR/FFT functions (public
domain when complete)

The software to compile for this part is free from Altera
(www.altera.com)


Steve.


Joseph H Allen

May 22, 1998

In article <3564B6A2...@ihug.co.nz>,
Steven Groom <steven...@arrow.co.nz> wrote:

>I have a board already constructed with an Altera part (10K10) for
>USD$200 PC104 bus, 512KEPROM 32KSRAM, RS232 LCD Display and manual.

>There will be ADC/DAC board options as well, in the next rev.

>I have written a micro (actually several) for this. All public domain.
>Currently writing CAN controller, UART and FIR/FFT functions (public
>domain when complete)

So what's it like? Where is it available?

Andrew Veliath

May 22, 1998

.........----------------==================----
..--==- Fri, 22 May 1998 11:20:03 +1200,
..--==- Steven Groom (SG) mentioned:

SG) Fitz wrote:

)) Mike,
))
)) Please keep us informed on your progress.
))
)) Thanks, Fitz
))
)) Mike Butts wrote:
))
)) > In fact, a fine poster paper about a small FPGA-based processor
)) > was given at FCCM last month. I'm so inspired by it, and by the
)) > new availability of cheap FPGAs and tools, that I'm developing >
)) a little minimum-cost FPGA+SRAM board and a similar little >
)) processor design for it. I'm hoping to make it available by > the
)) end of the year. (This is strictly a personal hobby effort > on
)) my part, for fun, experimental and educational purposes.)

SG) I have a board already constructed with an Altera part (10K10)
SG) for USD$200 PC104 bus, 512KEPROM 32KSRAM, RS232 LCD Display and
SG) manual.

SG) There will be ADC/DAC board options as well, in the next rev.

SG) I have written a micro (actually several) for this. All public
SG) domain. Currently writing CAN controller, UART and FIR/FFT
SG) functions (public domain when complete)

SG) The software to compile for this part is free from Altera
SG) (www.altera.com)

..
~~~~~~~~~================-------------......---

At Rensselaer last semester for a class project, I designed a small
8-bit register to register RISC processor with a stack and subroutines
and a healthy instruction set, and was able to test it on an XC4010E,
with 16 general purpose registers and a small LUT RAM and some
peripherals. Total resource count was about 331 of the 400 CLBs. It
seemed to be stable at a couple tens of MHz.

I had used the Cygnus GNU Win32 tools flex and bison to write an
assembler for it, and wrote some bit-manipulation code to go directly from
assembler to the INIT= statements for the internal RAMS on the XC4010E
which Xilinx uses in its user constraint files.

--

Andrew Veliath
andr...@usa.net, vel...@rpi.edu

Bernd Paysan

May 22, 1998

Lar...@ibm.net wrote:
> We were also responsible for developing a compiler/assembler
> and working applications (the wonder and amazement generated
> by Tetris running on a processor built by a group of college students
> is probably a whole lot more than any non-EE could realize:).

Tetris seems to be the standard application for this sort of experience.
About a year ago, I was involved in a FPGA based CPU approach, and my
part was compiler support (it was a Forth compiler, and I spent only a
weekend to do it). But the application to test it was Tetris.

Thomas Womack

May 24, 1998

In comp.arch Bernd Paysan <bernd....@remove.muenchen.this.org.junk> wrote:

: Lar...@ibm.net wrote:
: > We were also responsible for developing a compiler/assembler
: > and working applications (the wonder and amazement generated
: > by Tetris running on a processor built by a group of college students
: > is probably a whole lot more than any non-EE could realize:).

Yep; we had that at the Computation open day at Oxford yesterday, though
I think their Tetris was directly hardware-compiled rather than running
on an implemented microprocessor - and the board used was RAMless, which
was pretty impressive.

I was more impressed by 60fps interactive 800x600 Life - which is
non-trivial, and probably not even possible, on Standard Commercial
Hardware (OK, you can use cell lists rather than an array, but the
FPGA approach ran at constant speed independent of population). I
think the 64k of SSRAM attached to the FPGA helped.

Tom

Terje Mathisen

May 24, 1998

Mike Abrash used Conway's Game of Life as the target for his second
'Annual Code Optimization Challenge', a few years ago.

The target machine was either a 486 or very early Pentium, i.e. current
hardware is quite a bit faster:

The two joint winners both achieved twice the performance of my entry,
which is still capable of 400 fps in 320x200 resolution on a Pentium
MMX.

With linear scaling, this corresponds to 53 fps in 800x600, and more
than 100 fps for the two winners. On a modern PII even my program would
handle 60fps without breaking a sweat. :-)

One of the winners, David Stafford, used a cell list to reduce the
working set, while the other, a german guy (sorry, I don't remember his
name), used logic operations on 32-bit registers to implement the Life
counting logic directly.

This also means that the german entry probably implemented very nearly
the same algorithm as the FPGA setup above. :-)

Terje

--
- <Terje.M...@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen

May 25, 1998, to Thomas Womack

Thomas Womack wrote:
>
> In article <35686F...@hda.hydro.com> you wrote:
>
> >I wrote
>
> : > I was more impressed by 60fps interactive 800x600 Life - which is

> : > non-trivial, and probably not even possible, on Standard Commercial
> : > Hardware (OK, you can use cell lists rather than an array, but the
> : > FPGA approach ran at constant speed independent of population). I
> : > think the 64k of SSRAM attached to the FPGA helped.
>
> : Mike Abrash used Conway's Game of Life as the target for his second
> : 'Annual Code Optimization Challenge', a few years ago.
>
> : The target machine was either a 486 or very early Pentium, i.e. current
> : hardware is quite a bit faster:
>
> : The two joint winners both achieved twice the performance of my entry,
> : which is still capable of 400 fps in 320x200 resolution on a Pentium
> : MMX.
>
> That seems very interesting; would it be worth putting you in touch with
> the appropriate bit of the Hardware Compilation Group?

Sure, no problem.

> Have you a better reference for this than chapters 17 and 18 of the
> Big Black Book from Abrash? He discusses Stafford's solution in detail,
> ignoring Peter Klerings' one altogether, which seems a pity since Klerings'
> one sounds much closer suited to the FPGA.

Klering's code was actually fairly straightforward, except for a set of
flags used to detect static areas.

Skipping that part would still (most probably) let it run at the
required 60 fps, the code is 'just' a parallel implementation of the
counting logic:

Alive next iteration = (alive now AND (count == 2 OR count == 3)) OR
(not alive AND count == 3),

which simplifies to just:

Alive next iteration = (count == 3) OR (alive AND count == 2).

By including the cell itself in the count, then it becomes easier to
reuse the counting logic for multiple rows:

alive = (iCount == 3) OR (alive AND iCount == 4)

You need 4 bits to count to 8 (or 9), so 4 registers for counting plus
one for the center cells leaves one or two registers for array
addressing on an x86.
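
In C, the inclusive-count rule might be sketched like this (life_next
and icount are my names, not anything from the contest entries):

```c
/* Hedged sketch of the rule above: icount is the number of live cells
   in the 3x3 neighbourhood *including* the cell itself; life_next is
   a made-up name. */
static int life_next(int alive, int icount)
{
    return (icount == 3) || (alive && icount == 4);
}
```

Note that for a dead cell, icount equals the plain neighbour count, so
the two branches still match the classic birth/survival rule.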

Klering did a lot of work to simplify the logic as much as possible,
i.e. he didn't actually implement the full 'count-to-9' bitwise logic,
since it is possible to early-out many of the branches.

Implementing the same logic with MMX-style wide registers should make it
approximately twice as fast.

Torben AEgidius Mogensen

May 25, 1998

Terje Mathisen <Terje.M...@hda.hydro.com> writes:

>Klering's code was actually fairly straightforward, except for a set of
>flags used to detect static areas.

>Skipping that part would still (most probably) let it run at the
>required 60 fps, the code is 'just' a parallel implementation of the
>counting logic:

>Alive next iteration = (alive now AND (count == 2 OR count == 3)) OR
> (not alive AND count == 3),

>which simplifies to just:

>Alive next iteration = (count == 3) OR (alive AND count == 2).

>By including the cell itself in the count, then it becomes easier to
>reuse the counting logic for multiple rows:

> alive = (iCount == 3) OR (alive AND iCount == 4)

>You need 4 bits to count to 8 (or 9), so 4 registers for counting plus
>one for the center cells leaves one or two registers for array
>addressing on an x86.

>Klering did a lot of work to simplify the logic as much as possible,
>i.e. he didn't actually implement the full 'count-to-9' bitwise logic,
>since it is possible to early-out many of the branches.

>Implementing the same logic with MMX-style wide registers should make it
>approximately twice as fast.

I'm not sure if the following is the same as Klerings code, but the
approach sounds similar. The code below is based on code I got from
David Seal and then optimized slightly (removing two binary
operations). Initially, the variable "middle" contains a bitvector of
a number of cells. "up" and "down" contain the rows above and below
the row in question. "left" and "right" contain the same as middle,
except that they are shifted one bit left or right (with the
appropriate neighbouring bits shifted in). Similarly for "upleft" etc.
At the end, "newmiddle" contains the new values for the row
corresponding to "middle".

ones = up ^ upleft;
twos = up & upleft;
carry = ones & upright;
ones = ones ^ upright;
twos = twos ^ carry;

ones1 = down ^ downleft;
twos1 = down & downleft;
carry = ones1 & downright;
ones1 = ones1 ^ downright;
twos1 = twos1 ^ carry;

carry = ones & ones1;
ones = ones ^ ones1;
fours = twos & twos1;
twos = twos ^ twos1;
carry1 = twos & carry;
twos = twos ^ carry;
fours = fours ^ carry1; /* could be | */

carry = ones & left;
ones = ones ^ left;
carry1 = ones & right;
ones = ones ^ right;
carry = carry | carry1;
twos = twos ^ carry;

ones = ones | middle;
newmiddle = ones & twos & ~fours;

If we assume that ^, &, | and &~ are available as single-cycle
operations this takes 26 cycles to complete, some of which can be done
in parallel on a superscalar machine. If we add in slightly over a
dozen cycles for computing "upleft" etc., we get 40 cycles per
word-sized bitvector. If we make sure only to read each memory word
once, we should be able to update a wordlength of cells in 40 cycles
plus the time it takes to read and write a word. If the read/write is
to non-cached memory (as it will be if it goes to memory-mapped display
memory), a read will take a full memory cycle (though you can use
burst access) while the store can be through a write buffer and hence
take only one CPU cycle (which can be scheduled in parallel with other
operations). On a non-superscalar CPU (as e.g. StrongARM) with 200MHz
CPU and 33MHz memory, this works out to about 4 million words per
second, or 128 million cells/s. With a 800x600 display, we get more
than 250 frames per second. With a superscalar CPU and a non-blocking
cache, this can be improved considerably.

Torben Mogensen (tor...@diku.dk)


Tim Olson

May 25, 1998

In article <6kbnki$4...@grimer.diku.dk>, tor...@diku.dk (Torben AEgidius
Mogensen) wrote:

> I'm not sure if the following is the same as Klerings code, but the

This algorithm looks like the one described in the Smalltalk "blue book",
where a version of Life was implemented using BitBlt operations to
implement the cell counting in parallel.

--

-- Tim Olson

Terje Mathisen

May 25, 1998

Torben AEgidius Mogensen wrote:
[snip]

> I'm not sure if the following is the same as Klerings code, but the
> approach sounds similar. The code below is based on code I got from
> David Seal and then optimized slightly (removing two binary
> operations). Initially, the variable "middle" contains a bitvector of
> a number of cells. "up" and "down" contains the rows over and below
> the row in question. "left" and "right" contain the same as middle,
> except that they are shifted one bit left or right (with the
> appropriate neighbouring bits shifted in). Similarly for "upleft" etc.
> At the end, "newmiddle" contains the new values for the row
> corresponding to "middle".
>
> If we assume that ^, &, | and &~ are available as single-cycle

All of these except &~ (and-not) are available on all CPUs I know of; on
those which miss out, you can of course synthesize it in two cycles.

> operations this takes 26 cycles to complete, some of which can be done
> in parallel on a superscalar machine. If we add in slightly over a

Actually, it should be quite easy to get close to 2 IPC, because there's
a lot of independent operations all the way to the end.

> dozen cycles for computing "upleft" etc., we get 40 cycles per
> word-sized bitvector. If we make sure only to read each memory word
> once, we should be able to update a wordlength of cells in 40 cycles
> plus the time it takes to read and write a word. If the read/write is
> to non-cached meory (as it will be if it goes to memory mapped display
> memory), a read will take a full memory cycle (though you can use
> burst access) while the store can be through a write buffer and hence

As you've just discovered, to get max speed you must maintain a back
buffer in RAM, and then write updated blocks to the display.

It is also critical to have a one bit/pixel display mode, because
otherwise you'll be totally limited by write bandwidth.

I.e. working in 32-bit true color will increase the size of a full
screen buffer from 64K to 2MB.

The 120 MB/sec required write speed (for 60 fps) will definitely
overload a PCI bus, which has a (very) theoretical max speed of 133
MB/sec on a long burst.

> take only one CPU cycle (which can be scheduled in parallel with other
> operations). On a non-superscalar CPU (as e.g. StrongARM) with 200MHz
> CPU and 33MHz memory, this works out to about 4 million words per
> second, or 128 million cells/s. With a 800x600 display, we get more
> than 250 frames per second. With a superscalar CPU and a non-blocking
> cache, this can be improved considerably.

Anyway, this is the important point: A general cpu is more than fast
enough to handle this problem at full frame rate, as long as the code is
properly optimized.

When you've optimized the code, then you'll discover that the problem
really is memory bandwidth and nothing else.

On the regular VGA cards we had to target, writing a single pixel was so
slow that it was critical to minimize the number of writes to just those
pixels that actually changed.

In my code I stored 4 cells plus the neighborhood counts in a 16-bit
word, and then used a 64K lookup table to convert the current value to
the new state. If any of the 4 cells changed state, then I would write
the updated pixel(s) to display memory, and then lookup a pair of 32-bit
increments/decrements: The values needed to update the status of the
current line, and the lines above/below.
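
The 64K-table part of this could be sketched in C like so (step() is a
stand-in for whatever the actual packing of cells and counts was, which
I'm not reproducing here):

```c
/* Sketch of the 64K-entry lookup idea: precompute the successor of every
   possible 16-bit packed state once, so the inner loop becomes a single
   table index per word.  step() is a hypothetical stand-in for the real
   encoding of 4 cells plus neighbourhood counts. */
static unsigned short table[1 << 16];

static void build_table(unsigned short (*step)(unsigned short))
{
    for (unsigned long i = 0; i < (1UL << 16); i++)
        table[i] = step((unsigned short)i);
}

/* toy step for illustration only: complement every bit */
static unsigned short toy_step(unsigned short x)
{
    return (unsigned short)~x;
}
```

The per-word inner loop is then just a lookup, new = table[old], with
the delta handling layered on top.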

My program actually used fewer instructions per iteration than both
Stafford's and Klering's entries, but they both blew me away by keeping their
working sets small enough to fit (mostly) in the 8K L1 cache!

Some time after this I formulated my .sig. :-)

Jan Gray

May 25, 1998

Tim Olson wrote

>This algorithm looks like the one described in the Smalltalk "blue book",
>where a version of Life was implemented using BitBlt operations to
>implement the cell counting in parallel.


Another reference to the bitwise parallel approach is "Life Algorithms",
Mark Niemiec, Byte, Jan. 1979, pp 70-79. If I recall correctly, Mark, David
Buckingham, and friends, used Waterloo's Honeywell 66/60's EIS "move
mountain" instructions to animate 64K 36-bit words per iteration.

Inspired by Buckingham and the Blue Book, I wrote a bitblt version that did
800,000 cells in 34? bitblts on a Perq in 1983? and one that did 400,000
cells/s on an 8 MHz (1 M 2-operand 32-bit op/s) 68000 in 1985.

As Messrs. Mathisen and Mogensen describe, Life should run very fast on
modern processors (superscalar and multimedia enhanced and large caches).
64-bits, in 40 insns, in perhaps 15-20 clocks, at 3 ns/clock, e.g. 1 bit/ns.


FPGA Implementation: It is straightforward to run at full memory bandwidth.
For example, given an XCS20 and a 32Kx32 PBSRAM (32-bits in or out per 15 ns
clock) we can approach 32 bits/(2*15) ns, e.g. 1 bit/ns.

Since a given line is read three times (as "below", "current", and "above"),
we buffer 2 lines of cells in RAM in the FPGA. A 1024 x n playfield
requires 2 x 1024 bits = 64 CLBs of single port RAM, and preferably 3 x 1024
bits for 3 banks since each clock you must read from up to two lines and
write to a third.

Detailed design/floor plan. One bit requires approx. 9 CLBs. Assuming the
cell neighbours are (a,b,c,d,e,f,g,h), we need :-
3 CLBs RAM -- 3 32x1 RAMs (3 banks of line buffer)
6 CLBs logic --
1 s0a=a^b^c^d; s0e=e^f^g^h
2 s1a="a+b+c+d == 2 or 3"; s1e="e+f+g+h == 2 or 3"
3 s2a=a&b&c&d; s2e=e&f&g&h
4,5 (s3,s2,s1,s0)=(s2a,s1a,s0a) + (s2e,s1e,s0e) (uses dedicated carry
logic)
6 new = ~s3&~s2&s1&(s0|old)
and so in a 20x20 CLB XCS20, we explicitly place 16 rows of 1x9 CLB tiles in
the left half, another 16 in the right half, leaving plenty of room to spare
for control and address generation.


At the 1997 FPGAs for Custom Computing Machines conference, the paper "The
RAW Benchmark Suite" by Babb et al proposed a set of benchmarks for
comparing reconfigurable computing systems. One of the 12 benchmarks was
Life, for which they reported speedups of several hundred times over a
SparcStation20+software approach, but in fairness, they write "we are not
currently implementing known improvements to the software to take advantage
of the bit-level parallelism available in a microprocessor".

Summary. Hypothetically...
Fast microprocessor + cache: ~1 bit/ns
Single FPGA + SRAM custom machine: ~1 bit/ns

Jan Gray


Tom Rokicki

May 25, 1998

> ones = up ^ upleft;
> twos = up & upleft;
> carry = ones & upright;
> . . .

You can beat this (in terms of number of logical operations and shifts)
by quite a bit. Here, `g' is the original, and only input.

sl3=(sl2=(a=left(g))^(b=right(g)))^g
sh3=(sh2=a&b)|(sl2&g)

sll=(a=up(sl3)^(b=down(sl3)))^sl2
slh=(a|(b^sl2))^sll
a=up(sh3)^(b=down(sh3))
g=(a^sh2^slh)&((a|(b^sh2))^slh)&(sll|g)

I believe that's 19 logical operations, one left, one right, two ups
and two downs (this assumes the ups and downs are cheap).

> Actually, it should be quite easy to get close to 2 IPC, because there's
> a lot of independent operations all the way to the end.

You bet!

> As you've just discovered, to get max speed you must maintain a back
> buffer in RAM, and then write updated blocks to the display.

Yep, and you need to block the algorithm appropriately so it fits in cache.
This is pretty easy to do; just do the above algorithm in appropriately
sized strips. Further, it's pretty easy to block out (not process)
areas that are static or oscillating with period 2 (which are terribly
common in Life); I generally use two alternating buffers and keep a
`superbitmap' of those chunks that are changing with period >2.

> The 120 MB/sec required write speed (for 60 fps) will definitely
> overload a PCI bus, which has a (very) theoretical max speed of 133
> MB/sec on a long burst.

Which is why you do the delta. Indeed, what I did is `stupider' than
that. There's no sense updating the display at greater than the
frame rate but it's easy to calculate at greater than frame rate.
So I don't update on every generation, just on every frame. And then
I only update the deltas, which are often quite small compared to the
real data.

> When you've optimized the code, then you'll discover that the problem
> really is memory bandwidth and nothing else.

I'm not so sure about this; it's pretty easy to make the loads/stores
overlap pretty well. Of course, I did it on the 68000 where there are
enough registers; I'm not sure about the x86 world.

The above code was completely designed by me, although I'm sure others
have found a similar solution.

(I actually implemented the above on an HP calculator in user-RPL, and
on the Amiga, both with the blitter and in assembler. I keep meaning
to get around to speeding up xlife but never can seem to find the time.)

Here's 48G code for anyone who cares; just put a GROB on the stack and
hit `GEN':

GEN << WHILE 1 REPEAT DUP ->LCD GEN1 END >>
GEN1 << {#0 #1} DUP2 SH OVER LX ROT REVLIST
SWAP OVER SH 5 ROLLD 4 ROLLD SH 4 PICK LX
ROT 3 PICK + NEG + 4 ROLLD NEG + LX + NEG >>
SH << DUP2 OVER DUP ROT {#FFFh #FFFh} SUB
LX 3 DUPN 7 ROLLD GXOR 5 ROLLD GXOR + >>
LX << {#0 #0} SWAP GXOR >>

-tom


Matt Aubury

May 26, 1998

Terje Mathisen wrote:
> Anyway, this is the important point: A general cpu is more than fast
> enough to handle this problem at full frame rate, as long as the code is
> properly optimized.

I've only just seen this discussion, but I'm the guy who wrote the
demonstration at Oxford. I should emphasise that I wasn't trying to
show the power of FPGAs particularly, more just trying to make an
attractive demo, so getting it running at the full frame rate was my
sole aim also. It is trivial to compute more than one cell per cycle,
but as you rightly point out the problem will quickly become one of
memory bandwidth. With the system we're using, I could probably get 10
cells per cycle (giving us around 600fps) before that became the
bottleneck.

I recently extended the program in a different way: I wrote a fairly
general cellular automata harness in which each cell has four bits of
state, and just about any automata you want to try out can simply be
plugged in. These should all then run at the full frame rate, as you
have around four or five levels of pipelining available. I think
multistate automata like this could present a pretty difficult
challenge for conventional processors: it really is just one of those
things that FPGAs are very good at.

Shameless plug: the general automata (including memory interfaces, VGA
display and serial mouse interface for interacting with the automata)
was done in around 700 lines of Handel-C code. If anybody wants a
copy, I'll gladly send it out (send mail to m...@comlab.ox.ac.uk). The
Hardware Compilation Group homepage is at:

http://www.comlab.ox.ac.uk/oucl/hwcomp.html

Cheers,
Matt

--
Matt Aubury, Oxford University Computing Laboratory

Robert Bernecky

May 26, 1998, to Jan Gray

Another parallel implementation of Conway's Life is given
by Eugene McDonnell in "Life: Nasty, Brutish, and Short",
in ACM SIGAPL APL88 Conference Proceedings. Eugene evolves
a number of algorithms in dialects of APL, ending up with
a 9-token expression for one iteration.

Bob

Ian_Ameline

May 26, 1998

Robert Bernecky wrote:
>
> Another parallel implementation of Conway's Life is given
> by Eugene McDonnell in "Life: Nasty, Brutish, and Short",
> in ACM SIGAPL APL88 Conference Proceedings. Eugene evolves
> a number of algorithms in dialects of APL, ending up with
> a 9-token expression for one iteration.
>

The most parallel implementation I've ever seen is "Life in the Stencil
Buffer" on page 407 of the OpenGL Programming Guide.

--
Regards, | No sense being pessimistic --
Ian Ameline, | It wouldn't work anyway.
Senior Software Engineer, |
Alias/Wavefront |

Terje Mathisen

May 26, 1998

Matt Aubury wrote:
>
> Terje Mathisen wrote:
> > Anyway, this is the important point: A general cpu is more than fast
> > enough to handle this problem at full frame rate, as long as the code is
> > properly optimized.
>
> I've only just seen this discussion, but I'm the guy who wrote the
> demonstration at Oxford. I should emphasise that I wasn't trying to
> show the power of FPGAs particularly, more just trying to make an
> attractive demo, so getting it running at the full frame rate was my
> sole aim also. It is trivial to compute more than one cell per cycle,
> but as you rightly point out the problem will quickly become one of
> memory bandwidth. We the system we're using, I could probably get 10
> cells per cycle (giving us around 600fps) before that became the
> bottleneck.

Nice! :-)

> I recently extended the program in a different way: I wrote a fairly
> general cellular automata harness in which each cell has four bits of
> state, and just about any automata you want to try out can simply be
> plugged in. These should all then run at the full frame rate, as you
> have around four or five levels of pipelining available. I think
> multistate automatas like this could present a pretty difficult
> challenge for conventional processors: it really is just one of those
> things that FPGAs are very good at.

This could be solved with runtime code generation as well, compiling an
optimized set of binary logic ops on the fly, or (much simpler), by
embedding the cell rules in lookup tables.

This is actually one of my favourite ways to solve many kinds of
problems, generating one or more tables at runtime, which implements all
the required logic.

This is basically a state machine, which will almost always run at
whatever speed the tables can support the (nested) lookups.

My single favourite program is a 16-bit version of Word Count, which
handles user-specified word and line separators, i.e. it can handle both
CR and LF by themselves, as well as the CRLF combination.

This program processes 256 chars in the inner loop, with zero
compare/test/branch operations. It needs just 1.5 instructions/byte, so
it is probably faster than any kind of disk or even main memory! :-)
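
A much-simplified sketch of that table-driven style (all names made up,
8-bit and two states here; the real program is 16-bit and also handles
the CRLF pair properly):

```c
#include <string.h>

/* Two-state machine (between words / inside a word) whose transition
   and word-count tables are built once, so the per-character loop does
   no compares or branches on the data itself. */
enum { OUT = 0, IN = 1 };

static unsigned char next_state[2][256];
static unsigned char word_inc[2][256];   /* 1 when a new word starts here */

static void build_tables(const char *seps)
{
    for (int s = 0; s < 2; s++)
        for (int c = 0; c < 256; c++) {
            int is_sep = (c != 0 && strchr(seps, c) != NULL);
            next_state[s][c] = (unsigned char)(is_sep ? OUT : IN);
            word_inc[s][c] = (unsigned char)(s == OUT && !is_sep);
        }
}

static unsigned long count_words(const unsigned char *p, size_t n,
                                 const char *seps)
{
    unsigned long words = 0;
    int s = OUT;
    build_tables(seps);
    for (size_t i = 0; i < n; i++) {  /* pure table lookups on the data */
        words += word_inc[s][p[i]];
        s = next_state[s][p[i]];
    }
    return words;
}
```

All the separator policy lives in build_tables(), which is what lets
user-specified separators come for free in the inner loop.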


> Hardware Compilation Group homepage is at:
>
> http://www.comlab.ox.ac.uk/oucl/hwcomp.html

Interesting, although it seems like some of the sample applications didn't
run too well, i.e. the real-time video image warper is a much simpler
application than a sw MPEG-2 decoder, and it still ran at just 9 fps.

Is this due to poorly optimized (handel-C) source code, or just that the
task isn't very well suited to an FPGA implementation?

Matt Aubury

May 26, 1998

Terje Mathisen wrote:

> Matt Aubury wrote:
> > I recently extended the program in a different way: I wrote a fairly
> > general cellular automata harness in which each cell has four bits of
> > state, and just about any automata you want to try out can simply be
> > plugged in. These should all then run at the full frame rate, as you
> > have around four or five levels of pipelining available. I think
> > multistate automata like this could present a pretty difficult
> > challenge for conventional processors: it really is just one of those
> > things that FPGAs are very good at.
>
> This could be solved with runtime code generation as well, compiling an
> optimized set of binary logic ops on the fly, or (much simpler), by
> embedding the cell rules in lookup tables.

I think you're going to have a problem there: the total state coming
into the lookup table is going to be the eight neighbours plus the
central cell, each with four bits of state, so that's 36 bits of input
data to 4 bits of output. A 32 GB lookup table might be a touch
cumbersome! :-)
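A quick size check (my arithmetic, spelled out) confirms the figure even
with the 4-bit outputs packed two per byte:

```python
# 2^36 table entries, at 4 bits apiece, tightly packed:
entries = 2 ** (9 * 4)           # eight neighbours + centre, 4 bits each
table_bytes = entries * 4 // 8   # 4-bit entries, two per byte
gigabytes = table_bytes // 2 ** 30   # 32
```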

Runtime code generation might well work, but it isn't exactly easy. I
wonder how well a partial evaluator, like TEMPO
(http://www.irisa.fr/compose/tempo/), would work on a problem like
this...

> > Hardware Compilation Group homepage is at:
> > http://www.comlab.ox.ac.uk/oucl/hwcomp.html
>
> Interesting, although it seems like some of the sample applications
> didn't run too well; e.g. the real-time video image warper is a much
> simpler application than a software MPEG-2 decoder, and it still ran
> at just 9 fps.

Ack!

> Is this due to poorly optimized (handel-C) source code, or just that
> the task isn't very well suited to an FPGA implementation?

In the case of that particular version, the problem was with the host
board and its interface. Since that demo was created, I've written my
own version which runs happily at 60fps (although that's running on a
static image, it would be fairly trivial to extend it to video). We
have a real-time warp of Bill Gates' face which entertains quite a few
visitors!

Bruce Hoult

May 27, 1998, 3:00:00 AM

m...@comlab.ox.ac.uk (Matt Aubury) writes:

> Terje Mathisen wrote:
> > This could be solved with runtime code generation as well, compiling an
> > optimized set of binary logic ops on the fly, or (much simpler), by
> > embedding the cell rules in lookup tables.
>
> I think you're going to have a problem there: the total state coming
> into the lookup table is going to be the eight neighbours plus the
> central cell, each with four bits of state, so that's 36 bits of input
> data to 4 bits of output. A 32 GB lookup table might be a touch
> cumbersome! :-)

While that's true, I'm sure your hardware solution isn't using a
full-width sum-of-products implementation either. Anywhere you use
cascaded logic, a software implementation can use the exact same
cascaded logic or cascaded table lookups.
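Bruce's cascading point can be sketched concretely (my construction,
with a toy rule): if we assume the automaton is outer-totalistic, i.e.
the next state depends only on the centre cell and the sum of the eight
4-bit neighbours, the one huge table collapses into a cascade of small
ones.

```python
# Stage 1: a 256-entry table sums two packed 4-bit neighbour values.
pair_sum = [(i & 15) + (i >> 4) for i in range(256)]

# Stage 2: a small final table indexed by (centre, neighbour total).
# Toy rule, purely for illustration: next = (centre + total) mod 16.
rule = [[(c + s) & 15 for s in range(121)] for c in range(16)]

def step_cell(centre, nbrs):
    # Cascade the eight neighbours pairwise through pair_sum, then
    # finish with one rule lookup: five small lookups and a few adds
    # replace the single 32GB table.
    a = pair_sum[(nbrs[0] << 4) | nbrs[1]]
    b = pair_sum[(nbrs[2] << 4) | nbrs[3]]
    c = pair_sum[(nbrs[4] << 4) | nbrs[5]]
    d = pair_sum[(nbrs[6] << 4) | nbrs[7]]
    return rule[centre][a + b + c + d]
```

The outer-totalistic assumption is the key restriction; a fully general
36-bit rule needs a different decomposition, but the cascading principle
is the same.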

-- Bruce

--
'We have no intention of shipping another bloated operating system and
forcing that down the throats of our Windows customers'
-- Paul Maritz, Microsoft Group Vice President

Terje Mathisen

May 27, 1998, 3:00:00 AM

Bruce Hoult wrote:
>
> m...@comlab.ox.ac.uk (Matt Aubury) writes:
> > Terje Mathisen wrote:
> > > This could be solved with runtime code generation as well, compiling an
> > > optimized set of binary logic ops on the fly, or (much simpler), by
> > > embedding the cell rules in lookup tables.
> >
> > I think you're going to have a problem there: the total state coming
> > into the lookup table is going to be the eight neighbours plus the
> > central cell, each with four bits of state, so that's 36 bits of input
> > data to 4 bits of output. A 32 GB lookup table might be a touch
> > cumbersome! :-)

Actually, since the output is just 4 bits, I'd pack two of them into a
single byte; at 2^36 entries that still leaves a 32GB table, though.
:-)

> While that's true, I'm sure your hardware solution isn't using a
> full width sum-of-products implementation either. Anywhere you use
> cascaded logic, a software implementation can use the exact same
> cascaded logic or cascaded table lookups.

That is of course the way to implement it.

I.e. the word counter I mentioned uses one table to classify pairs of
input chars, combines this 4-bit value with the result from the previous
pair, and then uses another table to look up the corresponding line/word
increments, which get added to the running (block) total.
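A hedged Python reconstruction of that two-table, pair-at-a-time scheme
(the class encoding, separator set, and even-length restriction are my
simplifications, not the actual 16-bit code): one 64K table classifies a
byte pair into a small value, and that value, combined with the previous
pair's value, indexes a second table of word/line increments.

```python
def _cls(b):                      # helper used only to build the tables
    if b == 0x0A:
        return 2                  # LF ends a line
    if b in b' \t\r':
        return 0                  # separators (CRLF folding omitted)
    return 1                      # word character

# One lookup classifies a whole 16-bit pair: value = first*3 + second.
pairclass = [_cls(i >> 8) * 3 + _cls(i & 255) for i in range(65536)]

# inc[prev * 9 + cur]: increments for the current pair, given the
# previous pair's class (only its last byte's class matters).
inc = []
for prev in range(9):
    for cur in range(9):
        pl, f, s = prev % 3, cur // 3, cur % 3
        dw = (f == 1 and pl != 1) + (s == 1 and f != 1)
        dl = (f == 2) + (s == 2)
        inc.append((dw, dl))

def count_pairs(data):            # assumes even length, for brevity
    words = lines = 0
    prev = 0                      # start "outside a word"
    for i in range(0, len(data), 2):
        cur = pairclass[(data[i] << 8) | data[i + 1]]
        dw, dl = inc[prev * 9 + cur]
        words += dw
        lines += dl
        prev = cur
    return words, lines
```

The inner loop is two lookups and two adds per pair of bytes, which is
the shape of the "zero compare/test/branch" loop described above.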