Status of FPGA Transputer

DerekS...@frontiernet.net

Feb 20, 2005, 1:38:29 AM

From time to time I remember people mentioning that they were working
on an FPGA implementation of a transputer. Looking back at older posts
and the links on Ram's web page I find that most of those web pages
have expired.

I was wondering if anybody was still working on this or even
interested.

Thanks,
Derek

rmee...@olf.com

Feb 20, 2005, 7:19:12 AM

I haven't had the time to update my website (plus we are currently in
the process of moving servers) but the following links point to
a WOTUG conference paper on an actually implemented T425 FPGA
transputer:

http://tmubdell.phys.metro-u.ac.jp/
yutaka/presentation/2004Design_Conference.pdf
http://www.wotug.org/cpa2004/papers/361-tanaka.pdf

I tried to contact the author, but as of yet I still haven't gotten a
response :-(

Also, as a general note, you should always try www.archive.org if a
link has expired, as it will usually be cached there. Comes in
very handy...

Cheers,

Ram

JJ

Feb 22, 2005, 6:04:08 AM

Hi Ram, Derek, others

The Tanaka paper looks quite interesting but for only 24MHz perf is a
little disappointing and there is the issue of ST's IP rights etc.
The real point is that it is just a beginning towards a future clean
sheet design of a Transputer inspired, Occam capable CPU. To make a
dent, some of the last 15yrs of x86 improvement must be made up in any
FPGA Transputer, not just a 1 to 1 mapping. I have said before that
FPGAs don't give good performance when used for cloning old designs;
there are low perf 68K, Z8000, x86 cores out there too.

I am getting much closer now to some sort of announcement, delivery, so
might as well say something now.

This project has lasted almost 4yrs now and I have worked on 4
compilers & 3 CPU archs alternating every few months to move forward on
SW-HW.

The 1st arch I called T2 was ASIC/VLSI minded (after all that's what
I have done since my Inmos days) and was not suitable for FPGA. It was
also a mem-reg type design and would have become endlessly complex and
also slow-big in FPGA. When dumped into WebPack, xxxx hit the fan, end
of story.

The 2nd arch I called R3 and was directly designed for FPGA esp.
Virtex, I also commented on it in this and comp.arch, and .fpga NGs
from time to time. By the time it could execute small programs in the
cycle C model and was implemented in WebPack, the performance started
to drop from the initial 300MHz datapath synth results to the final
70-90MHz P/R post result. The datapaths were all fine, the control
logic got too long even if only about 10 levels of arithmetic-ctl
logic. This would be no problemo in ASIC.

Lesson learned, FPGAs are really good at datapaths hence their DSP
success at near 300MHz but control logic is an order of magnitude
behind relative to what an ASIC mix of datapath+control would be.
It's easy for ASIC or full custom VLSI to do control logic as fast as
math as all tricks are available esp. transfer gates (see P4-Athlon).
FPGAs only have built in tricks for adds, muls, mems, PLLs etc. For an
FPGA CPU to be fast, it must be truly RISC!

R16, the current design, is a step back with some big simplifications
but is in fact mostly the same user ISA as R3. The opcodes look almost
the same, but the implementation is different.

R3 & R16 are both multi threaded 4 way barrel engines using classic
ld/st large register sets. R3 (after my 3 girls) was much more
ambitious about the threading, it actually kept 16 threads in the air,
4 were in the datapath, 4 were in the instruction fetch, and 8 or so
were in the queue. R3 would cycle through the 4 threads in the datapath
taking from the head of an instruction FIFO. As exceptions arose such as
bra & mem traffic, threads would be swapped. The FIFO tail was
continuously refilled with 4 other threads. The problem was that the
control logic had to keep the 4 threads at both ends unique and that
meant too much logic depth. Also R3 was ambitious about grouping 1 or 2
(or even more) bra ops with non-bra ops so the IPC was supposed to be
1.3 with high bra frequency code. On the simulation though this would
drop back to IPC 1 because of the wait states inserted before threads
could be swapped.

R3 was also heading way past 1000FFs before any cache was added in, and
it was starting to dawn on me that the datapath, which was also a CSA
design, was producing fewer results for the total amount of HW stacking
up. A DSP designer would be looking at a design and expect all
datapaths to be working usefully most of the time, a CPU designer
expects a much lower utilization of resources.

R16 has been much more successful at keeping FPGA speed up and logic-FF
count down because it's a rigid 4way threaded design that is more serial
by design. It's much closer to a DSP engine with few smarts in the
control logic.
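
To make the rigid 4 way barrel concrete, here is a minimal Python sketch. It is not the R16 control logic; it assumes a fixed round robin order with one issue slot per micro cycle, and the names barrel_issue_order and own_slots_spanned are made up for illustration. The second function is an assumed model of why a multi cycle op looks cheaper from inside one thread.

```python
from itertools import cycle, islice


def barrel_issue_order(num_threads=4, micro_cycles=8):
    """Thread id that owns each micro cycle in a fixed round robin barrel."""
    return list(islice(cycle(range(num_threads)), micro_cycles))


def own_slots_spanned(op_micro_cycles, num_threads=4):
    """Issue slots of the owning thread spanned by one op (assumed model).

    A thread only issues every num_threads micro cycles, so an op that
    finishes before the thread's next turn costs that thread one slot.
    """
    return -(-op_micro_cycles // num_threads)  # ceiling division


if __name__ == "__main__":
    print(barrel_issue_order())   # which thread issues in each micro cycle
    print(own_slots_spanned(2))   # e.g. a 2 micro cycle SRAM ld/st
```

Under these assumptions no scheduler state is needed beyond a 2 bit counter, which is in line with the "few smarts in the control logic" point above.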

In both designs, the arch is organized as a PE-MM pair or even multiple
PEs+MM. The number of PEs per MM is determined by MM memory throughput
which varies for SRAM, SDRAM & esp. RLDRAM.

The R16 PE is named so because it deliberately uses 2 clock cycles to
do every micro cycle, so 1 simple 16b datapath suffices to do most 32b
ops, and 1 BlockRam can now have 4 logical ports (RRWI). R3 was much
more expensive since it used a pair of 1 cycle BlockRam to simulate a
3port reg file RRW with opcodes taken from any unused ports i.e.
complex.
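
The idea of doing a 32b op on a 16b datapath in 2 clocks can be sketched as below. This is only an illustration of the principle for an add, assuming the low half goes first and the carry is held over to the second clock; add32_two_pass is a hypothetical name and nothing here is taken from the actual Verilog.

```python
MASK16 = 0xFFFF


def add32_two_pass(x, y):
    """Add two 32b values using a 16b adder twice, low half first."""
    # Clock 1: add the low 16b halves and keep the carry out.
    lo = (x & MASK16) + (y & MASK16)
    carry = lo >> 16
    # Clock 2: add the high 16b halves plus the saved carry.
    hi = ((x >> 16) & MASK16) + ((y >> 16) & MASK16) + carry
    # Reassemble, dropping the carry out of bit 31 (wraps like 32b HW).
    return ((hi & MASK16) << 16) | (lo & MASK16)


if __name__ == "__main__":
    print(hex(add32_two_pass(0x0000FFFF, 0x00000001)))
```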

An R8 design was also contemplated but the datapath to control logic
ratio gives even less performance for slightly less HW. An R32 design
on the same principle would give 64b ops every 2cycles for a fairly
efficient design but adder wise slightly slower.

R16 also does no grouping and lives with a fixed IPC closer to 1 micro
cycle (or 0.5 if real clocks counted).

The results so far for R16 are around 320MHz in V2Pro-7 and maybe
200MHz in SP3-5 after P/R, but that was 2months ago. That would be
about 150-100 simple Mips or so. There is more new Verilog code to be
verified for the instruction cache and memory manager but I don't
expect any problems there.
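
The Mips figures follow from the clock rate and the 2 clocks per micro cycle. A rough sketch of that arithmetic, assuming exactly 1 instruction per micro cycle (the text above only says "closer to 1"); simple_mips is a hypothetical helper, not part of any tool mentioned here.

```python
def simple_mips(clock_mhz, clocks_per_micro_cycle=2, ipc_per_micro_cycle=1.0):
    """Upper bound on simple Mips for a given clock rate in MHz."""
    return clock_mhz * ipc_per_micro_cycle / clocks_per_micro_cycle


if __name__ == "__main__":
    for name, mhz in (("V2Pro-7", 320), ("SP3-5", 200)):
        print(name, simple_mips(mhz))
```

With these assumptions the 320MHz and 200MHz results give ceilings of 160 and 100 Mips, so the "150-100 or so" estimate sits at or just under the ceiling.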

I now have an SP3 board from Xilinx (the $100 special) that is waiting
to be uploaded. It has the SP3-200? and 1Mbyte of 10ns SRAM, but no
SDRAM etc.

R16 was last seen at about 400FFs and heading up to 500FFs or so. The
layout is hand placed as a 16b stripe that extends out from the
BlockRam, 16 rows by about 30 cols. If all goes well, the stripes can
just be replicated across the FPGA but the shared MM will no doubt mess
up the floor plan. I am just guessing 1 CPU will draw about 1W, so a
couple of them may warm up even a small FPGA. I originally hoped to
count up to 100 PEs in a huge FPGA, but it is more practical to put a
more modest number into the middle size parts; that gets more IO pins
and more BlockRams and distributes the heat & power over a board, all
for less system $.

I have tested the cycle model (which is pseudo Verilog) with 1B cycles
of randomly generated branchy code to check out the cache & branch
logic, which is where almost all the complexity lies, and it now looks
good. I have to actualize the memory interface design for SRAM, SDRAM,
and one day RLDRAM, which is where the real performance/effort lies.

Basic math ops take 1 micro cycle, bcc (M68K style) take 1 if pass, 2
if taken near and more if outside I cache.

Ld/St take 2 micro cycles for SRAM, but DRAM/RLDRAM will be a few more.

For user written asm code, I am currently implementing a cheapo VC6++
preprocessor driven assembler that will also be the backend for the C
compiler that's been on hold while R16 got done. This assembler is
built into the cycle simulator and will likely pick up the last
compiler sources later. In an odd way the final package may look like a
compiler + runtime where the runtime is an RTL CPU simulation of about
1Mips/PC GHz. A much faster rule driven ISA simulator could be done
that would be closer to x86/10-20 or so but that's for later, and
more for validation.

On the SW side, there have been 4 compilers; the 1st was a Verilog to C
cycle compiler used in my previous job. The 2nd compiler used some of
that work with C Occam Verilog support combined but did not reach a
stable tree stage; it also had C exprn problems which conflict with
Verilog. Work then moved to T2. The 3rd compiler was largely based on
the Lcc compiler and made much more progress but T2 was dead. Time pt
Jan 2004. Work then started on R3. When R3 started to look heavy in Aug
04, I went back to the 4th compiler, still Lcc influenced but stripped
down a bit (temp lost its preprocessor) and it produces RPN, no native
code emit yet. Since the HW was still not there, Nov 04 I went back and
started R16 out of the R3 best parts.

Since R16 is now getting an assembler that supports some HLL blocks
etc, it will meet back up with the compiler but 1st I must bring up R16
on a board with some asm test programs. When that's done I will
probably release this under a dual GPL/commercial Trolltech like
license model. If it gets used for $ use, I'd like some return on the
effort but license fees will be in same ballpark as other FPGA CPUs and
far less than ARM. Non $ use would use the GPL license.

Of course it won't be so great at FP either until an FPU design is
added later but there are a few commercial FPU cores out there now. The
ISA defines & implements 2/3 of the IS space so there are lots of
unused slots for new user ops. Like the original, it uses prefixes; all
codes are 16b long, but with up to 3 prefixes an op can be up to 64b
long.

The 1st prefix is free in time but 1 extra word. The 2nd & 3rd prefix
adds an extra micro cycle.

There are only 2 formats, RRR & RL.
The RRR format is something like add Rz = Rx + Ry where z,x,y are 3b
fields. A 16b opcode can define 128 of these sorts of codes but only 48
or so are defined now. Each prefix adds 3 more bits to z,x,y. A 1st
prefix takes the reg space up to 64 regs, which can still fit into the
BlockRam regfile. Registers after that would be mapped onto memory
space in the current workspace and involve hidden ld/st operations so
now it looks like a mem-mem ISA again. Add R400 = R500+R501 might well
take 1+2+0+2 micro cycles but remember the latency of memory is
effectively divided by the 4 way threading.
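
How the prefixes widen the RRR register fields can be tabulated with a short sketch. It assumes only what is stated above (3b base fields, 3 more bits per prefix, 16b per code word, 64 regs in the BlockRam regfile); rrr_shape and register_home are hypothetical names, and the actual bit layout of the opcode is not modelled.

```python
BASE_REG_BITS = 3     # z, x, y fields in a bare 16b RRR opcode
BITS_PER_PREFIX = 3   # each prefix widens every register field by 3b
WORD_BITS = 16        # every code word, prefix or opcode, is 16b
BLOCKRAM_REGS = 64    # registers that still fit in the BlockRam regfile


def rrr_shape(prefixes):
    """Instruction length and register reach for 0..3 prefixes."""
    if not 0 <= prefixes <= 3:
        raise ValueError("R16 ops carry at most 3 prefixes")
    field_bits = BASE_REG_BITS + BITS_PER_PREFIX * prefixes
    return {
        "length_bits": WORD_BITS * (1 + prefixes),
        "field_bits": field_bits,
        "registers": 1 << field_bits,
    }


def register_home(index):
    """Where a register index lives: regfile or workspace memory."""
    return "blockram" if index < BLOCKRAM_REGS else "workspace memory"


if __name__ == "__main__":
    for p in range(4):
        print(p, rrr_shape(p))
    print(register_home(400))
```

On these numbers R400 and R500 need 9b fields, so 2 prefixes, and both land in workspace memory, which is why that add turns into hidden ld/st traffic.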

The RL format is something like adi Rz += L where z is the same 3b but L
is an 8b signed number. Now a prefix adds 3b to z and 8b to L. The RL form is
mostly used for ld, add, cmp and bcc.
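
The literal reach of the RL format can be worked out the same way. This sketch assumes the extended literal stays a two's complement signed value as it grows by 8b per prefix, which the text implies but does not spell out; rl_literal_range and prefixes_needed are hypothetical helpers.

```python
BASE_LITERAL_BITS = 8      # L field in a bare 16b RL opcode
LITERAL_BITS_PER_PREFIX = 8
MAX_PREFIXES = 3


def rl_literal_range(prefixes):
    """(min, max) signed literal representable with the given prefixes."""
    bits = BASE_LITERAL_BITS + LITERAL_BITS_PER_PREFIX * prefixes
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def prefixes_needed(value):
    """Fewest prefixes whose signed literal range contains value."""
    for prefixes in range(MAX_PREFIXES + 1):
        low, high = rl_literal_range(prefixes)
        if low <= value <= high:
            return prefixes
    raise ValueError("literal does not fit in 32b")


if __name__ == "__main__":
    for value in (100, 200, -129, 0x12345678):
        print(value, prefixes_needed(value))
```

So a small loop counter or branch offset fits in a bare 16b op, while a full 32b literal needs all 3 prefixes and a 64b instruction.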

Like many RISCs, R0 is special and usually == 0. I add 1 twist, a
sticky bit on R0. R0 can be written & read, but after a read the next
read gives 0 again. So R0 can be used as a 1 shot temp but otherwise
returns 0.
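
The R0 behaviour can be modelled in a few lines. This is a behavioural sketch only, assuming a single pending flag that is set by a write to R0 and cleared by the next read; the RegFile class and its 8 register size are made up for illustration and say nothing about how the BlockRam regfile implements it.

```python
class RegFile:
    """Toy register file where R0 is a 1 shot temp that otherwise reads 0."""

    def __init__(self, size=8):
        self.regs = [0] * size
        self.r0_pending = False  # the sticky bit: set by a write to R0

    def write(self, index, value):
        self.regs[index] = value & 0xFFFFFFFF
        if index == 0:
            self.r0_pending = True

    def read(self, index):
        if index != 0:
            return self.regs[index]
        if self.r0_pending:
            self.r0_pending = False  # consumed: later reads give 0 again
            return self.regs[0]
        return 0


if __name__ == "__main__":
    rf = RegFile()
    rf.write(1, 5)
    rf.write(0, 0x12345678)                # ldi R0 = 0x12345678
    rf.write(2, rf.read(0) + rf.read(1))   # add R2 = R0 + R1
    print(hex(rf.read(2)), rf.read(0))
```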

Now serial grouping allows for ldi R0 = 0x12345678; followed by add Rz
= R0 + Ry to appear as 1 combined op, without having to provide any
means of putting literals into an R src. In R16 this would be 3 micro
cycles, but it could be reduced to 1 with a more complex grouper.
Perhaps bcc could be grouped also.

Since R16 should be able to run well in FPGA (being more DSP like), it
should really rock in ASIC as its cycle limit is mainly how fast a dual
port 2k SRAM can cycle and from several ASIC foundries I can see 1GHz
as doable (but 500Mips from 2clocks). If a pair of PEs is put back to
back out of phase, the CPU now looks like 8 threads every 8 cycles and
closer to the original desired target performance & cost.

I hope this interests a few folks

Regards

johnjakson somewhere at usa dotty com

Jack

Feb 27, 2005, 7:46:46 AM

I also implemented a simple multiprocessor in a Xilinx FPGA. It looks
like a transputer in that it uses pure message passing communication
with no shared memory (if that's correct).

In my HW/SW platform, the problem is the programming method, which
requires explicit synchronization by the programmer. If someone is (or
has been) working on this and has alleviated this problem, please let us
know, for example with a parallel Occam compiler or anything similar.

Thank you

JJ

Feb 27, 2005, 8:56:38 AM

Good to hear about other message passing multiprocessors in
development or done.

What stage are you at: functioning CPU HW, compilers etc?
Do you have public docs and what are your plans for it?
What performance did you get in Xilinx etc etc?

Not clear what your question is. I assume you know Occam/CSP and are
familiar with KROC etc. Do you need compiler support from others for
par Occam or do you roll your own? For myself I roll my own compiler
because I want a subset of Verilog combined with Occam in the C syntax,
not something anybody else would likely do.

OK, from a Google search I see what you are up to with MB and OPBs etc.
MB wasn't public when I started so I chose to do an original design
before MB had a chance to interest me. Still, it would be interesting if
UKC (KROC) noticed MB/NiosII instead of concentrating on hard CPUs for
the embedded space where Occam can be very useful; that would help you
enormously to run Occam. We'll see.

regards

johnhjakson at usa dot com

C Gayle

Feb 27, 2005, 8:21:58 PM

"JJ" <johnj...@yahoo.com> wrote in message
news:1109070248....@o13g2000cwo.googlegroups.com...

> Hi Ram, Derek, others
>
> The Tanaka paper looks quite interesting but for only 24Mhz perf is a
> little disappointing and there is the issue of ST's IP rights etc.

Some time back I asked this newsgroup about ST's IP rights to the
Transputer. Most replies favoured avoiding replicating the technology
because ST would definitely pursue anyone who recreated it.

However a UK legal friend of mine said that ST would find enforcement of the
original UK Transputer patents very difficult as:

"the spirit of a patent allows the inventor/patent holder sole idea
ownership for 15-25years. At which point others may legally replicate,
experiment, develop, manufacture and profit from the idea."

As the original transputer patents date back to 1980-84 and the patent
holder no longer experiments with, develops or manufactures the original
design, we are quite free to create our own transputers. ST say no.

My friend further said "STmicroelectronics WILL COME AFTER YOU, IF ONLY TO
SCARE YOU AND OTHERS."

Other loopholes ST would use to legally prevent replication of the
Transputer include: Multi-lateral patent enforcement. I.e. UK patents date
from 1980-84, but the US patents cover the 1990s, so no US replication
allowed. More recent copyrights for HDL Transputer cores of their own;
100Mbps Transputer links on the same silicon as i486s etc.
------------------------
IMHO

ST just want to profit from the ideas of Inmos. I say replicate in secret
and dump the design on the net. If it becomes popular and someone begins to
profit, ST will go after them. Any court order to cease will have to be
backed up by ST transputer silicon in production and market.

The ST argument: "It just is not profitable for us to produce Transputers"
should lead an unbiased judge to release the patent. Remember patents are to
allow their holders time to make money, not sit on technology. That's
called a copyright. Kinda like Barnsley and fractal video compression, but I
digress.

Nme. God Bless.

WHILE Intel
  SEQ
    in ? Garbage
    out ! Garbage

Ziggy

Feb 28, 2005, 12:16:54 PM

I also feel that if we are just experimenting, no judge would care. It's
not like we are duplicating a current product for a profit..

Not much different than people recreating a TRS80 Z80 series ... there
is no revenue lost to the parent..

But if people are worried, just use a fake name as you progress..
