
transputing without transputing


Dave

Jun 10, 2006, 5:01:10 AM
With transputer hardware confined to the annals of history, does the
newsgroup think "Transputer Architectures" designed without transputers
but using Occam, Parallel-C and/or Java will maintain interest in
parallel technology?

JJ's FPGA sounds excellent, but I don't think (correct me if I am wrong)
it is going to be machine code compatible with a T4 or T8....

I suppose what I am asking is, should we:
"reinvent the transputer processor" or should we:
"reimplement the transputer system architecture(s)". The latter on
different processors.

Don't get me wrong, SPOC is excellent, but the real estate is somewhat
excessive. I wonder if "JJ's FPGA" or a microcoded MicroBlaze, Pico or
PPC might provide better performance than SPOC in less space?

Thoughts, ladies and gentlemen....

JJ

Jun 11, 2006, 7:20:40 AM
Dave wrote:
> With transputer hardware confined to the annals of history, does the
> newsgroup think "Transputer Architectures" designed without transputers
> but using Occam, Parallel-C and/or Java will maintain interest in
> parallel technology?
>

I might as well use that to update or clarify my work.

Sometimes implementations should be left behind where they belong, in
the dumpster or a museum if they are lucky, but the best ideas can and
should be freely reused, or, worse, get reinvented and patented by those
who didn't see it already done before. I see in the Opteron server chips
with their multiway HT links, on the surface, a sort of Transputer, not
surprising with so many Inmos alumni at AMD. Below the surface it is
still a pig's mess of an x86 monster that should be .........bin. They
make the same dreadful mistake that Inmos made: Links should be cheap,
but they make them expensive. I can buy an Opteron with 1 or 2 Links,
but what good is that? They have no concept of TRAMs or scalability in
that sense.

The Opterons and also Xeons are still stuck in the Empire mode: the
processor is the Emperor, and the links are for god knows what. The
Transputer model is the exact opposite: processors are cheap,
disposable, and should proliferate throughout a system with the links to
tie it all together. It doesn't matter if half or even most of the
processors are idle, so Amdahl's law is and was always bogus; most x86s
are also 99% idle anyway. The important thing is that distributed
processor cores make system design a lot easier if they have a
concurrent model to work on, and borrowing an ARM or MIPS and trying to
use that in the same way just doesn't work. You would still have to
build a Link in hardware and add the missing stuff in software or
hardware.

The original Transputer hardware was designed at a time when CMOS was
easy to design and one could do basic circuit design with very long
critical paths and it didn't matter. Everybody did cpu design with 70
gate paths back then. Most were matching the cpu clock to the DRAM
clock of the day. The addition of PLLs and caches forced or required
clock paths to get designed back down to 10 gate delays, but over many
design generations.

Tanaka did do an FPGA version of the T400 IIRC and it ran at 25MHz,
about the same as the original, ie 20 years and no net improvement in
performance, and at great cost in FPGA resources too.

Modern RISC design recognizes that extreme performance comes from
designing to the clock, and that means around 10-15 fast gates per
clock, or in FPGA 3 LUTs per clock, ie matching the BlockRam cycle
time, which can go to 300MHz. The original design could never hope to
be brought up to par, although the San Diego ST people did a respin on
it not that long ago; I only saw the EET story, 2003 IIRC.

> JJ's FPGA sounds excellent, but I don't think (correct me if I am wrong)
> it is going to be machine code compatible with a T4 or T8....
>

Absolutely it should not be code compatible. The requirement in my mind
is for the cpu design to start with a clean slate and partition seq
grunt work onto general purpose "free or very low cost" PEs that can
really use any register to register instruction set, and to put ALL the
things that define a Transputer into the MMU, ie Par support,
scheduler, links etc, and to tie that into an object memory model with
protection per object. The R16 PEs I have use about half the hardware
of the MicroBlaze for the same grunt throughput because they are
designed for instruction throughput with memory latency hiding,
assuming the MMU will interleave all concurrent memory requests of a
dozen or so PEs per Transputer node. This only works though when RLDRAM
or SRAM is used; regular SDRAM would have 10-20x less throughput, so
only 1 PE per MMU.
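A rough back-of-envelope sketch in plain C of why the latency hiding works; the function and numbers are mine for illustration, not a spec. A PE round-robins its hardware threads one instruction per cycle, so a memory access taking L cycles is invisible as long as the thread count T satisfies T >= L: the issuing thread is not revisited before its data arrives.

```c
/* Toy model of thread-interleaved latency hiding (names and model are
   illustrative only).  Returns the percentage of PE cycles that can
   issue an instruction, given T hardware threads and a memory latency
   of L PE cycles. */
static int pe_issue_pct(int threads, int mem_latency_cycles) {
    if (threads >= mem_latency_cycles)
        return 100;                              /* latency fully hidden */
    return 100 * threads / mem_latency_cycles;   /* PE stalls the rest */
}
```

With 4 threads against a 4-cycle RLDRAM-class latency the PE never stalls; against a 40-cycle SDRAM-class latency the same PE would issue only 10% of the time, which is the "1 PE per MMU" point above.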

> I suppose what I am asking is, should we:
> "reinvent the transputer processor" or should we:
> "reimplement the transputer system architecture(s)". The latter on
> different processors.
>

If "we" reimplement on different off the shelf processors, we really
don't solve anything IMHO, we just get to see what a mess they all
are. Since you must add new link hardware, you might as well add the
MMU you really want and a simpler cpu to go with it. Since the par
compilers are also non standard, there is nothing saved by using off
the shelf seq compilers.

The problem with fine grained concurrency is that it tortures the cache
model used by all current cpus, which were designed strictly for single
threaded models with high data locality and very coarse grained ms
level task swapping. There is nothing in the MIPS, SPARC, or x86
architecture that screams out, I want to be a Transputer when I grow
up.

Indeed the very essence of Transputing is orthogonal to current
architectures. If the MMU can be implemented, any cheap version of an
instruction set design can be put to good work, but the only practical
way to do anything today is with FPGAs.

> Don't get me wrong SPOC is excellent, but the real estate is somewhat
> excessive. I wonder if "JJ's FPGA" or a microcoded MicroBlaze, Pico or
> PPC might provide better performance than SPOC in less space?
>
> Thoughts ladies and gentlemen....

I'm not familiar with SPOC's real estate; I thought it was a software
implementation.

MicroBlaze is just another MIPS sort of design targeted to FPGA,
although you can get it now. It could perhaps be used as a PE, but it
has the same problem as all single threaded designs: cache misses cost
lots of cycles and it isn't designed to hide latency. The Pico would be
barely useable in some very low end FSM controller apps, not worth the
distraction.

A modern Transputer will also reconsider the process model and what
that means: is it just occam, or some other modern C+CSP with a
different name? In my mind processes model hardware concurrency, which
means my view of concurrency includes a small chunk of HDL languages
such as Verilog. That means one could design processes as code and
compile and execute them on a modern Transputer, OR synthesize
processes as hardware and put them into the same FPGA, OR any
statically fixed or dynamically variable combination of both, ie
reconfigurable computing.
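To make the process model concrete, here is a minimal CSP-flavoured sketch in plain C: two "processes" advanced cooperatively over a one-slot rendezvous channel, the way a scheduler would interleave them. In occam this would just be a PAR of producer and consumer sharing a channel; all the names (`chan_put`, `chan_get`, `par_sum`) are mine for illustration.

```c
typedef struct { int full; int val; } chan;   /* one-slot rendezvous channel */

static int chan_put(chan *c, int v) {         /* returns 1 if the output fired */
    if (c->full) return 0;
    c->val = v; c->full = 1; return 1;
}

static int chan_get(chan *c, int *v) {        /* returns 1 if the input fired */
    if (!c->full) return 0;
    *v = c->val; c->full = 0; return 1;
}

/* "PAR" of a producer sending 0..n-1 and a consumer summing them,
   stepped alternately like two scheduled processes. */
static int par_sum(int n) {
    chan c = {0, 0};
    int next = 0, sum = 0, got = 0, v;
    while (got < n) {
        if (next < n && chan_put(&c, next)) next++;   /* producer step */
        if (chan_get(&c, &v)) { sum += v; got++; }    /* consumer step */
    }
    return sum;
}
```

The same pair of processes could equally well be synthesized as two hardware blocks with a handshaking wire where the channel is, which is the point about Verilog above.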


I just recently bought an Intel DualCore D805 with a very nasty
Intel/ATI chipset mobo, a very big mistake. Each core appears to be 15%
faster than my aging Athlon XP2400s; the clocks are 2.66GHz v 2GHz
resp. So a 33% clock improvement gave me only 15% actual improvement.
The seq benchmark is my GUI desktop code running the redraw 1000 times
on BeOS. The real shame is that so far not a single OS has installed on
it properly: Win2000 Pro, Win2000 Server sp3, WinXP, various Linuxes
are all dead or barely complete the install then BSOD. BeOS on the
other hand flies through the install in minutes and completes without
issues, except that it doesn't have drivers for any of the Video,
Sound, Networking, USB, IDE, SATA, ie basically nothing works except
video in emulation mode. The board has 2 PCI slots to resolve that
issue. I am back to 1MB/sec disk transfers.

This is not the multiprocessing experience I wanted; it is a nightmare
I never thought possible. You can see why I work on this Transputer
project: the current computing environment is such a mess. When I
visualize how a PC could be built, it is so easy to see.

Build TRAM modules in a fixed CC size with space for one larger, faster,
expensive or smaller, slower, cheaper FPGA and a few DRAMs or SRAM, and
a few chips for some other purpose, or at least bring the FPGA pins to
a daughter board. Each board could use all unused pins for links.

I could easily see a cheaper TRAM version having SDRAM with a Spartan3
based single R16+MMU hosting the human devices: KBs, mouses, low speed
USB, networking etc. This board can be bought from various vendors
already, but usually far bigger than CC format.

A much more expensive Virtex 2 or 4 device using 10 or so PEs with
RLDRAM would have a UXGA video interface with enough grunt power to do
good graphics in software, no nVidia or ATI or DRM anywhere. This is
much higher end engineering; each 10 PE+MMU node gives about 1000mips,
and can be replicated a few times in the bigger FPGAs, limited mostly
by the RLDRAM interface ports as well as heat.

One could gang the video TRAMs to get more monitor heads or to boost
per head performance SLI style. If I want multiple connected users, put
one low end TRAM inside each KB and link up to a mainframe of TRAMs. I
expect the links will only run near 150Mb/s or so, but they can be
bonded for more bandwidth using SpaceWire or built in FPGA LVDS links.

You get the idea. The OS obviously uses the scheduler and memory
management built into the MMUs. The desktop will look very much like
the BeOS Tracker but skinnable to other styles. The entire file system
directory is always live in memory for instant searching. That's what I
am working on for now, so that I get a feel for what really needs to be
in the MMU later on. Remember Inmos added Blit and graphics codes to
the T800; I get to develop the OS and application code before deciding
if some leaf functions should be in hardware.

When you have to deal with 40 or more threads per Transputer node, you
get serious about keeping them all busy, maybe not all the time, but
when a redraw is needed: tile the video screen redraw across threads.
The MMU makes it possible for larger objects such as a Filesystem or
application database to be visible to all PE graphic threads but
generally controlled by a seq parent thread/process.
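Tiling a redraw across threads comes down to one even partition of the screen rows; a small C sketch of how I'd split them (the helper name and the even-split policy are mine, not fixed by the design):

```c
/* Split `rows` screen rows as evenly as possible across `nthreads`
   redraw threads; thread t redraws the half-open row range [*lo, *hi).
   The first `rows % nthreads` threads each take one extra row. */
static void tile_bounds(int rows, int nthreads, int t, int *lo, int *hi) {
    int base = rows / nthreads;
    int rem  = rows % nthreads;
    *lo = t * base + (t < rem ? t : rem);
    *hi = *lo + base + (t < rem ? 1 : 0);
}
```

For a 1600x1200 screen and 40 threads each thread gets a 30-row band; every row lands in exactly one band, so the threads never write the same pixels.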


The really big current idea is that in single threaded processor
design, which really covers most everything today, the Memory Wall is
the killer of performance and is also the justification for all the
complexity in design. These designs absolutely have no choice but to
issue as many instructions as possible per clock, getting ever more
complicated and hotter. Every full cache miss turns into maybe 1000
dead cycles. More and more of the chip is just L2 SRAM or L3 DRAM
cache, with very little actually used for the cpu today. Look at some
recent Opteron die pics. This is absolutely what you don't want in a
Transputer.

The really big contra idea is the Thread Wall: design lots of low cost
PEs to run slow enough, with latency hiding, so that with a high issue
rate RLDRAM and 10-20 PEs each with 4 threads, all can appear to have
no significant Memory Wall at all. All opcodes run in more or less
fixed times as long as memory is not fully loaded. This is what you do
want in a Transputer. Ironically, if the world ever gets serious about
massive concurrency, this sort of multithreading design gets rid of
lots of irrelevant hardware.


John Jakson
transputer guy
R16 paper at wotug.org

Derek Simmons

Jun 12, 2006, 10:18:28 PM
You mentioned using your processor in a multi-processor application for
a graphics display and SLI (scan line interleaving). An alternative to
using SLI is using patches. SLI is a cheap and easy way to use existing
hardware to do parallel rendering.

I wrote a small, crude raytracing program for a transputer network
using 16 x 16 patches. The completed patches would be sent to a
destination TRAM, in the end the display TRAM, as a 1024 byte packet.
The first 256 bytes were packet header information, position, and
orientation information. Patches allowed me to subdivide the scene
space into smaller subspaces that each transputer could work on
independently without having the whole scene loaded on each processor.

I have been trying to apply these ideas to my FPGA graphics project but
it keeps turning into a bandwidth nightmare. I have a large number of
processing elements that are simple rasterizers trying to
simultaneously access the display frame buffer. I think I have it
almost solved, I'm just looking for a weekend I can apply myself
without any distractions.

If you are seriously considering implementing some primitive functions
in the processor, you might want to consider using patches as the basic
unit of raw image information. It could lend itself to texture mapping
and modern GPU functions. Your processor could become the PE of a new
generation of graphics supercomputers.


JJ

Jun 13, 2006, 11:07:06 AM

Derek Simmons wrote:
> You mentioned using your processor in a multi-processor application for
> a graphics display and SLI (scan line interleaving). An alternative to
> using SLI is using patches. SLI is a cheap and easy way to use existing
> hardware to do parallel rendering.
>

I mention SLI only because it is the new trendy nVidious thing the
gamers all want.

Of course with multiple processors and potentially multiple monitor
heads, all at high rez, say 2K by 1.5K, the database, tile partitioning
and bandwidth management are the key. The tile sizes need not be fixed
in any particular way, but will usually be dynamically partitioned on
the fly.

> I wrote a small, crude raytracing program for a transputer network
> using 16 x 16 patches. The completed patches would be sent to a
> destination TRAM, in the end the display TRAM, as a 1024 byte packet.
> The first 256 bytes were packet header information, position, and
> orientation information. Patches allowed me to subdivide the scene
> space into smaller subspaces that each transputer could work on
> independently without having the whole scene loaded on each processor.
>

With modern RLDRAM for the primary Transputer core there is enough
bandwidth and space to hold the database near all available PEs. If
adjacent Transputers had to pull anything through links, they would be
at a disadvantage but could still be useful to add more tiles. The
RLDRAM is key here since it gives me (on paper) one full random access
per clock at 400MHz, limited by interleaved banks; in practice, with
the Virtex FPGA limit and the MMU, that will be halved. Still, there
are also block burst rates actually better than DDR ram, which I also
expect to use 8 words at a go to do register refills. Of course I won't
have a RLDRAM+Ramdac TRAM for some time since that is hard engineering.
I may just proto with SDRAM and a simple low rez VGA dac off the shelf
on a Spartan; it will be 10x slower though.

> I have been trying to apply these ideas to my FPGA graphics project but
> it keeps turning into a bandwidth nightmare. I have a large number of
> processing elements that are simple rasterizers trying to
> simultaneously access the display frame buffer. I think I have it
> almost solved, I'm just looking for a weekend I can apply myself
> without any distractions.

Your HW PEs would be equivalent to my PEs; mine only execute usual RISC
type codes to render tiles. Since each Transputer could have 10 or so
PEs, each with 4 threads, that's 40 or so tile engine threads with
relatively little communication overhead other than the initial setup &
partitioning. There is no reason why the PEs couldn't be enhanced with
special codes as needed for some coprocessor, but that would likely
slow them down in clk freq while doing more work. Or the copro could be
in the MMU, I haven't decided yet.

My tiles are 256 pixels wide by n rows, where screen area/40/256 gives
approx n. The 256 comes from the clipping scheme I use, which is based
on 8x8 clip mask tiles. A render tile would then use 32 clip bits wide
(a mask word) by n/8. This is implemented in software in a GUI desktop
kit looking like the BeOS Tracker. Right now, on all mouse events the
clip region is extended by mouse actions in low rez, and periodically
another BeOS thread pulses and causes the Draw method to redraw
wherever the clip bits are set and then clears them back. Right now the
Draw method simply walks the entire database tree and renders at the
leaf level wherever clip bits are on.

By doing this clipping at 1/8 pixel resolution, each area of screen
that is covered by a clip bit does 64 pixel writes per clip bit test.
This of course means windows slightly draw over each other, so I draw
them in back to front order so that the overdrawing is minimized to
edges. For each window, a local clip region (or bit mask) is computed
from the window list in front, ie the topmost window subtracts its area
backwards. For many overlapping windows, the total amount of work done
approaches a simple blit over the entire screen with a rough penalty of
perhaps 10%-20% or so overdrawing. This is much simpler and much more
efficient than the old clip region that Apple used in MacOS based on
real region math. In the early days, with no window content, I had 250K
randomly placed windows to prove this clipping out; it took about 1
minute to draw those. Note all the database structures are using the
software model of the MMU that will later be turned into FPGA,
essentially a hash table of millions of 8 word lines.
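The clip-mask arithmetic above can be sketched in a few lines of C: one 32-bit mask word spans a 256-pixel-wide tile row at 8x8 pixels per bit, so each set bit authorises 64 pixel writes. The function name is mine; this is just the bookkeeping, not the actual Blit code.

```c
#include <stdint.h>

/* Count how many pixel writes one 32-bit clip mask word authorises:
   each set bit guards an 8x8 pixel block, ie 64 pixels, and one word
   covers a full 256-pixel-wide tile band 8 rows tall. */
static int pixels_per_clip_word(uint32_t word) {
    int pixels = 0;
    while (word) {
        word &= word - 1;   /* clear the lowest set bit */
        pixels += 64;       /* one 8x8 block drawn for that bit */
    }
    return pixels;
}
```

A fully set word authorises 32 x 64 = 2048 pixels, which is why testing one clip bit per 64 pixels is so much cheaper than the naive per-pixel test against the window list.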

Later I will change the Draw method to find the extent of the "show"
area to be redrawn, partition it, and loop through tiled sub Draws
based on 256b wide pixel blocks. If the screen update is small, then
only a few tiles need to be allocated. In occam those tiles would all
be in a par block, but in the current C scheme I would loop through
them, which will add a little more inefficiency. Still, it lets me
measure how much work is done, x86 v proposed Transputer model.


>
> If you are seriously considering implementing some primitive functions
> into processor you might want to consider using patches as basic unit
> of raw image information. It could lend itself to texture mapping and
> modern GPU functions. You processor could become the PE of a new
> generation of graphics supercomputers.

I certainly hope so. I don't use any x86 asm, just plain C code, so the
renderer Blit engine has to be extremely efficient at clipping, ie
saving time. The use of 8x8 clip mask bits has enormously speeded up
redraws compared with when I started, with a naive test of every pixel
against the window list.

What could end up in HW is really the innermost level Blit mapped into
the MMU DMA engine.

Right now this entire software package looks like an Image Thumbs
browser that loads up to a few 100 medium pics, as post IDCT memory
allows, so I see everything as bitmap blits. Later I will be adding
more Draw method types for text, general graphics etc, so I will see
how the Blit package needs to be extended. I am aware that in MacOS
there were just a very few innermost level clippable render routines,
so I don't expect to do polygons or beziers in HW.

If you read my paper, you will see that I believe the PEs are basically
free; the MMU and the RLDRAM I/O are where the cost is, since you can
only have so much I/O bandwidth on an FPGA. Putting graphics code into
PE engines may not be as smart as just adding more PEs to the MMU, or
maybe even allowing PEs to have local SRAM blocks to render their
tiles.

So what is your real graphics interest?

JJ

Jun 13, 2006, 11:12:46 AM

August West wrote:

> "Derek Simmons" <dere...@gmail.com> writes:
>
> > If you are seriously considering implementing some primitive functions
> > into processor you might want to consider using patches as basic unit
> > of raw image information. It could lend itself to texture mapping and
> > modern GPU functions. You processor could become the PE of a new
> > generation of graphics supercomputers.
>
> Interestingly, Aspex Semiconductor produce a massively-parallel
> computation engine, and their preferred method for programming is the
> patch.
>
> It's not particularly transputer-like, though.
>
> (http://www.aspex-semi.com/pages/technology/technology_asprocore.shtml)
> --
> and I will never grow so old again

It's taken 20 years for Moore's law to catch up with cpus, but it
finally seems everyone agrees quantity is far more efficient than
frequency. These guys are on the same page in many respects.

All of a sudden tiling cpus doesn't seem so difficult, except for the
age old problem of managing memory bandwidth between them.

John

Derek Simmons

Jun 23, 2006, 4:02:49 PM

> >
> > Interestingly, Aspex Semiconductor produce a massively-parallel
> > computation engine, and their preferred method for programming is the
> > patch.
> >
> > It's not particularly transputer-like, though.
> >
> > (http://www.aspex-semi.com/pages/technology/technology_asprocore.shtml)

I did go out and take a look at this... a long look. It seems to be
more like a Thinking Machines Connection Machine 2a on a chip than a
transputer. I wonder if anybody ever told Danny Hillis. I wonder if you
could program it using CmLisp?

Derek
