
What would you do with a chip that had 10K processors on it?


Walter Banks

unread,
Apr 30, 2015, 8:19:40 AM4/30/15
to
A hypothetical question - well, not so hypothetical.

I have been working on an ISA and tools project for a processor that has
a compiler-directed ISA and data size as part of the compiler
optimization.

It is a truly huge amount of processing power for a single package. Cost
probably in the $20 range by its release.

w..

gareth

unread,
Apr 30, 2015, 9:34:12 AM4/30/15
to
"Walter Banks" <wal...@bytecraft.com> wrote in message
news:55421EE4...@bytecraft.com...
Perchance a descendant of the picoChip range?


Christian Brunschen

unread,
Apr 30, 2015, 10:09:18 AM4/30/15
to
In article <55421EE4...@bytecraft.com>,
I would very much like to hear more about this! Anything you can share
publicly would be greatly appreciated!

>w..

// Christian


Walter Banks

unread,
Apr 30, 2015, 10:51:22 AM4/30/15
to
No, it isn't a descendant of picoChip. The project started as an informal
exchange of ideas that evolved over the last 2 or 3 years. We are at the
point of looking at how a chip that has more processors than the memory
of many of the mainframes I first used can or should be used.

Clearly, taking conventional code and extracting parallelism would use a
few of the processors. This chip's capability begs for a different kind
of programming paradigm.

w..



Walter Banks

unread,
Apr 30, 2015, 11:06:36 AM4/30/15
to
> // Christian

For the purposes of this discussion, a reasonable starting point is 10K
heterogeneous processors, each at about 150 MIPS, with reasonable
communication between them.

As details are made public I will post them or links.

I asked the question primarily because the computers that we all know
and love started out of the only technology that could create practical
functional computers. 65 years later we have mostly spent lifetimes
organizing applications as a series of sequential steps.

Even plug-board analog computers ran their applications with parallel
functions.

w..




Christian Brunschen

unread,
Apr 30, 2015, 12:46:46 PM4/30/15
to
In article <mhtgdp$vgh$1...@speranza.aioe.org>,
Walter Banks <wal...@bytecraft.com> wrote:
>On 30/04/2015 10:08 AM, Christian Brunschen wrote:
>> In article <55421EE4...@bytecraft.com>,
>> Walter Banks <wal...@bytecraft.com> wrote:
>>> A hypothetical question well not so hypothetical.
>>>
>>> I have been working on an ISA and tools project for a processor that has
>>> a compiler directed ISA and data size as part of the compiler
>>> optimization.
>>>
>>> It is a truly huge amount of processing power for a single package. Cost
>>> probably in the $20 range by its release.
>>
>> I would very much like to hear more about this! Anything you can share
>> publicly would be greatly appreciated!
>> // Christian
>
>For the purposes of this discussion a reasonable starting point is 10K
>heterogeneous processors

How heterogeneous? i.e., vastly different capabilities between
different cores, or "just" that each core runs its own program (MIMD
vs SIMD), or something else?

>each at about 150MIPS with reasonable
>communication between them.

Communication how - serial links (like the Inmos Transputers), a
crossbar or similar switch, shared memory, ... ?

Sorry if I'm asking too many questions!

>As details are made public I will post them or links.

I'm looking forward to it!

>I asked the question primarily because the computers that we all know
>and love started out of the only technology that could create practical
>functional computers. 65 years later we have mostly spent lifetimes
>organizing applications as a series of sequential steps.
>
>Even plug-board analog computers ran its applications with parallel
>functions.

Almost every computer you can buy today has at least a small number
of cores - and that's without even considering GPUs, which are also
increasingly used for computation these days. So in some areas, people
are looking more into massively parallel computation, partly because we
can't easily increase the sequential speed of CPUs at this time. (Even a
Raspberry Pi comes with a dual-core ARM CPU these days.)

There are also things like the Parallella board
<http://www.adapteva.com/parallella-board/>, which combines a "main"
ARM CPU with an FPGA and a 16- or 64-core coprocessor. So parallelising
things is very much being looked into at various price/performance
levels - which makes the chip you describe quite interesting!

>w..

// Christian




Whiskers

unread,
Apr 30, 2015, 12:52:21 PM4/30/15
to
Words such as 'parallel', 'cluster', 'neural network', and 'artificial
intelligence', spring to mind.

How much data storage is there on these chips? How quickly can data be
moved on and off the chip? What power and cooling facilities are
required?

--
-- ^^^^^^^^^^
-- Whiskers
-- ~~~~~~~~~~

Kerr Mudd-John

unread,
Apr 30, 2015, 3:01:11 PM4/30/15
to
On Thu, 30 Apr 2015 16:06:23 +0100, Walter Banks <wal...@bytecraft.com>
wrote:

> On 30/04/2015 10:08 AM, Christian Brunschen wrote:
>> In article <55421EE4...@bytecraft.com>,
>> Walter Banks <wal...@bytecraft.com> wrote:
>>> A hypothetical question well not so hypothetical.
>>>
>>> I have been working on an ISA and tools project for a processor that
>>> has
>>> a compiler directed ISA and data size as part of the compiler
>>> optimization.
>>>
>>> It is a truly huge amount of processing power for a single package.
>>> Cost
>>> probably in the $20 range by its release.
>>
>> I would very much like to hear more about this! Anything you can share
>> publicly would be greatly appreciated!
>> // Christian
>
> For the purposes of this discussion a reasonable starting point is 10K
> heterogeneous processors each at about 150MIPS with reasonable
> communication between them.
>
I think the Forth chaps did something like that:
http://www.greenarraychips.com/

> As details are made public I will post them or links.
>
> I asked the question primarily because the computers that we all know
> and love started out of the only technology that could create practical
> functional computers. 65 years later we have mostly spent lifetimes
> organizing applications as a series of sequential steps.
>
Indeed. So I imagine it'll be Bright Young Things you'll want, to invent
new ways.


> Even plug-board analog computers ran its applications with parallel
> functions.
>
> w..
>
>
>
>


--
Bah, and indeed, Humbug

Walter Banks

unread,
Apr 30, 2015, 4:07:30 PM4/30/15
to
On 30/04/2015 12:52 PM, Whiskers wrote:
> On 2015-04-30, Walter Banks <wal...@bytecraft.com> wrote:
>> A hypothetical question well not so hypothetical.
>>
>> I have been working on an ISA and tools project for a processor that
>> has a compiler directed ISA and data size as part of the compiler
>> optimization.
>>
>> It is a truly huge amount of processing power for a single package.
>> Cost
>>
>> probably in the $20 range by its release.
>>
>> w..
>
> Words such as 'parallel', 'cluster', 'neural network', and 'artificial
> intelligence', spring to mind.
>
Along with IEC 61131 and friends, event-driven processing.

> How much data storage is there on these chips?
Overall, on the order of megabytes.

> How quickly can data be moved on and off the chip?

Not completely sure; I have mostly been working on application
implementation optimization issues related to compute-intensive
applications. I will see if I can get a real answer.

w..

Quadibloc

unread,
Apr 30, 2015, 6:01:36 PM4/30/15
to
Trade it for a chip that had only 1,024 processors on it - and some on-chip RAM?

John Savard

Quadibloc

unread,
Apr 30, 2015, 6:03:45 PM4/30/15
to
On Thursday, April 30, 2015 at 9:06:36 AM UTC-6, Walter Banks wrote:

> Even plug-board analog computers ran its applications with parallel
> functions.

True, and the ENIAC was a digital embodiment of that type of design: von Neumann
"spoiled" it, and caused a large decrease in its performance, by working out a
wiring method that made it easier to reprogram, moving it towards the modern
programmable paradigm.

But that's a dataflow-style design, where the parallel unit is much smaller
than a CPU.

John Savard

Mike Spencer

unread,
Apr 30, 2015, 10:55:50 PM4/30/15
to

Walter Banks <wal...@bytecraft.com> writes:

> I asked the question primarily because the computers that we all know
> and love started out of the only technology that could create practical
> functional computers. 65 years later we have mostly spent lifetimes
> organizing applications as a series of sequential steps.
>
> Even plug-board analog computers ran its applications with parallel
> functions.

Wasn't the difficulty with the Connection Machine that nobody could
figure out how to write programs for it that had performance
commensurate with the trouble it took to write them?

I have an idea:

Imbue each of your 10K CPUs/cores with analogues of ego and greed and
establish an accounting medium of exchange between them, thus creating
a model of capitalism. Then, as any Econ 101 student (or professor)
will tell you, The Market will magically ensure the most efficient
distribution of tasks, resources and performance among the CPUs.

The possibility of friction or externalities is left as an exercise for
the reader.

--
Mike Spencer Nova Scotia, Canada

Stan Barr

unread,
May 1, 2015, 3:04:48 AM5/1/15
to
On Thu, 30 Apr 2015 20:01:11 +0100, Kerr Mudd-John <ad...@127.0.0.1> wrote:
> On Thu, 30 Apr 2015 16:06:23 +0100, Walter Banks <wal...@bytecraft.com>
> wrote:
>
>> On 30/04/2015 10:08 AM, Christian Brunschen wrote:
>>> In article <55421EE4...@bytecraft.com>,
>>> Walter Banks <wal...@bytecraft.com> wrote:
>>>> A hypothetical question well not so hypothetical.
>>>>
>>>> I have been working on an ISA and tools project for a processor that
>>>> has
>>>> a compiler directed ISA and data size as part of the compiler
>>>> optimization.
>>>>
>>>> It is a truly huge amount of processing power for a single package.
>>>> Cost
>>>> probably in the $20 range by its release.
>>>
>>> I would very much like to hear more about this! Anything you can share
>>> publicly would be greatly appreciated!
>>> // Christian
>>
>> For the purposes of this discussion a reasonable starting point is 10K
>> heterogeneous processors each at about 150MIPS with reasonable
>> communication between them.
>>
> I think the Forth chaps did something like that:
> http://www.greenarraychips.com/

Yes. There was also some talk about 4-way communications between
chips in an array using optical methods. Can't give you a reference...

--
Stan Barr pla...@bluesomatic.org

jmfbahciv

unread,
May 1, 2015, 9:54:52 AM5/1/15
to
Array within array arithmetic. Chemistry computations wouldn't
need a Cray.

Scheduling of any kind would be very interesting. A conflict
would get flagged as an exception to be handled with extra
code.

Driverless cars could use it to keep aware of other objects
and their vectors on a highway...OH! You could do 3-D+time
for flying cars.

/BAH

Walter Banks

unread,
May 1, 2015, 10:21:06 AM5/1/15
to
On 30/04/2015 10:55 PM, Mike Spencer wrote:
> Walter Banks <wal...@bytecraft.com> writes:
>
>> I asked the question primarily because the computers that we all know
>> and love started out of the only technology that could create practical
>> functional computers. 65 years later we have mostly spent lifetimes
>> organizing applications as a series of sequential steps.
>>
>> Even plug-board analog computers ran its applications with parallel
>> functions.
>
> Wasn't the difficulty with the Connection Machine that nobody could
> figure out how to write programs for it that had performance
> commensurate with the trouble it took to write them?
>
> I have an idea:
>
> Imbue each of your 10K CPUs/cores with analogues of ego and greed and
> establish an accounting medium of exchange between them, thus creating
> a model of capitalism. Then, as any Econ 101 student (or professor)
> will tell you, The Market will magically ensure the most efficient
> distribution of tasks, resources and performance among the CPUs.
>

Competitive real-time resource and processing allocation, or competition
for resources.

How does the brain allocate resources and processing for tasks? There
might be an analog for anonymous computers.

w..





Stan Barr

unread,
May 1, 2015, 10:33:22 AM5/1/15
to
On Thu, 30 Apr 2015 10:51:06 -0400, Walter Banks <wal...@bytecraft.com> wrote:
>
> Clearly taking conventional code and extracting parallism would use a
> few of the processors. This chip's capability begs for a different kind
> of programming paradigm.

I recall some discussion of this sort of thing in Prolog circles.
Multiple trees could be traversed in parallel or something.
Weren't the Japanese working on a parallel Prolog machine?
My memory ain't what it was...

--
Stan Barr pla...@bluesomatic.org

Walter Banks

unread,
May 1, 2015, 11:00:58 AM5/1/15
to
The fundamental problem with that approach is that it is looking for
parallel paths in what is essentially a serial implementation of an
application. In general such paths can be found in array processing and
a half dozen other well-defined structures, most of them I/O related.

To toss a little more gas on the fire: think about how you would evaluate
a benchmark suite with this processor. Take a benchmark suite and compile
all of the programs, say 50 programs. Load them up and run them, and
this is where it gets interesting. They are 50 serial programs, but many
benchmarks have lots of loops, so say the parallel extraction can spread
each one out to run on 10 processors: a total requirement of 500
processors, 5% of the chip. Hit run and they all run in parallel. All of
them would complete within a few ms.

Now comes the hard part. How do you report the results? Run time for each
program? Total run time for the whole benchmark? Scaled for the 5% of
chip usage?

I had this debate a few months ago.

w..

Walter Banks

unread,
May 1, 2015, 11:11:18 AM5/1/15
to
On 01/05/2015 9:54 AM, jmfbahciv wrote:
> Walter Banks wrote:

>> Clearly taking conventional code and extracting parallism would use a
>> few of the processors. This chip's capability begs for a different kind
>> of programming paradigm.
>
> Array within array arithmetic. Chemistry computations wouldn't
> need a Cray.
Weather, chemistry, physics and financial simulation are certainly
applications. Grace Hopper's last project was navigation for cruise
missiles, which used terrain following before GPS. It worked by using
64 M6800s (not 68Ks); each did pattern matching of an image area, and
together they created a total error vector that was fed into the control
system. That was about 40 years ago.

>
> Scheduling of any kind would be very interesting. A conflict
> would get flagged as an exception to be handled with extra
> code.
Hey, boys and girls, you all get along nicely now... Many processors open
up the possibility of independent arbitration.

w..

Ahem A Rivet's Shot

unread,
May 1, 2015, 11:30:02 AM5/1/15
to
On Thu, 30 Apr 2015 10:51:06 -0400
Walter Banks <wal...@bytecraft.com> wrote:

> Clearly taking conventional code and extracting parallism would use a
> few of the processors. This chip's capability begs for a different kind
> of programming paradigm.

Occam ?

--
Steve O'Hara-Smith | Directable Mirror Arrays
C:>WIN | A better way to focus the sun
The computer obeys and wins. | licences available see
You lose and Bill collects. | http://www.sohara.org/

Scott Lurndal

unread,
May 1, 2015, 11:35:40 AM5/1/15
to
Walter Banks <wal...@bytecraft.com> writes:
>On 01/05/2015 10:33 AM, Stan Barr wrote:
>> On Thu, 30 Apr 2015 10:51:06 -0400, Walter Banks <wal...@bytecraft.com> wrote:
>>>
>>> Clearly taking conventional code and extracting parallism would use a
>>> few of the processors. This chip's capability begs for a different kind
>>> of programming paradigm.
>>
>> I recall some discussion of this sort of thing in Prolog circles.
>> Multiple trees could be traversed in parallel or something.
>> Weren't the Japanese working on a parallel Prolog machine?
>> My memory ain't what is was...
>>
>The fundamental problem with that approach is it is fundamentally
>looking for parallel paths in what is essentially an serial
>implementation of an application. In general this can be found in array
>processing sometimes and a half dozen other well defined structures most
>I/O related.
>

Big question about this proposed architecture - are the CPUs coherent with
respect to memory (and with respect to DMA from I/O devices), or is it more
like Tilera, where the non-coherent CPUs use message passing over a mesh
interconnect?

As for benchmarks, everyone seems stuck on specint.

Charlie Gibbs

unread,
May 1, 2015, 12:18:31 PM5/1/15
to
I like it! There'd need to be some sort of mechanism for bribing
the supervisor programs (analogous to government) which oversee
the entire system.

--
/~\ cgi...@kltpzyxm.invalid (Charlie Gibbs)
\ / I'm really at ac.dekanfrus if you read it the right way.
X Top-posted messages will probably be ignored. See RFC1855.
/ \ HTML will DEFINITELY be ignored. Join the ASCII ribbon campaign!

Walter Banks

unread,
May 1, 2015, 2:01:19 PM5/1/15
to
On 01/05/2015 11:13 AM, Ahem A Rivet's Shot wrote:
> On Thu, 30 Apr 2015 10:51:06 -0400
> Walter Banks <wal...@bytecraft.com> wrote:
>
>> Clearly taking conventional code and extracting parallism would use a
>> few of the processors. This chip's capability begs for a different kind
>> of programming paradigm.
>
> Occam ?

SEQ, PAR, ALT - who should be responsible for this, the application or
the compiler tools? I think that compilers can resolve many (most) SEQ
dependencies. Take this modified example from Wikipedia:

  SEQ
    x := x + 1
    y := (x * x) * (B + C)

The (B + C) part of y := can be evaluated in parallel with x := x + 1.
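
As a minimal sketch of that data-dependence split (an editor's addition,
not from the original post, written in C rather than occam, with
illustrative values):

#include <stdio.h>

/* t1 depends only on B and C, so a compiler is free to evaluate it in
 * parallel with the update of x; only the final multiply needs both. */
int main(void)
{
    int x = 3, B = 4, C = 5;        /* illustrative values */

    int t1 = B + C;                 /* independent of x  - strand 1 */
    x = x + 1;                      /* independent of t1 - strand 2 */
    int y = (x * x) * t1;           /* join point: needs both results */

    printf("x=%d y=%d\n", x, y);
    return 0;
}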

w..

Walter Banks

unread,
May 1, 2015, 2:08:25 PM5/1/15
to
On 01/05/2015 12:18 PM, Charlie Gibbs wrote:
> On 2015-05-01, Mike Spencer <m...@bogus.nodomain.nowhere> wrote:

>>
>> I have an idea:
>>
>> Imbue each of your 10K CPUs/cores with analogues of ego and greed and
>> establish an accounting medium of exchange between them, thus creating
>> a model of capitalism. Then, as any Econ 101 student (or professor)
>> will tell you, The Market will magically ensure the most efficient
>> distribution of tasks, resources and performance among the CPUs.
>>
>> The possibility of friction or externalities is left as an execise for
>> the reader.
>
> I like it! There'd need to be some sort of mechanism for bribing
> the supervisor programs (analogous to government) which oversee
> the entire system.
>

I like it as well, especially with your goal-evaluation modification. A
herd of robots programmed this way would be a social structure. The
question of what robots would see as a "good bribe" is an interesting one.

Somewhere I have a copy of Robot Wars, a competitive simulation package
from the late 70s, that needs dusting off.

w..

Mike Spencer

unread,
May 1, 2015, 2:57:28 PM5/1/15
to

Walter Banks <wal...@bytecraft.com> writes:

> On 01/05/2015 12:18 PM, Charlie Gibbs wrote:
>
>> On 2015-05-01, Mike Spencer <m...@bogus.nodomain.nowhere> wrote:
>>
>>>
>>> I have an idea:
>>>
>>> Imbue each of your 10K CPUs/cores with analogues of ego and greed and
>>> establish an accounting medium of exchange between them, thus creating
>>> a model of capitalism. Then, as any Econ 101 student (or professor)
>>> will tell you, The Market will magically ensure the most efficient
>>> distribution of tasks, resources and performance among the CPUs.
>>>
>>> The possibility of friction or externalities is left as an execise for
>>> the reader.
>>
>> I like it! There'd need to be some sort of mechanism for bribing
>> the supervisor programs (analogous to government) which oversee
>> the entire system.
>>

I surmise that bribery would appear as an emergent property if the
mapping of the notions of ego and greed into algorithms was adequate.
That's what the "I" in "AI" is for, no?

> I like it as well especially with your goal evaluation modification. A
> herd of robots programmed this way as a social structure. The question
> of what robots see as "good bribe" is an interesting question.

Emergent property again. The embodied notion of ego presumably would
have internal complexity; how it evolves would be independent of the
programmer and perhaps impenetrable to h{im,er} as well.

> Somewhere I have a copy of robot wars a competative simulation package
> from the late 70's that needs dusting off.
>
> w..

Rod Speed

unread,
May 1, 2015, 7:52:57 PM5/1/15
to


"Walter Banks" <wal...@bytecraft.com> wrote in message
news:mi0249$oe9$1...@speranza.aioe.org...
That is surprisingly complicated and we don’t really know.

We do know that the brain is amazingly good at it tho.

And that some are much better at multitasking than others.

Quadibloc

unread,
May 1, 2015, 9:06:36 PM5/1/15
to
On Friday, May 1, 2015 at 5:52:57 PM UTC-6, Rod Speed wrote:
> "Walter Banks" <wal...@bytecraft.com> wrote in message
> news:mi0249$oe9$1...@speranza.aioe.org...

> > How does the brain allocate resources and processing for tasks?

> That is surprisingly complicated and we don't really know.

And if they get it *wrong*, one result could be computers that overheat if
given defective programs that do things like going into infinite loops... like
the computers in the original Star Trek TV series.

John Savard

jmfbahciv

unread,
May 2, 2015, 8:55:38 AM5/2/15
to
After I logged off, I thought about space junk, collisions and
predictions. 10K processors may not be enough ;-).

/BAH

gareth

unread,
May 2, 2015, 11:35:22 AM5/2/15
to
"Walter Banks" <wal...@bytecraft.com> wrote in message
news:mhtfh5$t4t$1...@speranza.aioe.org...
Associative memory?

Beating Google at their own game?



Walter Banks

unread,
May 2, 2015, 2:08:30 PM5/2/15
to
On 02/05/2015 11:35 AM, gareth wrote:
> "Walter Banks" <wal...@bytecraft.com> wrote in message
> news:mhtfh5$t4t$1...@speranza.aioe.org...
>> On 30/04/2015 9:34 AM, gareth wrote:
>>> "Walter Banks" <wal...@bytecraft.com> wrote in message
>>> news:55421EE4...@bytecraft.com...
>>>> A hypothetical question well not so hypothetical.
>>>> I have been working on an ISA and tools project for a processor that has
>>>> a
>>>> compiler directed ISA and data size as part of the compiler
>>>> optimization.
>>>> It is a truly huge amount of processing power for a single package. Cost
>>>> probably in the $20 range by its release.
>>>
>>> Perchance a descendant of the picoChip range?
>>
>> No it isn't a descendant of picoChip. The project started as a informal
>> exchange of idea's that evolved over the last 2 or 3 years. We are at a
>> point where looking at how a chip that has more processors than the memory
>> of many mainframes that I first used can or should be used.
>>
>> Clearly taking conventional code and extracting parallism would use a few
>> of the processors. This chip's capability begs for a different kind of
>> programming paradigm.
>
> Associative memory?
>
> Beating Google at their own game?
>

Content addressable memory was once a hot topic. Time to hit the
internet and see what is happening now.

w..

Walter Banks

unread,
May 2, 2015, 2:13:39 PM5/2/15
to
Up to this point I have essentially been using message passing because
it is an approach that I have used before. A bigger question for me is
what are the advantages of other approaches. It may be the applications
that I have looked at but interprocessor communication speed has not
been a major factor in overall performance.
>
> As for benchmarks, everyone seems stuck on specint.
>
w..
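
As a purely illustrative sketch of that message-passing style (an
editor's addition, not from the original post, and saying nothing about
the real chip's interconnect), here is one "core" sending to another
over a one-slot mailbox, modelled with pthreads:

#include <pthread.h>
#include <stdio.h>

/* One-slot mailbox standing in for a link between two cores. */
struct mailbox {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             full;
    int             msg;
};

static struct mailbox chan = { PTHREAD_MUTEX_INITIALIZER,
                               PTHREAD_COND_INITIALIZER, 0, 0 };

static void mbox_send(struct mailbox *m, int msg)
{
    pthread_mutex_lock(&m->lock);
    while (m->full)                      /* wait for the slot to empty */
        pthread_cond_wait(&m->cond, &m->lock);
    m->msg = msg;
    m->full = 1;
    pthread_cond_signal(&m->cond);
    pthread_mutex_unlock(&m->lock);
}

static int mbox_recv(struct mailbox *m)
{
    pthread_mutex_lock(&m->lock);
    while (!m->full)                     /* wait for a message */
        pthread_cond_wait(&m->cond, &m->lock);
    int msg = m->msg;
    m->full = 0;
    pthread_cond_signal(&m->cond);
    pthread_mutex_unlock(&m->lock);
    return msg;
}

static void *receiver(void *arg)
{
    (void)arg;
    for (int i = 0; i < 5; i++)
        printf("received %d\n", mbox_recv(&chan));
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, receiver, NULL);
    for (int i = 0; i < 5; i++)
        mbox_send(&chan, i);             /* "core 0" sends to "core 1" */
    pthread_join(t, NULL);
    return 0;
}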

Quadibloc

unread,
May 2, 2015, 5:25:12 PM5/2/15
to
On Saturday, May 2, 2015 at 12:08:30 PM UTC-6, Walter Banks wrote:

> Content addressable memory was once a hot topic. Time to hit the
> internet and see what is happening now.

Well, that technology is used to make cache memory work. Otherwise, not a lot.

John Savard

gareth

unread,
May 3, 2015, 5:13:14 AM5/3/15
to
"Walter Banks" <wal...@bytecraft.com> wrote in message
news:55421EE4...@bytecraft.com...
>A hypothetical question well not so hypothetical.
> I have been working on an ISA and tools project for a processor that has
> a
> compiler directed ISA and data size as part of the compiler
> optimization.
> It is a truly huge amount of processing power for a single package. Cost
> probably in the $20 range by its release.

Ideal for SDR for amateur (ham) radio RX and TX applications, in addition
to test equipment; one of the German vector analysers used an early picoChip.

What is to be the RAM / ROM makeup of each processor, and how will the I/O
appear?


Scott Lurndal

unread,
May 3, 2015, 11:33:36 AM5/3/15
to
It depends on how well you can decompose your problem into a set
of parallel operations. Once you have shared data (e.g. graph
analysis), then having coherent memory becomes more important.

Walter Banks

unread,
May 4, 2015, 9:04:11 AM5/4/15
to
Getting away from finding potential parallel paths in a code block and
moving to parallel paths in a total application seems to be key. Individual
tasks may be sequential, but the whole application could have lots
of parallelism. For example, process automation.

w..




Walter Banks

unread,
May 4, 2015, 9:13:31 AM5/4/15
to
The picoChip is basically a lot of 16-bit homogeneous processors with a
DSP-like ISA. Ours is a lot of heterogeneous processors. The picoChip is
well suited to instrumentation, although I am surprised that it didn't
use 18- or 24-bit ints. (I think that is in part because of when it was
originally developed.)

w..




Walter Banks

unread,
May 4, 2015, 11:35:26 AM5/4/15
to
The equivalent of the sound that won't leave your head or a recurring
dream. After a while some preemptive task takes over and it goes away :)

w..

Ibmekon

unread,
May 4, 2015, 12:31:59 PM5/4/15
to
Does this mean we need a new compiler, language, or both ?

Carl Goldsworthy
--
Yes, I am paranoid.
And yes, they aaaarrrgggg

Ibmekon

unread,
May 4, 2015, 6:19:45 PM5/4/15
to

Harold

unread,
May 4, 2015, 9:58:16 PM5/4/15
to
Network packet processing is one of those applications that can be embarrassingly parallel. Commercially available multicore processor chips from vendors like EZchip, Cavium, and Broadcom are designed for handling millions of packets per second using dozens of parallel CPUs. Minimum-sized Ethernet packets arrive every 67.2 nsec on a 10 Gb/s link, so load balancing across multiple cores is essential if you want to process packets in software without any drops. The network interface that receives the packets from the wire therefore includes load-balancing hardware so that packets are distributed among the available CPUs for processing, usually based on a flow hash calculated from fields in the packet header.
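
As an editor's sketch of that flow-hash steering (the hash function and
core count below are invented for illustration; real NICs typically use
a Toeplitz hash over the 5-tuple):

#include <stdint.h>
#include <stdio.h>

#define NUM_CPUS 64    /* illustrative, not any real part's core count */

/* Toy 5-tuple flow hash: packets of the same flow always hash to the
 * same CPU, so per-flow state needs no cross-core locking. */
static uint32_t flow_hash(uint32_t src_ip, uint32_t dst_ip,
                          uint16_t src_port, uint16_t dst_port,
                          uint8_t proto)
{
    uint32_t h = 2166136261u;                       /* FNV-1a style mix */
    uint32_t words[3] = { src_ip, dst_ip,
                          ((uint32_t)src_port << 16) | dst_port };
    for (int i = 0; i < 3; i++) {
        h ^= words[i];
        h *= 16777619u;
    }
    h ^= proto;
    h *= 16777619u;
    return h;
}

int main(void)
{
    uint32_t cpu = flow_hash(0x0a000001, 0x0a000002, 12345, 80, 6) % NUM_CPUS;
    printf("packet steered to CPU %u\n", cpu);
    return 0;
}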

The key problem here is not just plonking down lots of CPUs on a die. That part is easy. You also have to ensure cache coherency between the CPUs so that you can use a standard multithread programming model, with standard tools like SMP Linux, pthreads and gcc __sync primitives. This allows you to leverage all the gobs of code out there written to that model.
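
A minimal illustration of that shared-memory model (an added sketch, not
Tilera code): plain pthreads plus a gcc __sync builtin, relying on
hardware cache coherency to make a single shared counter behave:

#include <pthread.h>
#include <stdio.h>

static long packets_seen;                 /* shared across all workers */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++)
        __sync_fetch_and_add(&packets_seen, 1);   /* atomic RMW */
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("%ld\n", packets_seen);        /* prints 400000 */
    return 0;
}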

The alternative to implementing multicore cache coherency in hardware is a custom interconnect, a custom inter-CPU communications model, a custom programming language, a custom OS and so on. All code has to be written or rewritten from scratch. This approach has been tried several times - Inmos Transputer w' Occam and Ambric MPPA w' Java come to mind. These esoteric architectures are inevitably doomed to niche applications because there's no software out there to leverage. Cache coherency is hard to do in a scalable fashion, but it's really essential for commercial viability.

Disclaimer - I used to work for Tilera.

Ibmekon

unread,
May 5, 2015, 6:14:09 AM5/5/15
to
On Mon, 4 May 2015 18:58:14 -0700 (PDT), Harold <hzra...@comcast.net>
wrote:
So you used to work in retail - sort of anyway.

From reentrant code to conserve memory, to coherent cache to duplicate
it - in one lifetime.
I saw reentrant code used in a network where several terminals might
use the same code at the same time. Specially programmed of course -
with each terminal having a control region.

Walter Banks

unread,
May 5, 2015, 7:36:59 AM5/5/15
to
Maybe. I currently have a C compiler that supports multiple-processor
heterogeneous ISAs.

The next language step may well be some higher-level approach to dealing
with whole-system applications. The traditional approach has been OSs
that manage a collection of related applications; what I think is more
appropriate is something more like the embedded-system approach to big,
well-defined systems. Car engine controllers, for example.

w..

Walter Banks

unread,
May 5, 2015, 8:03:08 AM5/5/15
to
On 04/05/2015 9:58 PM, Harold wrote:
> The alternative to implementing multicore cache coherency
> in hardware is a custom interconnect, a custom inter-CPU
> communications model, a custom programming language, a custom
> OS and so on. All code has to be written or rewritten from
> scratch. This approach has been tried several times - Inmos
> Transputer w' Occam and Ambric MPPA w' Java come to mind.
> These esoteric architectures are inevitably doomed to niche
> applications because there's no software out there to leverage.
> Cache coherency is hard to do in a scalable fashion, but it's
> really essential for commercial viability.
>

It should be pointed out that custom, non-mainstream approaches far
outnumber the conventional approaches for installed applications. I
suspect that massively parallel systems are going to be one of the niche
markets. They may be a niche market that does not depend on an
installed base of software.


When you have lots of processors on a chip, it may be that the
applications are limited to those whose off-chip communication is
limited. That opens up pattern recognition, neural simulation, and the
collection of simulations that have been mentioned (chemistry, physics,
weather, and astro).

Neural simulation for robotics and plant-operations planning might
now have reached the point where it is possible. It is scary to think
of computers with more processors than memory cells in the first
computer I used (an IBM 1620).

w..


Ibmekon

unread,
May 5, 2015, 8:11:50 AM5/5/15
to
On Tue, 05 May 2015 07:36:40 -0400, Walter Banks
<wal...@bytecraft.com> wrote:

>On 04/05/2015 12:21 PM, Ibmekon wrote:
>> On Mon, 04 May 2015 09:08:38 -0400, Walter Banks
>> <wal...@bytecraft.com> wrote:
>>
>>> Scott Lurndal wrote:
>>>
>>>> Walter Banks <wal...@bytecraft.com> writes:
>>>>>
>>>>> Up to this point I have essentially been using message passing because
>>>>> it is an approach that I have used before. A bigger question for me is
>>>>> what are the advantages of other approaches. It may be the applications
>>>>> that I have looked at but interprocessor communication speed has not
>>>>> been a major factor in overall performance.
>>>>
>>>> It depends on how well you can decompose your problem into a set
>>>> of parallel operations. Once you have shared data (e.g. graph
>>>> analysis), then having coherent memory becomes more important.
>>>
>>> Getting away from finding potential parallel paths in a code block and
>>> moving to parallel paths in a total application seems to be key. Individual
>>> tasks may be sequential but the whole application could have lots
>>> of parallelisms. For example, process automation.
>>>
>>> w..
>>>
>>>
>>
>> Does this mean we need a new compiler, language, or both ?
>
>Maybe, I currently have a C compiler that supports multiple processor
>heterogeneous ISA's.

Is the C compiler available online?
I am messing with gcc currently and would be interested to see the
language extensions, if any are needed. I just ordered an RK3288 with
the intention of transferring from an Intel duo, but I have yet to
investigate any implications for my C programs.

>
>The next language step may well be some higher level approach to dealing
>with whole system applications. The traditional approach has been OS's
>that manage a collection of related applications, what I think is more
>appropriate is more like the embedded system approach to big well
>defined systems. Car engine controllers for example.
>
>w..

With separate processors or cores running dedicated programs for video,
sound, etc., we seem to be well along the way.

Walter Banks

unread,
May 5, 2015, 11:03:20 AM5/5/15
to
Christian Brunschen email me your contact information. w..

greymausg

unread,
May 5, 2015, 12:57:21 PM5/5/15
to
My nephew works in building management, particularly in planning
buildings, where there are parts of a job that have to be done in
sequence (buying the land, checking title, getting planning permission),
other parts done in parallel (walls, installing cables), and then
there is putting on the roof (you can't put on the roof until the walls
are up, but you can build room dividers).. so there is already a set of
[paradigms?] for this sort of stuff.

I remember occam; that seemed to be the answer to a lot of problems, but
nothing ever happened. Maybe `Occam's Razor' dictated that there were
already answers?


--
maus
_ _ ._ .._ ...

Rob Warnock

unread,
May 5, 2015, 1:40:03 PM5/5/15
to
greymausg <ma...@mail.com> wrote:
+---------------
| My nephew works in building management, particularly in planning
| building, where there are parts of a job that have to be done in
| sequence, buying the land, checking titlem getting planning permission,
| then other parts done in parallel, walls, installing cables, and then
| there is putting on the roof (you can't put on the roof, until the walls
| are up, but you can build room dividers).. so there is already a set of
| [paradigms?] for this sort of stuff.
+---------------

See <http://en.wikipedia.org/wiki/Gantt_chart>.

Some personal trivia: One of the first things I ever wrote
in SNOBOL was a program to read a list of things to be done
and their dependencies, do a topological sort[1] of the
dependencies, and print out a Gantt chart using asterisks
to print the "bars".
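
As an added sketch of that dependency-ordering step (Kahn's formulation
of what Knuth presents as Algorithm T, written here in C rather than
SNOBOL; the task list borrows the building example upthread and is
purely illustrative):

#include <stdio.h>

#define NTASK 5
static const char *name[NTASK] = { "land", "title", "permit", "walls", "roof" };
/* dep[i][j] != 0 means task j must finish before task i starts */
static const int dep[NTASK][NTASK] = {
    /* land   */ { 0, 0, 0, 0, 0 },
    /* title  */ { 1, 0, 0, 0, 0 },
    /* permit */ { 0, 1, 0, 0, 0 },
    /* walls  */ { 0, 0, 1, 0, 0 },
    /* roof   */ { 0, 0, 0, 1, 0 },
};

int main(void)
{
    int indeg[NTASK] = { 0 }, done[NTASK] = { 0 }, emitted = 0;

    for (int i = 0; i < NTASK; i++)          /* count prerequisites */
        for (int j = 0; j < NTASK; j++)
            indeg[i] += dep[i][j];

    while (emitted < NTASK) {
        int progress = 0;
        for (int i = 0; i < NTASK; i++) {
            if (!done[i] && indeg[i] == 0) { /* ready to schedule */
                printf("%s\n", name[i]);
                done[i] = 1;
                emitted++;
                progress = 1;
                for (int k = 0; k < NTASK; k++)
                    if (dep[k][i])           /* release dependents */
                        indeg[k]--;
            }
        }
        if (!progress) {                     /* circular dependency */
            fprintf(stderr, "cycle detected\n");
            return 1;
        }
    }
    return 0;
}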


-Rob

[1] Tip o' the hat to Don Knuth for "Algorithm T" in "The Art
of Computer Programming", Vol.1 Ch.2. "Information Structures".

-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <http://rpw3.org/>
San Mateo, CA 94403

Andrew Swallow

unread,
May 5, 2015, 1:54:43 PM5/5/15
to
On 05/05/2015 13:11, Ibmekon wrote:
> On Tue, 05 May 2015 07:36:40 -0400, Walter Banks
{snip}

>>
>> The next language step may well be some higher level approach to dealing
>> with whole system applications. The traditional approach has been OS's
>> that manage a collection of related applications, what I think is more
>> appropriate is more like the embedded system approach to big well
>> defined systems. Car engine controllers for example.
>>
>> w..
>
> With separate processors or cores running video, sound etc dedicated
> language programs - we seems to be well along the way.
>
> Carl Goldsworthy
> --
> Yes, I am paranoid.
> And yes, they aaaarrrgggg
>
>
>
One possible design is to have most of the processors working on a
single problem while a handful do things like run the text editor and
peripherals.

gareth

unread,
May 5, 2015, 2:00:22 PM5/5/15
to
"Andrew Swallow" <am.sw...@btinternet.com> wrote in message
news:-omdne_FZL1BntTI...@giganews.com...
>>
> One possible design is to have most of the processors working on a single
> problem while a handful do things like run the text editor and
> peripherals.

Or use them all to destabilise the basis of secure computation by seeking
for prime factors?


Andrew Swallow

unread,
May 5, 2015, 2:08:10 PM5/5/15
to
You may want something to scan the keyboard, display progress and write
interim data to the disk.

gareth

unread,
May 5, 2015, 2:20:21 PM5/5/15
to
"Andrew Swallow" <am.sw...@btinternet.com> wrote in message
news:FPudnYAYLJyemtTI...@giganews.com...
With so many processors at our disposal, some could be used as little
more than logic gates to achieve any hardware function?


jmfbahciv

unread,
May 6, 2015, 8:54:13 AM5/6/15
to
Would it be limited to simulations? I suppose an emulation
would always lag a little behind real time behaviour since
electricity is so slow.

When corrections have to be done based on what some other CPUs
calculate, the lag would be even longer. An example might
be catching a fly ball in left field with wind gusts and rain.
If the calcs were limited to one CPU, the lag time calculating
where the baseball mitt should be placed would be very long.
How would the processing be split up? One CPU dealing with
the wind; another with the ball arc; another with baseball
mitt placement?

/BAH

Scott Lurndal

unread,
May 6, 2015, 9:26:39 AM5/6/15
to
Electricity is slow? Compared to what?

Dave Garland

unread,
May 6, 2015, 10:52:29 AM5/6/15
to
5 ns/m in Cat.5



Quadibloc

unread,
May 6, 2015, 4:01:07 PM5/6/15
to
Which _is_ only around 0.6c.

John Savard

jmfbahciv

unread,
May 7, 2015, 7:30:05 AM5/7/15
to
Light, EMF, chemical reactions, and some physics.

/BAH

Andrew Gabriel

unread,
May 7, 2015, 3:29:34 PM5/7/15
to
In article <55421EE4...@bytecraft.com>,
Walter Banks <wal...@bytecraft.com> writes:
> A hypothetical question well not so hypothetical.
>
> I have been working on an ISA and tools project for a processor that has
> a
> compiler directed ISA and data size as part of the compiler
> optimization.
>
> It is a truly huge amount of processing power for a single package. Cost
>
> probably in the $20 range by its release.

Issues with building chips with really large numbers of cores (or hardware
threads) relate to being able to build a balanced system design, i.e.
store bandwidth that can feed so many processors without becoming the
bottleneck, enough memory for your thread's stacks and application data,
I/O bandwidth/channels that can perform sufficient I/O, etc.
If you don't get those right, many of your processors will spend time idle
when there's work to do, negating the point of including so many cores in
the first place.

Obviously, it depends what you intend to use the system for - there will
be some applications which might not need all of these extras, but most
apps will need at least some of them.

You might look at what Oracle has done with the SPARC chip in last couple
of years, and how they have to scale up the capabilities of the rest of
the system design to support chips with large numbers of hardware threads,
e.g. 32TB main memory with very high memory bandwidth, etc. You might also
note the difficulty in finding real uses for such systems - they are almost
always split into large numbers of virtual machines, but they are often
considered too big even for that - too many eggs in one basket. You really
do need a specific use case in mind for building chips/systems with such
large numbers of cores/hardware threads.

--
Andrew Gabriel
[email address is not usable -- followup in the newsgroup]

Stephen Sprunk

unread,
May 7, 2015, 4:23:53 PM5/7/15
to
On 07-May-15 14:28, Andrew Gabriel wrote:
> You might look at what Oracle has done with the SPARC chip in last
> couple of years, and how they have to scale up the capabilities of
> the rest of the system design to support chips with large numbers of
> hardware threads, e.g. 32TB main memory with very high memory
> bandwidth, etc. You might also note the difficulty in finding real
> uses for such systems - they are almost always split into large
> numbers of virtual machines, but they are often considered too big
> even for that - too many eggs in one basket. You really do need a
> specific use case in mind for building chips/systems with such large
> numbers of cores/hardware threads.

~15 years ago, I was touring a data center; most of it was boring, just
rack after rack of servers and switches, but then we came to a pair of
monstrous servers (a full rack each), the largest Sun made. They housed
the (allegedly) largest Oracle database in the world. I said it'd be
cheaper to cluster small servers with more total CPUs/RAM. They'd tried
that but, due to the specific workload (very high percentage of writes),
performance was actually lower.

I don't know Oracle's clustering model (then or now) at all, but in
other databases I'm familiar with, the model is many "slaves" that can
handle reads but forward all writes to a single "master". Obviously, if
the write load exceeds the capacity of that master, the entire system
falls apart. So, for some workloads, you may need a few really big
servers rather than a bunch of small ones.

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking

Quadibloc

unread,
May 8, 2015, 2:13:28 AM5/8/15
to
On Wednesday, May 6, 2015 at 7:26:39 AM UTC-6, Scott Lurndal wrote:
> jmfbahciv <See....@aol.com> writes:

> >Would it be limited to simulations? I suppose an emulation
> >would always lag a little behind real time behaviour since
> >electricity is so slow.

> Electricity is slow? Compared to what?

Well, computer simulations are slow compared to reality because *arithmetic* is
slow compared to the actual analog phenomena. But, yes, electricity itself is
pretty fast.

John Savard

Walter Banks

unread,
May 8, 2015, 7:24:12 AM5/8/15
to
I suspect that its biggest uses are not likely to be high performance
data applications but something more pedestrian. A low cost chip with
lots of processing power and almost by definition similar amounts of I/O
to current embedded systems chips. This would open the doors to embedded
systems that can use lots of processing, robotics would certainly be one
application, some kinds of simulators.

The paradigm shift is the low-cost parallel computing aspect, where
processing elements are only several orders of magnitude more expensive
than memory rather than thousands to 100 thousand times as expensive.

w..




Andrew Gabriel

unread,
May 8, 2015, 1:19:12 PM5/8/15
to
In article <mighio$mv0$1...@dont-email.me>,
Oracle's clustering model is usually based on Oracle RAC, although it
can use some other vendors' clustering models. RAC is a multi-master
model, with communication between masters to handle locking when
necessary. It doesn't scale fantastically well, and a large single
system image can win out (Oracle does scale very well across large
numbers of hardware threads found in large single systems, but those
systems are expensive).

Exadata database system introduced the concept of slaves in the form
of intelligent Exadata storage cells. These are multiple ASM storage
systems, which are passed the queries by the masters, and then work to
return the results to the master from the section of the database stored
on their own local disks. The master which received the query collates
the answers from all the slaves, and returns the results. This enables
more parallel processing than any single master could do, bearing in
mind these are x86 systems and not a large single system image. The
Exadata storage cells can be used separately (not part of Exadata),
but this is rarely done AFAIK.

In industry, new designs for giant databases have tended to move away
from requiring ACID relational databases because of the difficulty and
expense in scaling them (horizontally or vertically or geographically),
and there's been a move to using object store based databases, sometimes
spread over continents. However, large legacy relational databases will
be around for some time yet.

Shmuel Metz

unread,
May 13, 2015, 8:47:17 AM5/13/15
to
In <mhtgdp$vgh$1...@speranza.aioe.org>, on 04/30/2015
at 11:06 AM, Walter Banks <wal...@bytecraft.com> said:

>I asked the question primarily because the computers that we all
>know and love started out of the only technology that could create
>practical functional computers. 65 years later we have mostly
>spent lifetimes organizing applications as a series of sequential
>steps.

There's a lot of parallelism in IBM mainframes, both within a CPU[1]
and between CPUs[2], but scaling up to thousands of processors would
be difficult. OTOH, scaling up from two was once thought to be
difficult.

[1] Multiple SRB's and TCB's, typically not tied to a specific CPU.

[2] There are specialized instructions to reduce the overhead of
queues accessed concurrently from multiple processors.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to spam...@library.lspace.org

Shmuel Metz

unread,
May 13, 2015, 8:57:44 AM5/13/15
to
In <mi04em$tv4$1...@speranza.aioe.org>, on 05/01/2015
at 11:00 AM, Walter Banks <wal...@bytecraft.com> said:

>Think of how do you evaluate a benchmark suite with this processor?

The same as any other benchmark; you report the metrics you claim to
be measuring, normally a time from start to finish.

>How do you report the results?

What results?

>Run time for each program?

Only if that is what you agreed to report.

>Total run time for the whole benchmark?

That sounds most reasonable, but see above.

>Scale for the 5% of the chip usage?

I wouldn't trust that. If you want 50% of the chip, submit 10 times
more jobs and measure the actual elapsed time.

Shmuel Metz

unread,
May 28, 2015, 6:47:14 AM5/28/15
to
In <mi33qq$16d$1...@speranza.aioe.org>, on 05/02/2015
at 02:08 PM, Walter Banks <wal...@bytecraft.com> said:

>Content addressable memory was once a hot topic.

It may not be a hot topic, but it never went away. Some processors
still use a hybrid CAM for TLB.
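
As an added toy model of that kind of content-addressed lookup (all
sizes and fields invented; a hardware CAM compares the tag against every
entry at once, and the loop below is just the software stand-in for that
parallel compare):

#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 8

struct tlb_entry { uint64_t vpn; uint64_t pfn; int valid; };
static struct tlb_entry tlb[TLB_ENTRIES];

static int tlb_lookup(uint64_t vaddr, uint64_t *paddr)
{
    uint64_t vpn = vaddr >> 12;                  /* 4 KiB pages */
    for (int i = 0; i < TLB_ENTRIES; i++) {      /* "parallel" tag match */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *paddr = (tlb[i].pfn << 12) | (vaddr & 0xfff);
            return 1;                            /* hit */
        }
    }
    return 0;                                    /* miss: walk page tables */
}

int main(void)
{
    tlb[0] = (struct tlb_entry){ .vpn = 0x1234, .pfn = 0x55, .valid = 1 };
    uint64_t pa;
    if (tlb_lookup(0x1234abcULL, &pa))
        printf("pa = 0x%llx\n", (unsigned long long)pa);
    return 0;
}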

Shmuel Metz

unread,
May 28, 2015, 9:56:52 AM5/28/15
to
In <tc4hkalsmoibj7447...@4ax.com>, on 05/05/2015
at 11:14 AM, Ibmekon said:

>From reentrant code to conserve memory

Reentrant code not only makes it easier[1] to share memory, it also is
necessary if you have shared data.

[1] Yes, you can write R/W reentrant code. No, it's not a good idea
except for instructional purposes.
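
As an added illustration of that point (not from the thread; the names
are invented): the static buffer makes the first routine unsafe to share
between concurrent callers, while the second keeps all state in the
caller's storage and can be entered from any number of threads or
terminals at once.

#include <string.h>

static char scratch[64];                       /* shared R/W state */

/* Non-reentrant: two concurrent callers will trample each other. */
char *upcase_nonreentrant(const char *s)
{
    strncpy(scratch, s, sizeof scratch - 1);
    scratch[sizeof scratch - 1] = '\0';
    for (char *p = scratch; *p; p++)
        *p = (*p >= 'a' && *p <= 'z') ? *p - 32 : *p;
    return scratch;
}

/* Reentrant: all state lives in the caller-supplied buffer. */
void upcase_reentrant(const char *s, char *out, size_t outlen)
{
    size_t i;
    if (outlen == 0)
        return;
    for (i = 0; i + 1 < outlen && s[i]; i++)
        out[i] = (s[i] >= 'a' && s[i] <= 'z') ? s[i] - 32 : s[i];
    out[i] = '\0';
}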

Scott Lurndal

unread,
May 28, 2015, 10:23:41 AM5/28/15
to
Shmuel (Seymour J.) Metz <spam...@library.lspace.org.invalid> writes:
>In <mi33qq$16d$1...@speranza.aioe.org>, on 05/02/2015
> at 02:08 PM, Walter Banks <wal...@bytecraft.com> said:
>
>>Content addressable memory was once a hot topic.
>
>It may not be a hot topic, but it never went away. Some processors
>still use a hybrid CAM for TLB.

Our processor has dozens of CAMs (TCAM, MCAM, et alia) for a
number of purposes (microTLBs, TLBs, MAC tables, etc.).

There are also large-scale CAM chips available for high-end
search applications.
0 new messages