
Not enough parallelism in programming


Joe Seigh

Aug 25, 2005, 7:53:02 AM
At least in the opinion of this person here
http://www.eetimes.com/news/latest/showArticle.jhtml;jsessionid=JZEMR0B52MV2CQSNDBESKHA?articleID=169300057
I'm not sure what he's getting at though. Does he think multi-cores are
not the way to go but SIMD like the GPUs? Or is MIMD ok, we're just not
getting enough threads to run?


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Torben Ægidius Mogensen

Aug 25, 2005, 8:32:22 AM
Joe Seigh <jsei...@xemaps.com> writes:

> At least in the opinion of this person here
> http://www.eetimes.com/news/latest/showArticle.jhtml;jsessionid=JZEMR0B52MV2CQSNDBESKHA?articleID=169300057
> I'm not sure what he's getting at though. Does he think multi-cores are
> not the way to go but SIMD like the GPUs? Or is MIMD ok, we're just not
> getting enough threads to run?

I think what he says is that SIMD is good for GPUs because their
workload suits this model, but that general-purpose CPUs that offer
parallelism in the form of MIMD or other won't find many programs
capable of exploiting said parallelism. This is mainly due to
mainstream programming methodologies and languages being inherently
sequential. One reason for this is that independent subcomputations
are neither marked as such, nor are they easily recognisable as such
because the programming model uses a global updatable store where any
subprocess is able to use and modify the same storage as any other.
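
A minimal illustration of that point (all names invented for the example):
nothing below tells a compiler whether the two calls are independent, and
in fact they are not, because both reach the same global store.

struct Record { int hits; int key; };

int total_hits = 0;                  /* shared, globally updatable store */

void update_stats(Record *r) { total_hits += r->hits; }
void update_index(Record *r) { r->key = total_hits; }  /* also reads the global */

void process(Record *a, Record *b)
{
    update_stats(a);   /* nothing marks this as independent of the next call */
    update_index(b);   /* and it isn't: both touch total_hits, so order matters */
}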

Research in automatically recognising parallelism in programs written
in mainstream languages has largely failed (with the exception of
numerical code written in Fortran), so IMO we need either to use
languages where parallelism is either explicit (and verifiably
independent) or recognizable with methods sufficiently robust that
small code changes won't break parallelism. The latter will probably
require restrictions that almost amount to explicit declaration of
parallelism, so the first way may be the way to go.

But it is incredibly hard to get a new language accepted in industry.
You need to convince people that switching is worth the investment in
compilers and IDEs, retraining of personnel, slower development until
programmers get used to the new model, the hassle of interfacing old
code to new, etc. If the main benefit is better exploitation of
parallelism, few people will bother to change, as few programs are
time-critical enough for parallelism to matter. And even when they
are, people often find it easier to invest in faster hardware than
recode their programs.

As for universities offering parallel programming languages as
undergraduate topics, students are already quite critical of
universities using languages that are not perceived as mainstream
enough to be useful to have on your CV when you go looking for a job.
So getting them to accept something that seems esoteric and very
specialist won't be easy.

Torben

Stephen Fuld

Aug 25, 2005, 10:35:57 AM

""Torben Ægidius Mogensen"" <tor...@app-6.diku.dk> wrote in message
news:7zll2qt...@app-6.diku.dk...

snip

> we need either to use
> languages where parallelism is either explicit (and verifiably
> independent)

I'd like to explore this further. ISTM that there are only a few languages
that have any explicit way to declare parallelism and that they differ in
the "level" of parallelism they express. For example, I believe that Occam
expresses parallelism at a very low level (even evaluation of a single
expression can be specified to be in parallel. But COBOL offers explicit
parallelism at the level similar to a full thread. I'm not sure which, if
either of these is the "right" thing to do.

So, the questions are

- Is there an existing language that expresses parallelism at the right
level for multi-core/multithreaded core to take advantage of? Is it easy
for programmers to use the language productively?

- If not what would such a language look like? That is, what level of
parallelism is appropriate? How would the semantics/syntax be designed to
make program development easiest and most bug free?

--
- Stephen Fuld
e-mail address disguised to prevent spam


Andy Freeman

Aug 25, 2005, 11:00:35 AM
Torben Ægidius Mogensen wrote:
> IMO we need either to use
> languages where parallelism is either explicit (and verifiably
> independent)

It isn't much of an exaggeration to say that programmers can't
correctly use explicit parallelism.

Yes, they think that they can, but most parallel programs with
non-trivial interaction show that they're wrong. (Yes, those
programs seem to work much of the time, but they fail often
enough to demonstrate that we're mostly lucky.)

This inability shouldn't be all that surprising. Programmers
also have problems writing non-trivial sequential programs.

-andy

Nick Maclaren

Aug 25, 2005, 11:21:01 AM

In article <1124982035.6...@z14g2000cwz.googlegroups.com>,
"Andy Freeman" <ana...@earthlink.net> writes:

|> Torben Ægidius Mogensen wrote:
|> > IMO we need either to use
|> > languages where parallelism is either explicit (and verifiably
|> > independent)
|>
|> It isn't much of an exaggeration to say that programmers can't
|> correctly use explicit parallelism.

Well, many of those around here can, at least up to the same
standards as most programmers can use most programming languages.

|> Yes, they think that they can, but most parallel programs with
|> non-trivial interaction shows that they're wrong. (Yes, those
|> programs seem to work much of the time, but they fail often
|> enough to demonstrate that we're mostly lucky.)

Not in the domains where people do that seriously - admittedly
rather specialised. As I said at SunHPC, getting arbitrary
parallel designs right is a nightmare task, but there are several
restricted models that are feasible to debug. Experienced
parallel programmers use one of them.

For example, using lockstep models (BSP, MPI collectives etc.)
is feasible. So are many classes of dataflow.
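
For concreteness, a minimal sketch of the lockstep style (C++ calling the
standard MPI C API; the local computation is just a placeholder): every
process computes independently, then all meet at a single collective.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = rank * 1.0;   /* placeholder for real local work */
    double global = 0.0;

    /* The one synchronisation point: everyone contributes, everyone waits. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}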


Regards,
Nick Maclaren.

Joe Seigh

Aug 25, 2005, 11:23:32 AM
Torben Ægidius Mogensen wrote:
> Joe Seigh <jsei...@xemaps.com> writes:
>
>
>>At least in the opinion of this person here
>>http://www.eetimes.com/news/latest/showArticle.jhtml;jsessionid=JZEMR0B52MV2CQSNDBESKHA?articleID=169300057
>>I'm not sure what he's getting at though. Does he think multi-cores are
>>not the way to go but SIMD like the GPUs? Or is MIMD ok, we're just not
>>getting enough threads to run?
>
>
> I think what he says is that SIMD is good for GPUs because their
> workload suits this model, but that general-purpose CPUs that offer
> parallelism in the form of MIMD or other won't find many programs
> capable of exploiting said parallelism. This is mainly due to
> mainstream programmming methodologies and languages being inherently
> sequential. One reason for this is that independent subcomputations
> are neither marked as such, nor are they easily recognisable as such
> because the programming model uses a global updatable store where any
> subprocess is able to use and modify the same storage as any other.

[...]

But is he just saying this because he has the "embarrassingly parallel"
easy part? Why complain about the skill set out there when the scarcity
of skills is to your advantage? Get out of the graphics market, write
your own highly parallelized game engine and own the game industry, which
makes more money than Hollywood. Piece of cake if you already know how
to write parallel programs.

Torben Ægidius Mogensen

Aug 25, 2005, 12:10:13 PM
Joe Seigh <jsei...@xemaps.com> writes:

> Torben Ægidius Mogensen wrote:
> > Joe Seigh <jsei...@xemaps.com> writes:
> >
> >>At least in the opinion of this person here
> >>http://www.eetimes.com/news/latest/showArticle.jhtml;jsessionid=JZEMR0B52MV2CQSNDBESKHA?articleID=169300057
> >>I'm not sure what he's getting at though. Does he think multi-cores are
> >>not the way to go but SIMD like the GPUs? Or is MIMD ok, we're just not
> >>getting enough threads to run?
> > I think what he says is that SIMD is good for GPUs because their
> > workload suits this model, but that general-purpose CPUs that offer
> > parallelism in the form of MIMD or other won't find many programs
> > capable of exploiting said parallelism.
>

> But is he just saying this because he has the "embarrassingly parallel"
> easy part. Why complain about the skill set out there when the scarcity
> of skills is to your advantage. Get out of the graphics market, write
> your own highly parallelized game engine and own the game industry which
> makes more money than Hollywood. Piece of cake if you already know how
> to write parallel programs.

I don't think graphics engines are the problem; they are fairly easy
to parallelise on GPUs (which are more or less designed for this). It
is (almost) every other program that is the problem.

Torben

Torben Ægidius Mogensen

Aug 25, 2005, 12:15:24 PM
"Stephen Fuld" <s.f...@PleaseRemove.att.net> writes:

Functional programmers have long argued that pure functional (or
value-oriented) programming is well suited for this. If you can't
modify store that has already been written, you don't need to
synchronise on every memory access. The main idea is to run a function
in parallel with evaluation of its arguments, so it is only when you
access such an argument that you need to synchronise. Lazy functional
languages have a similar synchronisation mechanism already, so it will
cost relatively little. The main problem is that not all calls are
worth parallelising (even though it would be safe to do so), so you
need some way of guiding the decision of whether to spawn or not.
Static analysis is probably insufficient, but you can have user
annotations that say that if a condition is true, you spawn, otherwise
not. Wrong annotations will only mean slower programs, not faulty
ones, so it is fairly safe to leave this to programmers.
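
A rough sketch of that idea in C++ terms, with std::async standing in for
the spawn annotation; whether to spawn at all is exactly the decision such
an annotation would guide:

#include <future>
#include <utility>

int expensive_argument() { return 6 * 7; }   // stands in for a pure computation

int consume(std::future<int> arg)
{
    // ... other work proceeds in parallel with the argument's evaluation ...
    return arg.get();            // synchronise only when the value is touched
}

int main()
{
    // "Spawn" evaluation of the argument, then run the function immediately.
    auto arg = std::async(std::launch::async, expensive_argument);
    return consume(std::move(arg));
}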

Torben

Greg Lindahl

Aug 25, 2005, 1:42:45 PM
In article <hHkPe.133637$5N3.1...@bgtnsc05-news.ops.worldnet.att.net>,
Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:

>I'd like to explore this further. ISTM that there are only a few languages
>that have any explicit way to declare parallelism and that they differ in
>the "level" of parallelism they express.

After 30 years of research, there are more than just a few which have
been proposed, experimented with, and had much ink spilled in papers about
the results, etc.

I'm not sure what you hope to accomplish with a comp.arch discussion
of such a huge body of work.

-- g

Nick Maclaren

Aug 25, 2005, 2:08:15 PM
In article <430e0315$1...@news.meer.net>, Greg Lindahl <lin...@pbm.com> wrote:
>In article <hHkPe.133637$5N3.1...@bgtnsc05-news.ops.worldnet.att.net>,
>Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>
>>I'd like to explore this further. ISTM that there are only a few languages
>>that have any explicit way to declare parallelism and that they differ in
>>the "level" of parallelism they express.
>
>After 30 years of research, there are more than just a few which have
>been proposed, experimented with, and spilled much ink on papers about
>the results, etc.

A good 30 years. To be fair, there are only a few that have been
used, successfully, by people whose interest is not in parallelism
as such. But even that number is larger than most people realise.

>I'm not sure what you hope to accomplish with a comp.arch discussion
>of such a huge body of work.

Well, one could have a general discussion of parallel paradigms,
but even that is not a small topic.


Regards,
Nick Maclaren.

Dan Koren

Aug 25, 2005, 2:28:18 PM
"Stephen Fuld" <s.f...@PleaseRemove.att.net> wrote in message
news:hHkPe.133637$5N3.1...@bgtnsc05-news.ops.worldnet.att.net...

>
> ""Torben Ægidius Mogensen"" <tor...@app-6.diku.dk> wrote in message
> news:7zll2qt...@app-6.diku.dk...
>
> snip
>
>> we need either to use
>> languages where parallelism is either explicit (and verifiably
>> independent)
>
> I'd like to explore this further. ISTM that there are only a few
> languages that have any explicit way to declare parallelism and that they
> differ in the "level" of parallelism they express. For example, I believe
> that Occam expresses parallelism at a very low level (even evaluation of a
> single expression can be specified to be in parallel. But COBOL offers
> explicit parallelism at the level similar to a full thread. I'm not sure
> which, if either of these is the "right" thing to do.
>
> So, the questions are
>
> - Is there an existing language that expresses parallelism at the right
> level for multi-core/multithreaded core to take advantage of?


Yes: APL.


> Is it easy for programmers to use the language productivly?


Yes: this is why APL and its descendants (APL2, A, J, K) are
the darlings of the financial community.


> - If not what would such a language look like?


It would look pretty much like APL with a couple of extensions.


> That is, what level of parallelism is appropriate? How would the
> semantics/syntax be designed to make program development easiest and most
> bug free?


The builtin exception handling mechanism would look up the name
of the developer every time an exception triggers, and immediately
print out a pink slip ;-)

dk


Randy

Aug 25, 2005, 2:45:13 PM
Torben Ægidius Mogensen wrote:
> Joe Seigh <jsei...@xemaps.com> writes:
>
>>Torben Ægidius Mogensen wrote:
...

>>>I think what he says is that SIMD is good for GPUs because their
>>>workload suits this model, but that general-purpose CPUs that offer
>>>parallelism in the form of MIMD or other won't find many programs
>>>capable of exploiting said parallelism.
>>
>>But is he just saying this because he has the "embarrassingly parallel"
>>easy part. Why complain about the skill set out there when the scarcity
>>of skills is to your advantage. Get out of the graphics market, write
>>your own highly parallelized game engine and own the game industry which
>>makes more money than Hollywood. Piece of cake if you already know how
>>to write parallel programs.
>
> I don't think graphics engines are the problem, they are fairly easy
> to parallelise on GPUs (which are more or less designed for this). It
> is (almost) every other program that is the problem.
>
> Torben

Right. The problem is that too few independent threads are being
created in the average app (by the programmer and/or the compiler), thus
the second core remains idle.

How to fix this? Compilers of today's primary languages (C/C++, Java,
Fortran) are unable to identify substantial independent threads within
serial programs. OoO-CPUs can do this to some degree, but their
"threads" are very short (perhaps 5-10 instructions), or they arise from
the splitting of loops (which is how graphics pipelines get their SIMD
parallelism). Neither approach works scalably well in serial programs
(since their loops are few, and short threads don't delegate efficiently
onto other cores).

The answer? Threads will have to arise from 1) the introduction of new
languages that support implicit parallelism to produce longer threads,
or 2) a populist mastery of explicit PRAM-ish parallel programming
techniques within traditional languages (e.g. POSIX threads, OpenMP,
etc). The former is a problem since it's essentially impossible to sell
industry on the value of novel languages. The latter is a problem since
programmers are going to have to better understand how to identify and
minimize data and control *dependencies* in their code, and since few
tools exist to make this work easier (e.g. KAP's "Guide", now Intel's).

Maybe the best approach would be to explore a middle ground, some sort
of standardized extensions to existing languages that introduce
threading along with some coordination primitives that make their use a
lot prettier (and ideally enable validation). Both C++ and Java support
threads, but they're explicit, uncoordinated, and unvalidated. I think
what we need are native language extensions, like coroutines, that can
be coordinated axiomatically within the language (by defining data
dependencies, their scope, and allowing modification of those axioms as
the program runs and threads come and go and their execution priorities
evolve, as well as providing as much consistency validation as
possible). This might take form as a coherent alternative to OpenMP
that also spawns threads but puts more effort into constraining their
dependencies rather than just splitting loops.
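
For contrast, the loop-splitting style that OpenMP already standardises
(C++; the reduction clause is the one declared dependency here):

#include <cstdio>

int main()
{
    const int n = 1000000;
    double sum = 0.0;

    /* Split the iterations across threads; 'sum' is the declared dependency. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += i * 0.5;

    std::printf("%f\n", sum);
    return 0;
}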

Data-parallel C was in some ways similar to OpenMP. It was accepted as
an ISO(?) standard, but went nowhere in the early 90's (although it's
probably preferable to OpenMP). However, data parallel isn't the right
approach to programming general purpose multicore CPUs. At least SPMD
and probably MIMD will be necessary.

Randy

--
Randy Crawford http://www.ruf.rice.edu/~rand rand AT rice DOT edu

"If English was good enough for Jesus Christ, it ought to be good enough
for the children of Texas." -- Texas Governor Ma Ferguson (1924)

Chris Colohan

Aug 25, 2005, 3:29:47 PM
tor...@app-6.diku.dk (Torben Ægidius Mogensen) writes:
> Research in automatically recognising parallelism in programs written
> in mainstream languages has largely failed (with the exception of
> numerical code written in Fortran), so IMO we need either to use
> languages where parallelism is either explicit (and verifiably
> independent) or recognizable with methods sufficiently robust that
> small code changes won't break parallelism. The latter will probably
> require restrictions that almost amount to explicit declaration of
> parallelism, so the first way may be the way to go.

I hate to put a plug in for my own work, but there is an interesting
point in the middle. The choice is not just "explicit parallelism"
(with all of the pain of creating parallel programs) or "fully
automatic parallelism" (with no benefits if your compiler is unable to
find suitable threads).

There has been a whole load of work in the past 15 years or so on
speculative parallel architectures. The main work in this area
started with the Multiscalar architecture out of Wisconsin, and other
projects in the area include Hydra (Stanford), IACOMA (Illinois), RAW
(MIT), and STAMPede (my project at CMU). In these architectures you
mark thread boundaries in a sequential program (written in any
language, including C or even assembly), and the machine will execute
the code in parallel, while maintaining the original sequential
semantics. It does this by detecting data dependences which exist
between your threads and either partially or completely restarting
threads which consume mis-speculated values.

Under the thread-level speculation (TLS) programming model the
programmer starts with a sequential program, and simply marks where
new threads should be spawned. They run it on a TLS machine, and they
get a parallel _and correct_ execution. At first, the performance may
not be that good, since many threads will restart due to dependences.
The TLS machine can provide profile feedback which lists the most
frequent dependences, and the programmer can use this to optimize
their program -- they can either modify the code or they can change
their thread spawn points to avoid dependences. Using this method a
programmer can _gradually_ add parallelism to their program: first
they add threads, and then they only have to change the parts of their
code which are frequently executed _and_ have frequent dependences
between the threads.
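
As a sketch only (the spawn marker below is hypothetical, not the syntax of
any real TLS toolchain), the programming model amounts to an ordinary
sequential loop with epoch boundaries marked:

/* TLS_SPAWN is a hypothetical marker, defined as a no-op here so that this
   is also just a valid sequential program (which is the point of the model). */
#define TLS_SPAWN()

struct Item { int value; };

void process(Item &it) { it.value += 1; }   /* ordinary sequential code */

void process_all(Item *items, int n)
{
    for (int i = 0; i < n; i++) {
        TLS_SPAWN();        /* on TLS hardware each iteration runs speculatively; */
        process(items[i]);  /* cross-iteration dependences trigger restarts, so   */
    }                       /* the original sequential semantics are preserved    */
}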

I have been working on applying these techniques to database
transactions for my thesis, and will be presenting a paper on this at
VLDB next week. If you are interested, the paper (and a tech report
on the hardware aspects) are linked off of my home page:
www.colohan.com

Chris
--
Chris Colohan Email: ch...@colohan.ca PGP: finger col...@cs.cmu.edu
Web: www.colohan.com Phone: (412)268-4751

Scott A Crosby

Aug 25, 2005, 3:32:59 PM
On Thu, 25 Aug 2005 14:35:57 GMT, "Stephen Fuld" <s.f...@PleaseRemove.att.net> writes:

> ""Torben Ægidius Mogensen"" <tor...@app-6.diku.dk> wrote in message
> news:7zll2qt...@app-6.diku.dk...

> I'd like to explore this further. ISTM that there are only a few
> languages that have any explicit way to declare parallelism and that
> they differ in the "level" of parallelism they express. For
> example, I believe that Occam expresses parallelism at a very low
> level (even evaluation of a single expression can be specified to be
> in parallel. But COBOL offers explicit parallelism at the level
> similar to a full thread. I'm not sure which, if either of these is
> the "right" thing to do.

I'm not sure I like this. What if a programmer gets such a declaration
wrong? Immediate crash, random unreliability, or would the
consequences be pretty mild?

> So, the questions are
>
> - Is there an existing language that expresses parallelism at the right
> level for multi-core/multithreaded core to take advantage of? Is it easy
> for programmers to use the language productivly?
>
> - If not what would such a language look like? That is, what level of
> parallelism is appropriate? How would the semantics/syntax be designed to
> make program development easiest and most bug free?

A purely functional language has the property that once written, data
is never modified. Thus, in such a language, all evaluation except for
IO would be side-effect free and thus the store is no longer a
synchronization point. Even in languages that aren't purely functional,
like OCaml, most of the code (that I at least have written) is purely
functional or written in a functional style. That's the style that the
language encourages.

OCaml doesn't have any parallelising compilers that I know about, but I
will speculate on how it could be done.

The compiler can determine which functions are purely functional,
or 'pure', and which are not. This could involve just computing the
transitive closure of any function that's impure or invokes an impure
function. A more sophisticated analysis would perform an escape analysis
on any store that is mutated: even if a nominally impure function mutates
a store that cannot be shared, the function is effectively pure.

Because of the relatively low amount of mutation in the style in which
these languages are used, tracking pointers and alias analysis on the
few mutations that do occur should be much more robust.

The compiler can make sure that accesses to mutable state such as IO
or mutable store get appropriately ordered and synchronized. An algorithm
for doing this is that any thread that's about to alter mutable state
or do any IO is suspended until all impure threads that are earlier in
program order have run to completion. If the language spec states that
argument evaluation order is undefined, that relaxes the constraints
on ordering --- and the consequent synchronization requirements.

Scott

Mitch...@aol.com

Aug 25, 2005, 3:57:53 PM
I think the problem is different.

All synchronization in current machines is memory based. Memory latency
remains rather constant as processor frequency scales to ever higher
numbers. So 10 years ago, when CPUs were at 200 MHz and main memory was
150 ns away, one could conceivably perform 30 instructions between
synchronization events (and 300 instructions between synchronizations
is vastly more reasonable). Now we have 3GHz machines with 120ns memory
access times. So the granularity for synchronization is 20X worse today
than it was just 10 years ago! As the size of granulation changes, the
structure of the parallel computation should change to follow.
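
The rough arithmetic behind those numbers, counting one instruction per
cycle as a conservative floor:

  150 ns x 200 MHz =  30 cycles per memory round trip (then)
  120 ns x 3 GHz   = 360 cycles per memory round trip (now)

With several instructions issued per clock, the effective gap is wider still.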

In addition, contention during synchronization causes cubic (or worse)
amounts of memory traffic in order to pass successfully through a
critical region, even with non-blocking (or similar) programming
styles. This excess traffic slows down synchronization, and clogs up
the DRAM controllers with excess requests that don't lead to work
getting done, and wipes lines from caches, exacerbating the
interference itself!

Unless and until someone addresses the granulation problem and the
exponent of memory traffic interference, not much can be done at the
language level to do much more than ameliorate the problems. I have a
purported solution, that NDA-able people may get to hear about in 6
months or so.

Rupert Pigott

Aug 25, 2005, 5:15:17 PM

Mitch...@aol.com wrote:
> I think the problem is different.
>
> All synchronization in current machines is memory based. Memory latency
> remains rather constant as processor frequency scales to every higher
> numbers. So 10 years ago when CPUs were at 200 MHz and main memory was
> 150 ns away, one could conceivably perform 30 instructions between
> synchronization events (and 300 instructions between synchronizations
> is vastly more reasonable). Now we have 3GHz machines with 120ns memory
> access times. So the granularity for synchronization is 20X worse today
> than it was just 10 years ago! As the size of granulation changes, the
> structure of the parallel computation should change to follow.

That is a very good point, one I've been worrying about a bit
over the last decade. I think Jon Jakson has the right approach
here; he's looking at an MTA-style core with a flat memory model (as
far as latency goes). My feeling is that this approach could really
help because you have the option of using explicit *in-order*
execution to achieve zero-overhead comms/sync as well as the option
of the slower higher-order methods.

It's not a free lunch though, so there's the "Small Matter of
Programming" to take care of later, but at least the HW gives you
plenty of options when it comes to granularity.


Cheers,
Rupert

Nick Maclaren

Aug 25, 2005, 5:36:13 PM
In article <1124999873....@o13g2000cwo.googlegroups.com>,

<Mitch...@aol.com> wrote:
>I think the problem is different.
>
>All synchronization in current machines is memory based. ...

I wish :-(

The vast majority is far worse. Because of a lack of support by both
the hardware and operating system, most is system-call based. This
is sometimes "optimised" by using spin loops to attempt to use memory
as a poor man's substitute for the proper primitives. Both of these
approaches are unsatisfactory, at best.

But we don't even get that, because at least Unix schedulers are based
around clock interrupts, often with a huge interval. I don't think
that Microsoft ones are any better. As soon as you allow the scheduler
a look in, you are liable to be blocked for 10 ms - even if there is
no other process that should be run!


Regards,
Nick Maclaren.

Joe Seigh

Aug 25, 2005, 6:08:23 PM

Scheduler artifacts are fun to deal with. On OS X it appears that for
short sleeps using nanosleep() rather than setting a kernel timer
event, the thread just gets put on the ready queue, the assumption
being it would take you at least that much time to get dispatched again
based on cpu speed and dispatcher code path. Except if there are lots
of threads already on the run queue it could be a very long wait.

I think the previous poster is referring to synchronizing memory itself
rather than by some other mechanism like the TOD clock or direct
processor to processor communication.

David Gay

Aug 25, 2005, 7:03:36 PM

The FX project at MIT explored this in the late 80s... Google for
`"fx project" mit parallel' to get some useful references. My understanding
is that the results were not overwhelming. One major issue (alluded to
elsewhere in this thread) is deciding when to spawn a new thread vs execute
things in sequence.

--
David Gay
dg...@acm.org

JJ

Aug 25, 2005, 7:25:50 PM
What Rupert refers to will be presented at cpa2005 in a few weeks.

"New Transputer Design for FPGAs" although FPGAs are just a away to
prototype the architecture and not the main idea presented.

In effect a collection of 4 way threaded light PEs "ALT"ing into an 8
way threaded (or interleaved), hashed RLDRAM. These PEs themselves look
quite a bit like sparcs with a sliding register cache per thread, but
that's not important either; in fact any ISA would work here, even x86
:-/

The real idea is that threaded cpus with a conventional cache/DRAM
hierarchy are even worse than single-threaded cpus, putting even more
pressure on the cache, and hence more threads means less locality. They can
in theory churn through more ops, though, but only if they can be fed, and
yes, it would be a nightmare to synchronize these threads through
120ns DRAM.

The right way to help multithreaded cpus is to multithread the memory
too, which can be done with RLDRAM (8 way issue at a 2.5ns rate) and
effectively gives each PE a mem access of a few instruction slots
across the entire DRAM array, hence no data cache. It's not raw
bandwidth or latency that matters, it's the rate of uncorrelated memory
issues that can be dispensed across multiple banks on behalf of a number of
cooperating (or not) threads. This leads into a hashing MMU and
protected objects done in the clock cycles. Concurrency support for
occam etc is provided with protected objects.

For the application layer I am inclined to see if occam and Objective C
make more sense than C++, but that's another paper.

transputer2 at yahoo ....
johnjakson at usa ...

Rupert Pigott

Aug 25, 2005, 7:54:42 PM
Yikes, premature senility strikes... I missed an 'h' in
your name... Currently working with 3 "Jon"s which does
not help either. :P

Kees van Reeuwijk

Aug 25, 2005, 8:02:13 PM
Nick Maclaren <nm...@cus.cam.ac.uk> wrote:

> Not in the domains where people do that seriously - admittedly
> rather specialised. As I said at SunHPC, getting arbitrary
> parallel designs right is a nightmare task, but there are several
> restricted models that are feasible to debug. Experienced
> parallel programmers use one of them.
>
> For example, using lockstep models (BSP, MPI collectives etc.)
> is feasible. So are many classes of dataflow.

Another important model is divide-and-conquer, and its degenerate case
farmer-worker.

Some people argue that any programming model that restricts the
dependencies between tasks to planar graphs (SP graphs) simplifies
parallel programming enough to make it tractable. At least BSP and
divide-and-conquer indeed have this structure.
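
A minimal divide-and-conquer sketch in C++ (std::async used purely for
illustration): the split-then-join structure gives exactly the
series-parallel dependency shape mentioned above.

#include <cstddef>
#include <future>
#include <vector>

/* Divide-and-conquer sum: each split spawns one half and computes the other,
   so the dependency graph between tasks is series-parallel by construction. */
long sum(const std::vector<int> &v, std::size_t lo, std::size_t hi)
{
    if (hi - lo < 1024) {                      /* small enough: do it serially */
        long s = 0;
        for (std::size_t i = lo; i < hi; i++) s += v[i];
        return s;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async, [&] { return sum(v, lo, mid); });
    long right = sum(v, mid, hi);
    return left.get() + right;                 /* join: the only synchronisation */
}

int main()
{
    std::vector<int> v(1 << 16, 1);
    return sum(v, 0, v.size()) == (1 << 16) ? 0 : 1;
}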

Kelly Hall

Aug 25, 2005, 9:25:21 PM
Dan Koren wrote:
> Yes: this is why APL and its descendants (APL2, A, J, K) are
> the darlings of the financial community.

Over the years I've heard of a number of technologies being described as
the "darlings of the financial community". So far, my short list of
these darlings includes: NeXT computers / NeXTstep; Smalltalk, prolog,
expert systems, and now APL.

I've only known two actual people to have been programmers for banks: one
wrote in assembly for the System/370 mainframe, and the other wrote
in Visual Basic.

Kelly

JJ

Aug 25, 2005, 9:57:47 PM
I think APL has been used in business forever; lots of job ads
mentioned it. It's funny that Smalltalk and NeXTstep have Obj-C in common
too, without which NeXT probably wouldn't have happened.

APL is a neat language; at one time it was also used as a hardware
description language for a very old comp arch text, "Digital Computer
Systems Principles" by Hellerman (1967, 1973). It uses APL more to
describe microcode snips than to actually model hardware. There's a
section on IBM 360/370 architecture in there too.

JJ

Anne & Lynn Wheeler

Aug 25, 2005, 11:12:00 PM

cambridge science center ported part of apl\360 to cms for cms\apl.
apl\360 had a monitor and its own multitasking swapping monitor.
typical apl\360 workspaces were 16k to 32k bytes ... and the
multitasking monitor would swap the whole workspace at a time.
http://www.garlic.com/~lynn/subtopic.html#545tech

apl\360 had memory management that on every assignment assigned the
next available space (in the workspace) ... and marked any previous
allocation unused. when allocation reached the end of the workspace,
it would do garbage collection and start all over.

moving to cms\apl, the garbage collection had to be completely
reworked. cms\apl might allow several mbyte virtual memory page
workspace. the apl\360 strategy would appear like a page thrashing
program running in a "large" virtual memory environment. the other
notable thing done for cms\apl was a mechanism that allowed
interfacing to system calls. this caused some consternation among the
apl aficionados since it violated the purity of apl. this wasn't
rectified until the introduction of shared variables as a mechanism
for interfacing to system calls/functions.

early on, the science center provided some "large-memory" online apl
service to other parts of the corporation. one of the customers were
the business people from corporate hdqtrs which was using the large
memory apl capability to do what-if scenarios using the most sensitive
of corporate business information. this also offered some security
challenge because of the amount of access to the science center
system by students from various univ. & colleges in the boston area.

there was quite a bit of apl use for things that were later done using
spreadsheets.

one of the largest commercial time-sharing operations
http://www.garlic.com/~lynn/subtopic.html#timeshare

become the internal HONE system
http://www.garlic.com/~lynn/subtopic.html#hone

which provided world-wide online dataprocessing services to all
marketing, sales, field people. the services the HONE environment offered
were almost all implemented in APL (starting with cms\apl, evolving
into apl\cms, then apl\sv, etc). one specific application developed by
the science center was the performance predictor, which was a detailed
analytical model of system operation. salesmen could obtain workload and
configuration information from a customer and then ask what-if
questions about changes in workload and/or configuration.

starting with 370 115/125, a salesman could no longer even submit an
order w/o it having been processed by a HONE "configurator".

HONE was one of my hobbies for quite some time. when emea hdqtrs moved
from westchester to la defense outside paris ... i got to do some
amount of the work getting HONE up at running at the new data center
(at that time, there were three new bldgs ... not completely finished,
and the grounds around the bldgs was bare dirt .... landscaping work
was still in progress).

--
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/

Dan Koren

Aug 26, 2005, 12:00:10 AM
"Kelly Hall" <kh...@acm.org> wrote in message
news:5cuPe.701$5k1...@newssvr27.news.prodigy.net...

> Dan Koren wrote:
>> Yes: this is why APL and its descendants (APL2, A, J, K) are
>> the darlings of the financial community.
>
> Over the years I've heard of a number of technologies being described as
> the "darlings of the financial community". So far, my short list of these
> darlings includes: NeXT computers / NeXTstep; Smalltalk, prolog, expert
> systems, and now APL.


Not "now". It's been going on for a long time.


> I've only known two actul people to have been programmers for banks: one
> wrote in assembly for the System/370 mainframe, and the the other wrote in
> Visual Basic.
>


I am not referring to banks.

I am referring to large trading/investment
houses like Morgan-Stanley, Merrill-Lynch,
Goldman-Sachs, etc...

Morgan-Stanley even developed their own
APL version, and made it available as
open source.

dk

ram...@bigpond.net.au

Aug 26, 2005, 12:35:22 AM
Kelly Hall <kh...@acm.org> writes:

> as the "darlings of the financial community". So far, my short list
> of these darlings includes: NeXT computers / NeXTstep; Smalltalk,
> prolog, expert systems, and now APL.
> I've only known two actul people to have been programmers for banks:
> one wrote in assembly for the System/370 mainframe, and the the other
> wrote in Visual Basic.

The reality is much more prosaic.
Have a look at this message from another forum, about the
natural gas trading system at Enron.

-------------------------------------------------------------------------

> Reminds me of a project at Enron. The user application was Delphi,
> which used a VC DLL that used CORBA to communicate with a Java layer
> which then updated an Oracle database and returned the data written
> to it by reversing the pathway above. The user app then updated a
> second Oracle database with the same data written by the Java layer.

That would be Sitara, the physical Natural Gas trading system.

> The DB updated by the Delphi app used a somewhat well-normalized set
> of tables, but this DB was only used for reporting. The actual
> "serious" database, written to by the Java app, contained a single
> table. That table contained a single column, which was a
> VARCHAR2(2048) or some such nonsense; the same data returned to the
> Delphi app was written in a single row in this table.

That's it! There were actually two reporting DBs, with slightly
different schemas. (No idea why.) The reporting database started out as
a debugging tool. However, they soon realized that they needed to get
data out of Sitara for use in other systems, so the reporting DBs
became official. (And the main DB used CLOBS not varchar2s.)
Why did they store their data in long strings? Because that way they
could do whatever they wanted with their data model, without having a
DBA or Data Analyst get in their way. This decision was later
recognized as being a bad mistake.

Enron's trading systems were a maze of twisty passages, all
different. In addition to what was mentioned above, the data would go
from Sitara to the old Gas Trading system called CPR. Another system
called ERMS had views into the CPR tables, and was used to calculate
the trader's positions. The only problem was that CPR only got a
month's worth of data from Sitara, so if a deal was longer than a
month, you also booked it in the financial trading system (TAGG).
Simple, no?
---------------------------------

Amusing in the extreme.

Nick Maclaren

Aug 26, 2005, 2:58:10 AM
In article <ea6dnbEhhfu...@comcast.com>,

Joe Seigh <jsei...@xemaps.com> wrote:
>
>Scheduler artifacts are fun to deal with. On OS X it appears that for
>short sleeps using nanosleep() rather than setting a kernel timer
>event, the thread just gets put on the ready queue, the assumption
>being it would take you at least that much time to get dispatched again
>based on cpu speed and dispatcher code path. Except if there are lots
>of threads already on the run queue it could be a very long wait.

Yes. I haven't investigated serious parallelism on that system,
but would be surprised if it didn't have at least as many gotchas
as the ones I have looked at.

>I think the previous poster is refering to synchronizing memory itself
>rather than by some other mechanism like the TOD clock or direct
>processor to processor communication.

Well, yes, but there are lots of issues here. One is that parallel
compilers/libraries/whatever use the least unsuitable mechanisms
and another is that multiple synchronisation facilities interact
in nasty ways.

The ridiculous thing is that most of this is so unnecessary, but
it isn't soluble because the people who are designing the systems
don't have any direct communication with the people who know about
the requirements and solutions. All right, there are getting to be
damn few of the latter :-(


Regards,
Nick Maclaren.

Nick Maclaren

Aug 26, 2005, 3:00:33 AM
In article <1h1vp39.sn58yjk7bf7qN%reeu...@few.vu.nl>,

Kees van Reeuwijk <reeu...@few.vu.nl> wrote:
>Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
>
>> Not in the domains where people do that seriously - admittedly
>> rather specialised. As I said at SunHPC, getting arbitrary
>> parallel designs right is a nightmare task, but there are several
>> restricted models that are feasible to debug. Experienced
>> parallel programmers use one of them.
>>
>> For example, using lockstep models (BSP, MPI collectives etc.)
>> is feasible. So are many classes of dataflow.
>
>Another important model is divide-and-conquer, and its degenerate case
>farmer-worker.

Yes. I omitted those because you can fairly regard them as low
communication (i.e. trivial) cases. But why communicate if you
have nothing to say? :-)

>Some people argue that any programming model that restricts the
>dependencies between tasks to planar graphs (SP graphs) simplifies
>parallel programming enough to make it tractable. At least BSP and
>divide-and-conquer indeed have this structure.

It's a plausible viewpoint, but I think that I could produce fair
counter-examples. However, it gives a good indication of the
maximum level of complexity that mere mortals can handle.


Regards,
Nick Maclaren.

Joe Seigh

Aug 26, 2005, 9:21:17 AM
Nick Maclaren wrote:
> In article <ea6dnbEhhfu...@comcast.com>,
> Joe Seigh <jsei...@xemaps.com> wrote:
>
>>Scheduler artifacts are fun to deal with. On OS X it appears that for
>>short sleeps using nanosleep() rather than setting a kernel timer
>>event, the thread just gets put on the ready queue, the assumption
>>being it would take you at least that much time to get dispatched again
>>based on cpu speed and dispatcher code path. Except if there are lots
>>of threads already on the run queue it could be a very long wait.
>
>
> Yes. I haven't investigated serious parallelism on that system,
> but would be surprised if it didn't have at least as many gotchas
> as the ones I have looked at.

Synchronization standards such as Posix pthreads don't define
forward progress let alone minimum standards for it. This means
implementations can vary widely in performance. Some of the
common thread design patterns exist to get around poorly performing
implementations: thread pools and other forms of lighter-weight
threading, to mention a few.

>
>
>>I think the previous poster is refering to synchronizing memory itself
>>rather than by some other mechanism like the TOD clock or direct
>>processor to processor communication.
>
>
> Well, yes, but there are lots of issues here. One is that parallel
> compilers/libraries/whatever use the least unsuitable mechanisms
> and another is that multiple synchronisation facilities interact
> in nasty ways.
>
> The ridiculous thing is that most of this is so unnecessary, but
> it isn't soluble because the people who are designing the systems
> don't have any direct communication with the people who know about
> the requirements and solutions. All right, there are getting to be
> damn few of the latter :-(
>

It's worse. I mess around with lock-free stuff (like hazard pointers
which don't need memory barriers in smp environments) and there are no
communication channels to the hw vendors to discuss issues. Intel
*wants* people to start exploiting their multi-core stuff so they
can sell them but they don't support the people who are trying to
make this possible in a scalable manner.

Even the research and academic area is squirrely. I don't work
in that area so I don't know what's going on for sure. There
seem to be turf and clique things going on with the various groups
of researchers. I can't reveal the few details I do know about.

Stephen Fuld

Aug 26, 2005, 12:49:47 PM

"Greg Lindahl" <lin...@pbm.com> wrote in message
news:430e0315$1...@news.meer.net...

> In article <hHkPe.133637$5N3.1...@bgtnsc05-news.ops.worldnet.att.net>,
> Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>
>>I'd like to explore this further. ISTM that there are only a few
>>languages
>>that have any explicit way to declare parallelism and that they differ in
>>the "level" of parallelism they express.
>
> After 30 years of research, there are more than just a few which have
> been proposed, experimented with, and spilled much ink on papers about
> the results, etc.

Yes, but, as demonstrated by some of the posts in this thread, there is some
new research in different directions.

> I'm not sure what you hope to accomplish with a comp.arch discussion
> of such a huge body of work.

Pretty much what has happened. :-) (I don't claim credit for that, as most
appear to be responses to the OP, not mine, but the effect is still good.) Some
information on current status of old ideas and some information and pointers
to newer research.

--
- Stephen Fuld
e-mail address disguised to prevent spam


Nick Maclaren

Aug 26, 2005, 1:05:05 PM
In article <LKHPe.678440$cg1.5...@bgtnsc04-news.ops.worldnet.att.net>,

Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>>
>> After 30 years of research, there are more than just a few which have
>> been proposed, experimented with, and spilled much ink on papers about
>> the results, etc.
>
>Yes, but, as demonstrated by some of the posts in this thread, there is some
>new research in different directions.

New research, perhaps. Different directions, no. I haven't seen
anything that isn't work on approaches that have already been flogged
to death several times before[*].

We know where we can get starting from third-generation von Neumann
languages, and it isn't anywhere very useful. C and C++ may be
hopeless, but Fortran isn't much better. We know where we can get
with improved paradigms (including strictly constrained variants
of Fortran and even C or C++), and it is a lot further, but only
about as far as we can get with the more abstract programming
languages.

We know that it isn't generally feasible to get to those paradigms
starting from a large, complex program that is written to a serial
von Neumann design. And we know the reluctance of the programming
community (often the beancounters, rather than the programmers) to
move away from serial von Neumann.

There is certainly a possibility that someone will make a breakthrough
where hundreds of people have failed before, but I am not holding my
breath waiting for it. I know of NO approach that has a significant
chance of success until and unless we are prepared to abandon the
serial von Neumann paradigm.

What new directions have been proposed? If I missed one, please
remind me.

[*] Yes, they have all the characteristics of zombie mules.


Regards,
Nick Maclaren.

Nick Maclaren

Aug 26, 2005, 1:52:20 PM
In article <ucloe7l...@cilento.stampede.cs.cmu.edu>,

Chris Colohan <colo...@cs.cmu.edu> wrote:
>
>I hate to put a plug in for my own work, but there is an interesting
>point in the middle. The choice is not just "explicit parallelism"
>(with all of the pain of creating parallel programs) or "fully
>automatic parallelism" (with no benefits if your compiler is unable to
>find suitable threads).

Well, yes. That is essentially what OpenMP does, and has been a
standard method for at least 25 years. What it does is to decouple
the issue of identifying parallelism (and its converse, aliasing)
and say that it is someone else's problem. Note that I am NOT saying
that isn't justifiable, merely that it shouldn't be claimed to be a
complete solution.

>There has been a whole load of work in the past 15 years or so on
>speculative parallel architectures. The main work in this area
>started with the Multiscalar architecture out of Wisconsin, and other
>projects in the area include Hydra (Stanford), IACOMA (Illinois), RAW
>(MIT), and STAMPede (my project at CMU). In these architectures you
>mark thread boundaries in a sequential program (written in any
>language, including C or even assembly), and the machine will execute
>the code in parallel, while maintaining the original sequential
>semantics. It does this by detecting data dependences which exist
>between your threads and either partially or completely restarting
>threads which consume mis-speculated values.

Right. The basic approach is much older than 15 years, but the one
of detecting aliasing and restarting when it is encountered is fairly
new. However, it has three obvious fundamental problems:

1) There has to be a clean definition of aliasing, and it has
to be detectable. Despite what you say above, you CAN'T do it in C,
for reasons I describe in my Objects diatribe (please ask for a copy
if you want). You might be able to do it in a suitable variant of C,
if you have the stomache to define one :-(

2) The threads have to be such that they ARE restartable. That
imposes a pretty considerable constraint on the programming paradigm,
and cannot practically be done in any third generation von Neumann
language, without imposing draconian restrictions.

3) It will be effective only if such aliasing is sparse. The
behaviour of such retry algorithms is well understood, and they decay
catastrophically as the number of conflicts increases. Again, this
imposes constraints on the programming paradigm.

>Under the thread-level speculation (TLS) programming model the
>programmer starts with a sequential program, and simply marks where
>new threads should be spawned. They run it on a TLS machine, and they
>get a parallel _and correct_ execution. At first, the performance may
>not be that good, since many threads will restart due to dependences.

I am sorry, but I don't believe that bald statement. You must add
extra constraints or mechanisms to ensure that threads ARE restartable.
If you take a checkpoint, you are multiplying the store requirements
(even excluding the evil problem of non-memory side-effects, which
are so critical for real programs).

How do you ensure that you can restore a thread's state to its initial
one when restarting it? Memory, file states, CPU flags, signals and
all?

>The TLS machine can provide profile feedback which lists the most
>frequent dependences, and the programmer can use this to optimize
>their program -- they can either modify the code or they can change
>their thread spawn points to avoid dependences. Using this method a
>programmer can _gradually_ add parallelism to their program: first
>they add threads, and then they only have to change the parts of their
>code which are frequently executed _and_ have frequent dependences
>between the threads.

Bitter experience is that, even when that is possible, it leads to
very poor use of parallelism and low scalability. Unless you design
in parallelism (often by NOT designing in seriality), you usually
end up with no more than a factor of 2-4 improvement. Few practical
programmers will bother to modify their programs for that level of
improvement, as it is easier to wait a couple of years.

There are problem areas where this is not true, but not all that many.

>I have been working on applying these techniques to database
>transactions for my thesis, and will be presenting a paper on this at
>VLDB next week. If you are interested, the paper (and a tech report
>on the hardware aspects) are linked off of my home page:
>www.colohan.com

Ah. Now, THAT I can believe. Typical database designs (well, good
ones, anyway) have well-defined aliasing rules and their primitive
operations are designed to be retriable. Also many of their uses
have fairly sparse aliasing in practice.


Regards,
Nick Maclaren.

David Gay

Aug 26, 2005, 2:17:10 PM

I think you're missing the bit about "hardware support" here. The
conflicts are defined in terms of virtual addresses, and the processor
fully supports restoring itself to an earlier state. There are a few
minor (well ok, not so minor) issues of course:
- can the hardware really rollback arbitrarily far?
- what about aliasing of virtual addresses to each other?
- what about I/O? (you mentioned that one :-))

Chris's paper talks about how some of these (and your next point) are
resolved in a real body of code; it's definitely worth a read.

> 3) It will be effective only if such aliasing is sparse. The
> behaviour of such retry algorithms is well understood, and they decay
> catastrophically as the number of conflicts increases. Again, this
> imposes constraints on the programming paradigm.

--
David Gay
dg...@acm.org

Nick Maclaren

Aug 26, 2005, 2:32:53 PM
In article <s71irxs...@beryl.CS.Berkeley.EDU>,

David Gay <dg...@beryl.CS.Berkeley.EDU> wrote:
>>
>> 2) The threads have to be such that they ARE restartable. That
>> imposes a pretty considerable constraint on the programming paradigm,
>> and cannot practically be done in any third generation von Neumann
>> language, without imposing draconian restrictions.
>
>I think you're missing the bit about "hardware support" here. The
>conflicts are defined in terms of virtual addresses, and the processor
>fully supports restoring itself to an earlier state. There are a few
>minor (well ok, not so minor) issues of course:
>- can the hardware really rollback arbitrarily far?
>- what about aliasing of virtual addresses to each other?
>- what about I/O? (you mentioned that one :-))

No, I didn't. You are right that they aren't minor - in particular,
it is very likely that a thread will update a large proportion of
its data before a conflict is discovered. That was the point of my
remark about checkpoints.

There are also non-memory CPU states (e.g. IEEE 754 modes and flags),
signals and numerous other 'hidden' data. The existence of these is
why checkpoint/restart for arbitrary programs has failed dismally
every single time it has been introduced over a period of 35+ years.
I fail to see that this approach is all that different - if it claims
to be able to handle arbitrary programs :-(

>Chris's paper talks about how some of these (and your next point) are
>resolved in a real body of code; it's definitely worth a read.

Yes, I intend to. As I said, database codes are one area where this
might work pretty well.


Regards,
Nick Maclaren.

Sander Vesik

Aug 26, 2005, 2:40:21 PM
Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
>
> There is certainly a possibility that someone will make a breakthrough
> where hundreds of people have failed before, but I am not holding my
> breath waiting for it. I know of NO approach that has a significant
> chance of success until and unless we are prepared to abandon the
> serial von Neumann paradigm.
>

But are we prepared? A large proportion of programmers manage to output
really bad code even for 'serial von Neumann' and tend to cope best
with very simplistic locking strategies if parallelism is needed in
any way. And they much prefer the 'lock it all' strategy, especially
after they can't debug the other ones.

>
> Regards,
> Nick Maclaren.
>

--
Sander

+++ Out of cheese error +++

David Gay

Aug 26, 2005, 2:53:06 PM

nm...@cus.cam.ac.uk (Nick Maclaren) writes:
> In article <s71irxs...@beryl.CS.Berkeley.EDU>,
> David Gay <dg...@beryl.CS.Berkeley.EDU> wrote:
> >>
> >> 2) The threads have to be such that they ARE restartable. That
> >> imposes a pretty considerable constraint on the programming paradigm,
> >> and cannot practically be done in any third generation von Neumann
> >> language, without imposing draconian restrictions.
> >
> >I think you're missing the bit about "hardware support" here. The
> >conflicts are defined in terms of virtual addresses, and the processor
> >fully supports restoring itself to an earlier state. There are a few
> >minor (well ok, not so minor) issues of course:
> >- can the hardware really rollback arbitrarily far?
> >- what about aliasing of virtual addresses to each other?
> >- what about I/O? (you mentioned that one :-))
>
> No, I didn't. You are right that they aren't minor - in particular,
> it is very likely that a thread will update a large proportion of
> its data before a conflict is discovered. That was the point of my
> remark about checkpoints.
>
> There are also non-memory CPU states (e.g. IEEE 754 modes and flags),
> signals and numerous other 'hidden' data.

Well I would presume that those ones would be handled by an in-CPU
implementation. I'm more worried about the ones outside the CPU, which
I think could all be reasonably categorised as I/O.

--
David Gay
dg...@acm.org

glen herrmannsfeldt

Aug 26, 2005, 4:44:58 PM
Nick Maclaren wrote:

(snip)

> There are also non-memory CPU states (e.g. IEEE 754 modes and flags),
> signals and numerous other 'hidden' data. The existence of these is
> why checkpoint/restart for arbitrary programs has failed dismally
> every single time it has been introduced over a period of 35+ years.
> I fail to see that this approach is all that different - if it claims
> to be able to handle arbitrary programs :-(

I remember reading about Checkpoint/Restart in OS/360 many years ago,
long before I had any programs where it would be useful.

Well, I learned a lot about OS design from reading IBM manuals such
as Supervisor Services and Macro Instructions, and other manuals
related to details about OS/360. I even read the PCP (Fixed Task
Supervisor) Program Logic Manual which also may have described
Checkpoint/Restart.

I don't think I have ever known anyone to actually use it, though.

-- glen

Hank Oredson

unread,
Aug 26, 2005, 5:18:27 PM8/26/05
to
"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:denn8l$3c5$1...@gemini.csx.cam.ac.uk...


This whole discussion reminds me of DBMS work on concurrency.

One might need locking at db, table, row, field level.
There might be multiple pending updates at each level.
The update / commit / rollback paradigm has been well-explored.
Recovery without transaction loss is almost always required.
Etc.

Perhaps there are some lessons from that world that apply
to the lower level issues under discussion in this thread ?
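
For instance, the optimistic update / commit / rollback shape, boiled down
to a toy (the 'row' and its version counter here are invented for the
sketch, not taken from any particular DBMS):

#include <pthread.h>

/* A toy 'row' with optimistic concurrency control: note the version at
   the start, re-check it at commit time, and retry on conflict. */
struct row {
    pthread_mutex_t m;
    long version;
    long value;
};

void row_update(struct row *r, long delta)
{
    for (;;) {
        /* "update": take a snapshot and compute the pending change */
        pthread_mutex_lock(&r->m);
        long seen = r->version;
        long pending = r->value + delta;
        pthread_mutex_unlock(&r->m);

        /* ... arbitrary work based on the snapshot could happen here ... */

        /* "commit": only if no other transaction got in first */
        pthread_mutex_lock(&r->m);
        if (r->version == seen) {
            r->value = pending;
            r->version++;
            pthread_mutex_unlock(&r->m);
            return;
        }
        /* "rollback": discard the pending change and retry */
        pthread_mutex_unlock(&r->m);
    }
}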

"Beavis ... heh heh ... he said "thread" ... heh heh."

--

... Hank

http://home.earthlink.net/~horedson
http://home.earthlink.net/~w0rli


Rupert Pigott

unread,
Aug 26, 2005, 5:22:23 PM8/26/05
to

My feeling is that we are *not* prepared because we are using
von-Neumann tools to solve || problems. The fact that embedded
and hardware types are so different in their approach to SW
speaks volumes to me.

Many of the problems I have seen with "serial von-Neumann" have
been a result of people trying to cram inherently parallel
constructs into sequential models. Hardware types and || types
tend to feel very frustrated by the von-Neumann tools they are
forced to use to model and implement inherently parallel systems.

People talk about the expressive power of C/C++. Sure, it's great
for sequential stuff, but it provides precisely zero expression
for real world constructs such as a simple timeout on I/O. This
is *not* an isolated one-in-a-million situation, it crops up all
the time and yet it simply is not part of the language model.
Instead we have a bunch of libraries (eg: Pthreads) that require
the toolchain to work outside of the language spec to implement
correctly... That is asking for trouble, right ?

That is a gap that needs to be filled. You can't reasonably expect
people to build a moon-rocket with an Adze and a plank of wood.

Hell, I'm not saying anything new and I get the feeling that I
am preaching to the choir. :(

Cheers,
Rupert

Stephen Fuld

unread,
Aug 26, 2005, 6:38:12 PM8/26/05
to

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:deni41$m5i$1...@gemini.csx.cam.ac.uk...

How about the one discussed by Chris Colohan in his post? I don't claim
that it is radically new, but I haven't seen anything like it before, and,
if it is successful, it seems to offer a way to get there (or at least closer to
there) from here (where here is existing sequential programs).

Stephen Fuld

unread,
Aug 26, 2005, 6:39:58 PM8/26/05
to

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:denksk$s1d$1...@gemini.csx.cam.ac.uk...

snip

> Bitter experience is that, even when that is possible, it leads to
> very poor use of parallelism and low scalability. Unless you design
> in parallelism (often by NOT designing in seriality), you usually
> end up with no more than a factor of 2-4 improvement. Few practical
> programmers will bother to modify their programs for that level of
> improvement, as it is easier to wait a couple of years.

Two comments.

First, for a start, a factor of 2-4 may be good enough. We are talking here
about "typical", (i.e. not highly tuned HPC) application taking abvantage of
a 2-4 core implementation. If it gets typical programmers started and shows
them, through that experience what kinds of things work well and what kinds
of things cause problems (through the feedback mechanism that Chris
proposed) that will lead to more improvement.

Second, while in the past one could wait a couple of years and expect a 2-4
times improvement from process, etc., that time seems to be past. I expect
that will cause more emphasis on exploiting the parallelism for general
applications.

Again, I don't claim an advantage for the kinds of programs you deal with, as
they have already had the "easy" parallelism (and often much of the "hard"
parallelism) built in.

Stephen Fuld

unread,
Aug 26, 2005, 6:39:59 PM8/26/05
to

"Dan Koren" <dank...@yahoo.com> wrote in message
news:430e0dc3$1...@news.meer.net...
> "Stephen Fuld" <s.f...@PleaseRemove.att.net> wrote in message
> news:hHkPe.133637$5N3.1...@bgtnsc05-news.ops.worldnet.att.net...

snip

>> - Is there an existing language that expresses parallelism at the right
>> level for multi-core/multithreaded core to take advantage of?
>
>

> Yes: APL.

I don't claim to be any kind of expert on APL, so please be gentle with your
answer. Doesn't APL parallelism come from its ability to express
essentially matrix operations with a single operator, that is, a clean
expression of data parallel mechanisms? Does it provide any mechanism for
multiple parallel control operations? I'm not deprecating data parallelism,
just pointing out that it is somewhat limited.

Joe Seigh

unread,
Aug 26, 2005, 6:40:42 PM8/26/05
to
Rupert Pigott wrote:
> People talk about the expressive power of C/C++. Sure, it's great
> for sequential stuff, but it provides precisely zero expression
> for real world constructs such as a simple timeout on I/O. This
> is *not* an isolated one-in-a-million situation, it crops up all
> the time and yet it simply is not part of the language model.
> Instead we have a bunch of libraries (eg: Pthreads) that require
> the toolchain to work outside of the language spec to implement
> correctly... That is asking for trouble, right ?

It's worse than that. Posix pthreads has no formal semantics.
So there's no way to formally prove correctness of implementation.
They're working on thread support in the C++ language (not the
C language). They're working on a memory model for C++.
I'm not sure how it will work out.

I've worked with i/o timeouts in a multi-threaded environment.
It depends on what you're doing, I suppose. How should timed
i/o be expressed?

I've done stuff where I could say it would be nice to have
certain language features, or certain functions in the OS
or hardware. I don't think the problem is with coming up
with these things or saying how they should be expressed.
It's getting consensus on these things. A bit difficult when
some people won't even explicitly state or define what
they're talking about. I know what restarting a thread
means and possible ways of implementing it in software at least.
I have no idea what some people mean when they use that term.
It's easier for me to work on the problems I think are interesting
because I at least know what they are, as opposed to problems
people refuse to define.

Message has been deleted
Message has been deleted
Message has been deleted

Rupert Pigott

unread,
Aug 26, 2005, 7:26:35 PM8/26/05
to
Joe Seigh wrote:
> Rupert Pigott wrote:
> > People talk about the expressive power of C/C++. Sure, it's great
> > for sequential stuff, but it provides precisely zero expression
> > for real world constructs such as a simple timeout on I/O. This
> > is *not* an isolated one-in-a-million situation, it crops up all
> > the time and yet it simply is not part of the language model.
> > Instead we have a bunch of libraries (eg: Pthreads) that require
> > the toolchain to work outside of the language spec to implement
> > correctly... That is asking for trouble, right ?
>
> It's worse than that. Posix pthreads has no formal semantics.
> So there's no way to formally prove correctness of implementation.

Nick and others have gnashed their teeth muchly on this point. :)

> They're working on thread support in the C++ language (not the
> C language). They're working on a memory model for C++.
> I'm not sure how it will work out.

Good ! Let's hope for the best, eh ? :)

> I've worked with i/o timeouts in a multi-threaded environment.
> It depends on what you're doing, I suppose. How should timed
> i/o be expressed?

OCCAM did it thus.

PRI ALT
  tim ? AFTER timeout
    SEQ
      -- timed out
  in ? v
    SEQ
      -- process input

It's not big, not clever, but it covers a fair amount of the
trivial cases that folks encounter in day-to-day coding. I
figure that C++ syntax might be able to cover it with a
try/catch construct, but that is not as neat & tidy.
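
For contrast, the nearest C gets today is something entirely outside the
language, via an OS call - a sketch along these lines (POSIX poll(); the
return-value conventions are invented for the example):

#include <poll.h>
#include <unistd.h>

/* Read with a timeout: nothing in the C or C++ *language* expresses this;
   it all happens through an OS-specific library call. */
ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd p;
    p.fd = fd;
    p.events = POLLIN;

    int r = poll(&p, 1, timeout_ms);
    if (r == 0)
        return -2;              /* timed out (convention invented here) */
    if (r < 0)
        return -1;              /* poll() itself failed */
    return read(fd, buf, len);  /* input is ready, so read it */
}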

Cheers,
Rupert

JJ

unread,
Aug 26, 2005, 7:34:07 PM8/26/05
to

anonymous wrote:
> JJ wrote:
> > What Rupert refers to will be presented at cpa2005 in a few weeks.
> >
> > "New Transputer Design for FPGAs" although FPGAs are just a away to
> > prototype the architecture and not the main idea presented.
> >
> > In effect a collection of 4 way threaded light PEs "ALT"ing into an 8
> > way threaded (or interleaved), hashed RLDRAM. These PEs themselves look
> > quite a bit like sparcs with register sliding reg cache per thread but
> > thats not important either, in fact any ISA would work here even x86
> > :-/
> >
> > The real idea is that threaded cpus with conventional cache DRAM
> > hierarchy are even worse than single threaded cpus putting even more
> > pressure on the cache and hence more threads, less locality. They can
> > in theory churn through more ops though but only if they can be fed and
> > it yes it would be a nightmare to syncronize these threads through
> > 120ns DRAM.
> >
> > The right way to help multithreaded cpus is to multithread the memory
> > too, which can be done with RLDRAM (8 way issue every 2.5ns rate) and
> > effectively gives each PE a mem access of a few instructions slots
> > across the entire DRAM array hence no data cache. It's not raw
> > bandwidth or latency that matters, it's the rate of uncorrelated memory
> > issues that can be dispensed across multiple banks on behalf of a number
> > of cooperating (or not) threads. This leads into a hashing MMU and
> > protected objects done in the clock cycles. Concurrency support for
> > occam etc provided with protected objects.
> >
> > For the application layer I am inclined to see if occam and Objective C
> > makes more sense rather than C++, but thats another paper
> >
> > transputer2 at yahoo ....
> > johnjakson at usa ...
>
> RLDRAM reads as a better replacement for Direct Ram Bus DRAM for my
> VLIW SMP MPP FORTH theory. Thank you.

At least someone is paying attention. Micron isn't, though; the only way to
get their attention is to talk in terms of networking. Rambus is also
talking up XDR2 for 4? way interleaving too but with conventional
horrid latencies. At least you can prototype RLDRAM at 75% of best speed
with high end FPGA at 300MHz access rates and you could even fake the
memory model in SRAM by imposing artificial bank constraints with
timers. That will allow you to do low end prototyping with Spartan SRAM
boards and change the banking ratio to anything you like. There is also
(Fast Cycle) FCDRAM but I think it's the same networking architecture
story. A Micron FAE even told me they had been asked to increase banking
interleave to 128 by some customers, that would be sweet, almost no
collisions. And next year or '07 they have RLDRAM3 coming with all numbers
trimmed 25%, sub-2ns memory issue rates, 15ns latency if you can drive
a 533MHz DDR bus.

johnjakson at usa ...
transputer2 at yahoo ....

Joe Seigh

unread,
Aug 26, 2005, 8:34:18 PM8/26/05
to
Andi Kleen wrote:
> Mitch...@aol.com writes:
>
>
>>All synchronization in current machines is memory based. Memory latency
>>remains rather constant as processor frequency scales to every higher
>>numbers. So 10 years ago when CPUs were at 200 MHz and main memory was
>>150 ns away, one could conceivably perform 30 instructions between
>>synchronization events (and 300 instructions between synchronizations
>>is vastly more reasonable). Now we have 3GHz machines with 120ns memory
>>access times.
>
>
> Hmm - but don't some cache line coherency protocols with more states
> than MESI (like MOESI) allow cache line transfer between CPUs without
> going through memory? In that case the 2x120ns latency would be
> avoided and replaced with the presumably shorter CPU<->CPU latency
> time. Of course the dirty cache line would still need to be eventually
> written back to memory. But that write wouldn't be on the critical path
> and it could be done slowly in the background and just turn into an easier
> "bandwidth problem" compare to a hard "latency problem".
>
> With suitable instructions other tricks might be possible, e.g. the
> new MONITOR/MWAIT on x86 look like they could theoretically optimize
> much more in this space. e.g. it could tell the other CPU to toss
> you over the cache line as soon as it has changed.
>
Some of the lock-free stuff tolerates stale data so you could reduce
cache traffic even further with a slightly different protocol that
let the software have more control over invalidating the cache lines.
Message has been deleted

Nick Maclaren

unread,
Aug 27, 2005, 6:09:09 AM8/27/05
to
In article <1125098795.0...@g43g2000cwa.googlegroups.com>,
Rupert Pigott <dark...@hotmail.com> wrote:

>Joe Seigh wrote:
>>
>> It's worse than that. Posix pthreads has no formal semantics.
>> So there's no way to formally prove correctness of implementation.
>
>Nick and others have gnashed their teeth muchly on this point. :)

That is definitely a matter of public record :-)

>> They're working on thread support in the C++ language (not the
>> C language). They're working on a memory model for C++.
>> I'm not sure how it will work out.
>
>Good ! Let's hope for the best, eh ? :)

Yes. I have joined the BSI C++ panel specifically for this matter.
When the relevant person produces a draft, I shall try to help to
make it tight enough to use without constraining optimisation too
much. This might be tricky ....


Regards,
Nick Maclaren.

Nick Maclaren

unread,
Aug 27, 2005, 6:14:44 AM8/27/05
to
In article <oRMPe.138087$5N3....@bgtnsc05-news.ops.worldnet.att.net>,

I had heard of such approaches. For the reasons I gave in a previous
posting, I am not convinced that it can be made to work in general.

My point isn't that it isn't a useful tool for the back-end (i.e. the
implementation of the parallelism), but that it doesn't help much for
the front-end (i.e. the identification of parallelism and detection
of aliasing). I can see no way to do that that does not involve
requiring programmers to change the paradigms that they use, and 40
years of research and development has failed to get anywhere with
starting from the serial von Neumann paradigm.


Regards,
Nick Maclaren.

Nick Maclaren

unread,
Aug 27, 2005, 6:39:17 AM8/27/05
to
In article <2TMPe.679967$cg1.1...@bgtnsc04-news.ops.worldnet.att.net>,

Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
>news:denksk$s1d$1...@gemini.csx.cam.ac.uk...
>
>> Bitter experience is that, even when that is possible, it leads to
>> very poor use of parallelism and low scalability. Unless you design
>> in parallelism (often by NOT designing in seriality), you usually
>> end up with no more than a factor of 2-4 improvement. Few practical
>> programmers will bother to modify their programs for that level of
>> improvement, as it is easier to wait a couple of years.
>
>Two comments.
>
>First, for a start, a factor of 2-4 may be good enough. We are talking here
>about "typical", (i.e. not highly tuned HPC) application taking abvantage of
>a 2-4 core implementation. If it gets typical programmers started and shows
>them, through that experience what kinds of things work well and what kinds
>of things cause problems (through the feedback mechanism that Chris
>proposed) that will lead to more improvement.

Well, that has been said more times than I care to recall, and has
proved to be the converse of the truth (in both respects) every time.
I am almost certain, for reasons given below, that it is false.

>Second, while in the past one could wait a couple of years and expect a 2-4
>times improvement from process, etc., that time seems to be past. I expect
>that will cause more emphasis on exploiting the parallelism for general
>applications.

That is possibly true, but we have no historical record to guide us.
I am dubious, but not certain that it is false.

>Again, I don't claim an advantage for the kinds of programs you deal with, as
>they have already had the "easy" parallelism (and often much of the "hard"
>parallelism) built in.

I was actually referring more to 'general' programs, which (perhaps
surprisingly!) I have more experience with. There are several
fundamental reasons why factors of 2-4 aren't enough:

1) That is not far away from the range that can be obtained by
simply changing the specification slightly, splitting an application
into multiple parts, and so on. And those approaches are much easier.
So it will apply only when that has not been done.

2) Adding the type of parallelism described is VERY error-prone
(for the reasons I gave), especially when implemented by the average
software house (with a high turnover of monkeys). Even when I extend
my own code, I often make a mistake, forget a constraint, and introduce
a bug. You CAN'T document everything!

3) The solution to a failure is often to switch off the parallelism
and specify a higher performance computer. While this MAY be getting
less easy, there is still a factor of 2-4 between the slowest and
fastest systems of a particular type based on a single process.

4) In the commercial arena, multiple CPUs have been widespread for
25 years and near-universal for 15. My remark about bitter experience
is based on a LOT of evidence.

5) Related to that, all experience is that the way to use 2-4 cores
for a 'normal' personal or server system is to use them to reduce
context switching and allow the kernel and I/O to run in parallel with
the application processes. Something that is often missed is that
this approach could EASILY be extended to use up to (say) 16 cores,
with very little impact on application code (but a large impact on
standards, especially POSIX).


An explanation of the last. Let us say that we change the I/O design
from a synchronous, copying one to a streaming, asynchronous one.
This would immediately enable the use of extra cores for speeding up
I/O - and, if it were done at every level, that would use a LOT for
the very high I/O commercial codes, networking etc.

The best thing is that it would be largely transparent to the simple
applications, because it can be implemented behind both Fortran and C
I/O. Indeed, I have done it, in both cases :-) It would need a
major change to most TCP/IP implementations, but almost no change to
the specification. And so on.
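
To give the flavour, a toy of the shape I mean - a helper thread streams
the data in ahead of the consumer, behind an interface that still looks
like an ordinary blocking read (initialisation and error handling omitted;
all names invented, and real versions would use a ring of buffers):

#include <pthread.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLK 65536

/* One-slot read-ahead: the helper fetches the next block while the
   application chews on the previous one.  len: -1 empty, 0 EOF, >0 full. */
struct stream {
    int fd;
    pthread_mutex_t m;
    pthread_cond_t cv;
    ssize_t len;
    char buf[BLK];
};

static void *reader(void *arg)
{
    struct stream *s = arg;
    for (;;) {
        char tmp[BLK];
        ssize_t n = read(s->fd, tmp, BLK);    /* overlaps with the consumer */
        pthread_mutex_lock(&s->m);
        while (s->len > 0)                    /* wait for the slot to drain */
            pthread_cond_wait(&s->cv, &s->m);
        if (n > 0)
            memcpy(s->buf, tmp, n);
        s->len = (n > 0) ? n : 0;             /* treat errors as EOF here   */
        pthread_cond_signal(&s->cv);
        pthread_mutex_unlock(&s->m);
        if (n <= 0)
            return 0;
    }
}

/* Looks like a plain blocking read to the application, but the data has
   usually been fetched already by the helper thread. */
ssize_t stream_read(struct stream *s, char *out)
{
    pthread_mutex_lock(&s->m);
    while (s->len < 0)
        pthread_cond_wait(&s->cv, &s->m);
    ssize_t n = s->len;
    if (n > 0) {
        memcpy(out, s->buf, n);
        s->len = -1;
        pthread_cond_signal(&s->cv);
    }
    pthread_mutex_unlock(&s->m);
    return n;
}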


Regards,
Nick Maclaren.

Alexander Terekhov

unread,
Aug 27, 2005, 10:54:59 AM8/27/05
to

Nick Maclaren wrote:
[...]

> When the relevant person produces a draft,

Don't wait.

http://jupiter.robustserver.com/pipermail/cpp-threads_decadentplace.org.uk

regards,
alexander.

P.S. http://www.hpl.hp.com/personal/Hans_Boehm/tmp/tremblant.pdf

<quote>

It has become increasingly clear that any specification will have
an unavoidable impact on compiler optimization. Some currently
common compiler optimizations need to be adapted to ensure thread
safety. But this also reinforces the urgency for thread support in
C++: Current implementations make it much harder than it should be
to write correct multithreaded code.

Our progress has been slowed by both the technical difficulties of
defining a memory model that is compatible with a high performance
atomics library, and by disagreements about the atomics library
itself.

</quote>
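
The textbook illustration (mine, not from the paper) is a compiler that
knows nothing about threads treating a shared flag as if only one thread
could ever touch it:

int done = 0;        /* set to 1 by another thread; no volatile, no lock */

void wait_for_done(void)
{
    while (!done)    /* with single-thread semantics the compiler may    */
        ;            /* hoist the load of 'done' out of the loop, i.e.   */
                     /* compile this as 'if (!done) for (;;);' - legal   */
                     /* for one thread, an infinite loop with two        */
}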

Stephen Fuld

unread,
Aug 27, 2005, 11:25:42 AM8/27/05
to

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:depeek$k5g$1...@gemini.csx.cam.ac.uk...

> In article <oRMPe.138087$5N3....@bgtnsc05-news.ops.worldnet.att.net>,
> Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>>
>>>
>>> What new directions have been proposed? If I missed one, please
>>> remind me.
>>
>>How about the one discussed by Chris Colohan in his post? I don't claim
>>that it is radically new, but I haven't seen anything like it before, and,
>>if it is successful, it seems to offer a way to get there (or at least
>>closer to there) from here (where here is existing sequential programs).
>
> I had heard of such approaches. For the reasons I gave in a previous
> posting, I am not convinced that it can be made to work in general.

Sorry, I saw your response to Chris only after I posted the previous.

> My point isn't that it isn't a useful tool for the back-end (i.e. the
> implementation of the parallelism), but that it doesn't help much for
> the front-end (i.e. the identification of parallelism and detection
> of aliasing). I can see no way to do that that does not involve
> requiring programmers to change the paradigms that they use, and 40
> years of research and development has failed to get anywhere with
> starting from the serial von Neumann paradigm.

I like the idea of the feedback mechanism he discussed as a "learning
mechanism". Hopefully, programmers will get some easy to use report that
indicates where parallel limiting performance occurred, and what to do about
it. As the proceed, they may be able to anticipate and thus internalize the
results, which would lead incrementally to better programming.

Stephen Fuld

unread,
Aug 27, 2005, 11:35:55 AM8/27/05
to

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:depfsl$ms8$1...@gemini.csx.cam.ac.uk...

OK, so even accepting that, it only "delays the inevitable" for a while
(assuming current trends continue).

> 2) Adding the type of parallelism described is VERY error-prone
> (for the reasons I gave), especially when implemented by the average
> software house (with a high turnover of monkeys). Even when I extend
> my own code, I often make a mistake, forget a constraint, and introduce
> a bug. You CAN'T document everything!

Right. That is why I like Chris' approach. As I understand it, it still
guarantees correctness (at least to the same level as the single thread
program), and gives feedback about where the proposed increased parallelism
failed. Comparing the before and after change reports should indicate a
problem and where it might be.

> 3) The solution to a failure is often to switch off the parallelism
> and specify a higher performance computer. While this MAY be getting
> less easy, there is still a factor of 2-4 between the slowest and
> fastest systems of a particular type based on a single process.

But, again, pending more information, with Chris' approach it isn't an "all
or nothing" thing to turn on or off parallelism. You leave in the
directives where it helps, and remove them where the report says it is a
loss. It seems to be pretty easy to "play around" and move the directives
around to evaluate performance, all the time still guaranteeing
correctness.

> 4) In the commercial arena, multiple CPUs have been widespread for
> 25 years and near-universal for 15. My remark about bitter experience
> is based on a LOT of evidence.

Yes, for servers, where inter task parallelism is pretty easy, but not for
desktops. Whereas today there are relatively few existing multi-processor
desktops, within say 5 years, I expect there to be very few uniprocessor
desktops around.

Kees van Reeuwijk

unread,
Aug 27, 2005, 1:22:43 PM8/27/05
to
Nick Maclaren <nm...@cus.cam.ac.uk> wrote:

> In article <1h1vp39.sn58yjk7bf7qN%reeu...@few.vu.nl>,
> Kees van Reeuwijk <reeu...@few.vu.nl> wrote:
> >Another important model is divide-and-conquer, and its degenerate case
> >farmer-worker.
>
> Yes. I omitted those because you can fairly regard them as low
> communication (i.e. trivial) cases. But why communicate if you
> have nothing to say? :-)

It is still necessary to communicate task input and output data. Beyond
toy programs, that may be a large amount of data.

Nick Maclaren

unread,
Aug 27, 2005, 2:45:36 PM8/27/05
to
In article <43107EC3...@web.de>,

Alexander Terekhov <tere...@web.de> wrote:
>
>Nick Maclaren wrote:
>[...]
>> When the relevant person produces a draft,
>
>Don't wait.
>
>http://jupiter.robustserver.com/pipermail/cpp-threads_decadentplace.org.uk

Thanks. I will take a look when I get a moment.

< Hans Boehm quote>


>
>It has become increasingly clear that any specification will have
>an unavoidable impact on compiler optimization. Some currently
>common compiler optimizations need to be adapted to ensure thread
>safety. But this also reinforces the urgency for thread support in
>C++: Current implementations make it much harder than it should be
>to write correct multithreaded code.

Yeah. That sort of paragraph makes me feel that he understands the
issues.


Regards,
Nick Maclaren.

Nick Maclaren

unread,
Aug 27, 2005, 3:00:38 PM8/27/05
to
In article <vL%Pe.140436$5N3.1...@bgtnsc05-news.ops.worldnet.att.net>,

Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>
>> 2) Adding the type of parallelism described is VERY error-prone
>> (for the reasons I gave), especially when implemented by the average
>> software house (with a high turnover of monkeys). Even when I extend
>> my own code, I often make a mistake, forget a constraint, and introduce
>> a bug. You CAN'T document everything!
>
>Right. That is why I like Chris' approach. As I understand it, it still
>guarantees correctness (at least to the same level as the single thread
>program), and gives feedback about where the proposed increased parallelism
>failed. Comparing the before and after change reports should indicate a
>problem and where it might be.

And that is one of the things that I don't believe. Even ignoring
non-memory interactions, there is a BIG problem with partial update
of aggregates. If two threads update two fields in the same structure,
it can invalidate the structure, but the hardware will think that they
are unrelated.
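
A trivial sketch of the sort of thing I mean (invented for the purpose):
two updates that touch disjoint locations, share only a stale read-only
value, and between them break an invariant that either alone would preserve:

#define LIMIT 100

struct table {
    int headroom;     /* computed before the parallel section: LIMIT - total */
    int slot[2];      /* invariant: slot[0] + slot[1] <= LIMIT               */
};

void grow(struct table *t, int i)     /* run once in each thread, i = 0 or 1 */
{
    int add = t->headroom;            /* both threads read the same stale    */
    t->slot[i] += add;                /* value, and write disjoint slots     */
}

/* With headroom = 10, either call alone keeps the total within LIMIT.
   Run both and the total grows by 20: the invariant is gone, yet there is
   no overlapping write anywhere for the hardware to notice. */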

This doesn't just affect contiguous structures, incidentally, but
even ones like Dirichlet tesselations, B-trees and so on. Update
those in parallel, and one thread will be certain to invalidate the
invariants assumed by the other while the former's update is taking
place, leading to corruption. And, of course, the effect is a race
condition of an evil form ....

>But, again, pending more information, with Chris' approach it isn't and "all
>or nothing" thing to turn on or off parallelism. You leave in the
>directives where it helps, and remove them where the report says it is a
>loss. It seems to be pretty easy to "play around" and move the directives
>around to evaluate performance, all the time still guaranteesing
>correctness.

No, that works only with the detected problems. I am referring to the
more serious (and VERY common) ones where the result of parallelisation
is unrepeatable wrong answers. It is very common for binary chop to
fail to help - i.e. the problem appears and disappears more-or-less
independently of WHICH bit of code you serialise.

>> 4) In the commercial arena, multiple CPUs have been widespread for
>> 25 years and near-universal for 15. My remark about bitter experience
>> is based on a LOT of evidence.
>
>Yes, for servers, where inter task parallelism is pretty easy, but not for
>desktops. Wheras today there are relativly few existing multi-processor
>desktops, within say 5 years, I expect there to be very few uniprocessor
>desktops around.

The situation is no different there. Most desktops nowadays are used
to run GUIs, often to run an application that accesses the network.
There is a LOT of inter-task parallelisation that is feasible by a
redesign of the components (as I said, in the network case, without
even a change of specification).

That isn't possible for GUIs without fairly radical changes to the
specification, but I flatly don't believe that his approach will work
for GUIs. They are so unspeakably disgusting that even switching
on quite modest optimisation often causes them to fail in the almost
unlocatable way I describe above. Been there - been defeated by
that :-(


Regards,
Nick Maclaren.

Nick Maclaren

unread,
Aug 27, 2005, 3:03:00 PM8/27/05
to
In article <1h1yvl9.b1umwb1rr3d0mN%reeu...@few.vu.nl>,

Kees van Reeuwijk <reeu...@few.vu.nl> wrote:
>Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
>
>> In article <1h1vp39.sn58yjk7bf7qN%reeu...@few.vu.nl>,
>> Kees van Reeuwijk <reeu...@few.vu.nl> wrote:
>> >Another important model is divide-and-conquer, and its degenerate case
>> >farmer-worker.
>>
>> Yes. I omitted those because you can fairly regard them as low
>> communication (i.e. trivial) cases. But why communicate if you
>> have nothing to say? :-)
>
>It is still necessary to communicate task input and output data. Beyond
>toy programs, that may be a large amount of data.

True, but my experience is that they can rarely be done in parallel
with processing, except for the case of applications that are
essentially filters. And I fully agree that, in such cases, there
is considerable scope for parallelisation of a very simple form!
Whether you call it dataflow or divide-and-conquer is a matter of
taste - it is dead easy to code correctly.
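
For what it's worth, the whole of the 'hard part' of a farmer-worker
version is a few lines around a shared counter - a throwaway sketch with
Pthreads (sizes and the work function are invented):

#include <pthread.h>

#define NTASK 64
#define NWORKER 4

static double input[NTASK], output[NTASK];
static int next_task = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

static double expensive(double x) { return x * x; }   /* stand-in work */

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&m);
        int i = next_task++;              /* farm out the next undone task */
        pthread_mutex_unlock(&m);
        if (i >= NTASK)
            return 0;
        output[i] = expensive(input[i]);  /* tasks are independent */
    }
}

void run_farm(void)
{
    pthread_t t[NWORKER];
    for (int i = 0; i < NWORKER; i++)
        pthread_create(&t[i], 0, worker, 0);
    for (int i = 0; i < NWORKER; i++)
        pthread_join(t[i], 0);
}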


Regards,
Nick Maclaren.

Andreas Kaiser

unread,
Aug 27, 2005, 4:52:26 PM8/27/05
to
On Fri, 26 Aug 2005 22:39:59 GMT, "Stephen Fuld"
<s.f...@PleaseRemove.att.net> wrote:

>Doesn't APL parallelism come from its ability to express
>essentially matrix operations with a single operator, that is, a clean
>expression of data parallel mechanisms?

Yes.

>Does it provide any mechanism for
>multiple parallel control operations?

No, control operations are extremely limited and largely write-only,
so programmers tend to replace control operations by data operations
as much as possible, even when it results in a 100-fold redundancy.
E.g. to compare a 100-line array to find the first matching line, APL
programmers generally compare all chars of all lines because there is
no parallelizable way to stop at the first mismatch/match.

So to some degree, APL programmers always did what programmers of
vector instruction set extensions do now, when they replace control by
data operations to avoid excessive latencies.
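
In C-ish terms the trade looks like this (just a sketch, sizes invented):

#include <string.h>

#define NLINES 100
#define WIDTH 80

/* Control version: stops at the first match, but every iteration depends
   on the previous one having decided not to return. */
int find_control(char lines[NLINES][WIDTH], const char *key)
{
    for (int i = 0; i < NLINES; i++)
        if (memcmp(lines[i], key, WIDTH) == 0)
            return i;
    return -1;
}

/* 'APL-style' version: compare every line unconditionally (all NLINES
   comparisons are independent, so they vectorise or parallelise), then
   reduce the result to the first matching index. */
int find_data(char lines[NLINES][WIDTH], const char *key)
{
    int match[NLINES];

    for (int i = 0; i < NLINES; i++)       /* the redundant, parallel part */
        match[i] = (memcmp(lines[i], key, WIDTH) == 0);

    int first = -1;
    for (int i = NLINES - 1; i >= 0; i--)  /* the reduction */
        if (match[i])
            first = i;
    return first;
}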

Gruss, Andreas

Rupert Pigott

unread,
Aug 28, 2005, 3:09:06 AM8/28/05
to
Nick Maclaren wrote:

> Yes. I have joined the BSI C++ panel specifically for this matter.
> When the relevant person produces a draft, I shall try to help to
> make it tight enough to use without constraining optimisation too
> much. This might be tricky ....

No kidding.

However, I firmly believe that making a language || friendly
will yield more optimisation opportunities. If nothing else
it'll give programmers a chance to do the right thing instead
of simply crossing their fingers and hoping for the best. :(

Cheers,
Rupert

Nick Maclaren

unread,
Aug 28, 2005, 5:24:11 AM8/28/05
to
In article <1125212946....@z14g2000cwz.googlegroups.com>,

Yup. But that's not all. It will restore signal handling to the state
of a usable facility in high-RAS programs, and should mean a lot less
wasted time arguing over whose bug an optimisation failure is.


Regards,
Nick Maclaren.

Andrew Reilly

unread,
Aug 28, 2005, 8:24:09 AM8/28/05
to
On Fri, 26 Aug 2005 17:52:20 +0000, Nick Maclaren wrote:
> 2) The threads have to be such that they ARE restartable. That
> imposes a pretty considerable constraint on the programming paradigm,
> and cannot practically be done in any third generation von Neumann
> language, without imposing draconian restrictions.

Why is this hard? Isn't this exactly what happens on every hardware
interrupt (over relatively short time frames) and when processes are
swapped out (over somewhat longer time frames)? If these events were
visible to the processes on current systems, there would be mayhem.[*]

Thread migration from processor to processor in SMPs amounts to restarting
too, as does process migration in clusters (although I'm not sure whether
or not that actually works in general, at least among clusters connected
by IP networks.)

[*] I have worked on a DSP platform where it wasn't (readily) possible to
save the extended precision guard bits in the floating point registers,
and so interrupts generally weren't allowed to do any floating point. The
alternative, which I ended up doing, was to allow the main-line code
to produce very slightly wrong results, with random, unreproducible
perturbations. This was not a problem in that particular context...

--
Andrew

Nick Maclaren

unread,
Aug 28, 2005, 9:31:08 AM8/28/05
to
In article <pan.2005.08.28...@areilly.bpc-users.org>,

Andrew Reilly <andrew-...@areilly.bpc-users.org> wrote:
>On Fri, 26 Aug 2005 17:52:20 +0000, Nick Maclaren wrote:
>> 2) The threads have to be such that they ARE restartable. That
>> imposes a pretty considerable constraint on the programming paradigm,
>> and cannot practically be done in any third generation von Neumann
>> language, without imposing draconian restrictions.
>
>Why is this hard? Isn't this exactly what happens on every hardware
>interrupt (over relatively short time frames) and when processes are
>swapped out (over somewhat longer time frames)? If these events were
>visible to the processes on current systems, there would be mayhem.[*]

You haven't been following the thread. The proposal isn't that threads
be interruptible and later restartable at the same point, but that
threads be restarted from some PREVIOUS point if it turns out that
they had invalid memory aliasing with another thread. A sort of
semi-automatic Alpha LL/STC.
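
The hand-rolled, single-word version of that idea is familiar enough (a
sketch using a GCC-style compare-and-swap builtin):

/* Compute speculatively from a snapshot, try to commit, and if some other
   thread touched the word in the meantime, discard the work and restart
   from the previous point - LL/SC writ small. */
int fetch_and_add(volatile int *p, int delta)
{
    for (;;) {
        int seen = *p;                     /* the 'load-linked' snapshot    */
        int want = seen + delta;           /* the speculative work          */
        if (__sync_bool_compare_and_swap(p, seen, want))
            return seen;                   /* 'store-conditional' succeeded */
        /* conflict detected: roll back to the snapshot and retry */
    }
}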


Regards,
Nick Maclaren.

Stephen Fuld

unread,
Aug 28, 2005, 11:23:40 AM8/28/05
to

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:deqd8m$k6i$1...@gemini.csx.cam.ac.uk...

> In article <vL%Pe.140436$5N3.1...@bgtnsc05-news.ops.worldnet.att.net>,
> Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>>
>>> 2) Adding the type of parallelism described is VERY error-prone
>>> (for the reasons I gave), especially when implemented by the average
>>> software house (with a high turnover of monkeys). Even when I extend
>>> my own code, I often make a mistake, forget a constraint, and introduce
>>> a bug. You CAN'T document everything!
>>
>>Right. That is why I like Chris' approach. As I understand it, it still
>>guarantees correctness (at least to the same level as the single thread
>>program), and gives feedback about where the proposed increased
>>parallelism
>>failed. Comparing the before and after change reports should indicate a
>>problem and where it might be.
>
> And that is one of the things that I don't believe. Even ignoring
> non-memory interactions, there is a BIG problem with partial update
> of aggregates. If two threads update two fields in the same structure,
> it can invalidate the structure, but the hardware will think that they
> are unrelated.
>
> This doesn't just affect contiguous structures, incidentally, but
> even ones like Dirichlet tesselations, B-trees and so on. Update
> those in parallel, and one thread will be certain to invalidate the
> invariants assumed by the other while the former's update is taking
> place, leading to corruption. And, of course, the effect is a race
> condition of an evil form ....

Those are good points. I don't know how Chris and his colleagues plan to
address them and whether those plans will be successful. But I presume
that they haven't ignored them.

>>But, again, pending more information, with Chris' approach it isn't an
>>"all
>>or nothing" thing to turn on or off parallelism. You leave in the
>>directives where it helps, and remove them where the report says it is a
>>loss. It seems to be pretty easy to "play around" and move the directives
>>around to evaluate performance, all the time still guaranteeing
>>correctness.
>
> No, that works only with the detected problems. I am referring to the
> more serious (and VERY common) ones where the result of parallelisation
> is unrepeatable wrong answers. It is very common for binary chop to
> fail to help - i.e. the problem appears and disappears more-or-less
> independently of WHICH bit of code you serialise.
>
>>> 4) In the commercial arena, multiple CPUs have been widespread for
>>> 25 years and near-universal for 15. My remark about bitter experience
>>> is based on a LOT of evidence.
>>
>>Yes, for servers, where inter task parallelism is pretty easy, but not for
>>desktops. Whereas today there are relatively few existing multi-processor
>>desktops, within say 5 years, I expect there to be very few uniprocessor
>>desktops around.
>
> The situation is no different there. Most desktops nowadays are used
> to run GUIs, often to run an application that accesses the network.
> There is a LOT of inter-task parallelisation that is feasible by a
> redesign of the components (as I said, in the network case, without
> even a change of specification).

So as multi core or multi threaded CPUs become common on desktops, are you
predicting that the changes you discussed will be implemented in order to
take advantage of the new capabilities?

Thomas Lindgren

unread,
Aug 28, 2005, 11:44:50 AM8/28/05
to

nm...@cus.cam.ac.uk (Nick Maclaren) writes:

> You haven't been following the thread. The proposal isn't that threads
> be interruptible and later restartable at the same point, but that
> threads be restarted from some PREVIOUS point if it turns out that
> they had invalid memory aliasing with another thread. A sort of
> semi-automatic Alpha LL/STC.

There is an interesting recent Stanford project working on the
subject, more or less, "transactional coherence and consistency"
(tcc.stanford.edu). Seems promising to me.

Best,
Thomas
--
Thomas Lindgren
"It's becoming popular? It must be in decline." -- Isaiah Berlin

Nick Maclaren

unread,
Aug 28, 2005, 12:13:22 PM8/28/05
to
In article <0GkQe.690068$cg1.1...@bgtnsc04-news.ops.worldnet.att.net>,

Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>
>Those are good points. I don't know how Chris and his colleagues plan to
>address them and whether those plans will be successfull. But I presume
>that they haven't ignored them.

I am afraid that I presume that they have or that the posting that
started this thread considerably over-stated the generality of the
technique :-(

The trouble is that detecting such things in an arbitrary serial program
is halting problem equivalent ....

>So as multi core or multi threaded CPUs become common on desktops, are you
>predicting that the changes you discussed will be implemented in order to
>take advantage of the new capabilities?

They already are being. Whether they will be implemented competently,
let alone well, is another matter entirely.


Regards,
Nick Maclaren.

Carlie J. Coats

unread,
Aug 28, 2005, 4:08:04 PM8/28/05
to
Nick Maclaren wrote:
> In article <1h1yvl9.b1umwb1rr3d0mN%reeu...@few.vu.nl>,
> Kees van Reeuwijk <reeu...@few.vu.nl> wrote:
[snip...]

>>It is still necessary to communicate task input and output data. Beyond
>>toy programs, that may be a large amount of data.
>
>
> True, but my experience is that they can rarely be done in parallel
> with processing, except for the case of applications that are
> essentially filters. And I fully agree that, in such cases, there
> is considerable scope for parallelisation of a very simple form!
> Whether you call it dataflow or divide-and-conquer is a matter of
> taste - it is dead easy to code correctly.
>
>
> Regards,
> Nick Maclaren.

Let me tell you about an example that is very much NOT "applications
that are essentially filters" -- coupled environmental modeling
systems. For example, consider the way we are implementing the
WRF-Chem meteorology/emissions/atmospheric chemistry coupled system.
Text-only isn't the best way to describe it, but let me try...

The individual components are themselves parallelized HPC programs
(WRF is hybrid SMP/MPI; the SMOKE components are SMP). WRF feeds
meteorology data to SMOKE, which computes chemical emissions,
which then feed back into WRF-Chem, which does gridded atmospheric
chemistry and feedbacks from the atmospheric chemistry state into
the radiation balance (etc.) in the meteorology.

Since the natural parallelism is not commensurate for the various
programs, they are treated as separate internally-parallel
executables, distributed across available resources. Sparse
matrix technology is used for efficient transformation between
the "natural" domain decompositions of these programs.

The WRF-Chem executable has regular-grid-based parallelism.
The meteorology part produces time-stepped, regularly gridded
meteorology, which is input for SMOKE atmospheric emissions
programs:

SMOKE "laypoint" computes pollutant-plume rise for every point
source stack in the domain on the basis of this meteorology.
It is naturally parallel on the set of source stacks (e.g.,
industrial plant smoke-stacks, about 1.0e6 of them in a US-wide
domain, for example...)

SMOKE "tmpbeis" computes biogenic emissions (parallel on "land
use tracts" -- GIS-computed polygons that have uniform land cover
types), on the basis of this meteorology. It is parallel on the
set of land use tracts.

SMOKE "tmpmobil" computes mobile source emissions on the basis
of this meteorology. It is parallel on the Cartesian product of
the set of roadway links, vehicle types, and vehicle years.

SMOKE "tmppoint" and "tmparea" compute "standard-week" point- and
area-source emissions. These are independent of meteorology, and
for a given study are part of the study set-up.

SMOKE "emismrg" takes outputs from "laypoint", "tmpbeis",
"tmpmobil", "tmppoint" and "tmparea", combines them using the
appropriate set of sparse matrices, and feeds the result back
to WRF-Chem for atmospheric chemistry modeling.

Both the meteorology and the atmospheric chemistry parts are classic
"Blue Book" HPCC problems. Thanks to the efficiency of the sparse
matrix transforms, the emissions model is somewhat less computationally
demanding -- about 10% as expensive, which still means that you want to be
running it on ten or fifteen processors if you're doing serious air quality
forecast work :-) (and then 100 or more processors for the met and
chem...). None of it is just filters.


--

Carlie J. Coats, Jr. carlie...@baronams.com
Environmental Modeling Center carlie...@ncsu.edu
c/o North Carolina State University MEAS Department
1125 Jordan Hall Campus Box 8208
Raleigh, NC 27695 phone: (919)515-7076 / fax: (919)515-7802
"My opinions are my own, and I've got *lots* of them!"

Dr. Adrian Wrigley

unread,
Aug 28, 2005, 6:44:49 PM8/28/05
to
On Fri, 26 Aug 2005 14:22:23 -0700, Rupert Pigott wrote:

> Many of the problems I have seen with "serial von-Neumann" have
> been a result of people trying to cram inherently parallel
> constructs into sequential models. Hardware types and || types
> tend to feel very frustrated by the von-Neumann tools they are
> forced to use to model and implement inherently parallel systems.


>
> People talk about the expressive power of C/C++. Sure, it's great
> for sequential stuff, but it provides precisely zero expression
> for real world constructs such as a simple timeout on I/O. This
> is *not* an isolated one-in-a-million situation, it crops up all
> the time and yet it simply is not part of the language model.
> Instead we have a bunch of libraries (eg: Pthreads) that require
> the toolchain to work outside of the language spec to implement
> correctly... That is asking for trouble, right ?

I'm not sure why Ada hasn't been mentioned here, but it was designed
to address real-time/multiprocessor/multithreaded systems. It is
particularly good for high reliability systems, even with
very complex problems.

In particular:

Timeouts on computation and I/O are trivially easy to code and use
Partitioning across multiple (even heterogeneous) systems is amazingly simple
Multithreading is easy, with "protected" data structures etc.
No data can be aliased without it being easily known to the compiler
Code is highly portable, due to Ada's specification and design

plus there is a very high quality implementation in gcc (GNAT), which
generally runs equivalent programs about as fast or faster than gcc C.

Compared to Occam, Ada concurrency is generally used at a coarser
level, but filters/pipelines/task arrays etc. are all simple
to code, and generally work very reliably.

And of course, you get the basic necessities(?) of a rich templates
system, static and dynamic polymorphism, dynamic types,
user-defined distinct signed, unsigned and modular integers, user
defined fixed and floating point (and proper enumerated types).

Ada 95 is the mature standard offered by all compilers, and a
revision is due out shortly.

As a developer and user of Ada code in a real-time, mission critical
stock trading system, I find the compiler rejects many coding errors
that seem to plague current systems in C or C++. The result is
robust, efficient, maintainable code that does what's expected.

When I need to get a bit of extra speed out of code, increasing
the concurrency is usually quite easy with the built-in
parallelism constructs. Unfortunately, my hardware doesn't
offer as many threads as I could throw at it, so I don't explore
the limits of the parallelism available - but there is always "enough".

I think the future will produce increasingly many threads in hardware,
but only very slowly at first, because of the chicken/egg problem
with parallel software/hardware (still only a handful by 2010). There
will probably be increasingly many kludges to overcome the limitations of
software and languages. Eventually (2015), tolerably usable kludges will
become widespread, and demand for more hardware threads will spur a
"Moores Law" of threads for general purpose computing. Or maybe
there'll be a "breakthough".

Just a few thoughts.
--
Dr. Adrian Wrigley, Cambridge, UK.

Andrew Reilly

unread,
Aug 28, 2005, 7:14:40 PM8/28/05
to

Aah. Sorry, that is indeed a different kettle of fish, and I missed that
distinction. Probably too hard to do for any significant run of code
without support from explicit roll-back hardware, as Transmeta are reputed
to have included in the Crusoe.

Well, I guess you could probably do an interrupt-style state save at every
check point, and have the "speculative" code write to a scratch-pad space
for later copying back to main memory, but that sounds difficult to turn
into a winning proposition.
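
Something like this, in other words (a toy; the version counter and the
atomicity of the commit are exactly the bits that would need the hardware
help):

#include <string.h>

struct region {
    long version;        /* bumped by whoever commits a change */
    char data[4096];
};

/* Speculate: checkpoint, do the work against a scratch pad, and copy back
   only if nobody else committed in the meantime.  The two memcpy()s are
   the overhead that makes it hard to win, and the check-then-commit would
   itself have to be made atomic in real life. */
int speculate(struct region *shared, void (*work)(char *scratch))
{
    char scratch[4096];
    long snap = shared->version;                    /* the checkpoint    */

    memcpy(scratch, shared->data, sizeof scratch);
    work(scratch);                                  /* speculative writes
                                                       go to the pad     */
    if (shared->version != snap)
        return 0;                                   /* conflict: discard */

    memcpy(shared->data, scratch, sizeof scratch);  /* commit            */
    shared->version = snap + 1;
    return 1;
}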

--
Andrew

Joe Seigh

unread,
Aug 28, 2005, 8:30:18 PM8/28/05
to
Dr. Adrian Wrigley wrote:
> I think the future will produce increasingly many threads in hardware,
> but only very slowly at first, because of the chicken/egg problem
> with parallel software/hardware (still only an handful by 2010). There
> will probably be increasingly many kludges to overcome the limitations of
> software and languages. Eventually (2015), tolerably usable kludges will
> become widespread, and demand for more hardware threads will spur a
> "Moores Law" of threads for general purpose computing. Or maybe
> there'll be a "breakthough".
>
Intel is in a bit of a marketing bind. They need a "Moore's Law" of threads to
start right now. So do the other vendors. Unfortunately, it will be very
slow to spread unless somebody does something to make it spread faster.
Message has been deleted

Robert Myers

unread,
Aug 29, 2005, 6:13:55 AM8/29/05
to
Apparently Nick never got to whatever remote site he had to get to to
read the Sun paper on Scout threads, because the Sun hardware does all
the necessary work for run-ahead mode, where it's _all_ speculative.
One problem, as with any speculative optimization, is deciding whether
and when it's worth doing, and *that* has to be done in hardware, too.

RM

JJ

unread,
Aug 29, 2005, 10:00:17 AM8/29/05
to

anonymous wrote:

snipping

> > johnjakson at usa ...
> > transputer2 at yahoo ....
>
> Ho Ho He He HA HA HA
>
> want to see a laugh? For nine years of informing IBM Defense and
> Washington
> about VLIW SMP MPP FORTH chip(s) formula, a /parallel processor/ google
> news search still looks like this,
>
> http://news.google.com/news?hl=en&ned=us&ie=UTF-8&q=parallel+processor&btnG=Search+News&num=100
>
> ---

If you model it as a cycle-accurate C or Verilog simulation, you can
prototype it in an FPGA and show it to people. If it takes 9 years, maybe it's
harder than it seems. A simple C model should only take a few months,
FPGA about the same. Expect to get 100MHz or less. If you can't do either,
you are wasting your time.

>
> President Clinton is a jerk


This bit doesn't help, whether true or not.

JJ

unread,
Aug 29, 2005, 10:20:08 AM8/29/05
to

Dr. Adrian Wrigley wrote:
> On Fri, 26 Aug 2005 14:22:23 -0700, Rupert Pigott wrote:
>

snipping

> I think the future will produce increasingly many threads in hardware,
> but only very slowly at first, because of the chicken/egg problem
> with parallel software/hardware (still only an handful by 2010). There
> will probably be increasingly many kludges to overcome the limitations of
> software and languages. Eventually (2015), tolerably usable kludges will
> become widespread, and demand for more hardware threads will spur a
> "Moores Law" of threads for general purpose computing. Or maybe
> there'll be a "breakthough".
>
> Just a few thoughts.
> --
> Dr. Adrian Wrigley, Cambridge, UK.

How many threads would you consider to be too many per processor_memory
unit?

From where I see things, 4N threads matching into a threaded memory
architecture, N follows the ratio of all ops to load/store ops, which I
hazard to be 5-10. So around 32 threads can consume the entire issue
bandwidth of a single threaded RLDRAM system with issues every few ns.
No FPUs though.

And if the architecture supports rendezvous-style Occam, is Ada
concurrency mostly Occam-like or mostly based on the other concurrency
features available?

David Gay

unread,
Aug 29, 2005, 11:52:32 AM8/29/05
to

nm...@cus.cam.ac.uk (Nick Maclaren) writes:
> And that is one of the things that I don't believe. Even ignoring
> non-memory interactions, there is a BIG problem with partial update
> of aggregates. If two threads update two fields in the same structure,
> it can invalidate the structure, but the hardware will think that they
> are unrelated.

I'm always tempted to consider this a problem, but the hardware detectors
under consideration detect not only write-write conflicts, but also
read-write conflicts, preventing this class of problems. If two threads
never access overlapping regions of memory, they can't conflict.

--
David Gay
dg...@acm.org

Seongbae Park

unread,
Aug 29, 2005, 12:16:09 PM8/29/05
to
Alexander Terekhov <tere...@web.de> wrote:
>
> Nick Maclaren wrote:
> [...]
>> When the relevant person produces a draft,
>
> Don't wait.
>
> http://jupiter.robustserver.com/pipermail/cpp-threads_decadentplace.org.uk
>
> regards,
> alexander.
>
> P.S. http://www.hpl.hp.com/personal/Hans_Boehm/tmp/tremblant.pdf
>
><quote>
>
> It has become increasingly clear that any specification will have
> an unavoidable impact on compiler optimization. Some currently
> common compiler optimizations need to be adapted to ensure thread
> safety. But this also reinforces the urgency for thread support in
> C++: Current implementations make it much harder than it should be
> to write correct multithreaded code.

...and a thread/memory model specification that
doesn't deal with compiler issues would be at best half-baked
and most likely yet another POSIX-like scheme.

However, I'm not sure how much C++ can do
without sacrificing either portability or performance.
It would be extremely difficult to maintain both.
--
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/"

JJ

unread,
Aug 29, 2005, 1:11:07 PM8/29/05
to

Perhaps the concurrent C++ should be defined as a new || clean safe
strict simplified alias ptr free subset of C++, a dialect language with
containers and interfaces to link to dusty decks of unsafe C++. It
would be fully usable in its own right for new par or seq codes, but
protected by the highest walls. Such a language might naturally be
synthesizable into hardware by default, much as occam and Handel-C are,
which might be of interest to the HPC crowd.

Also fully able to support many real or virtual cpus, processor mapping
and any provided hardware protection per object. Over time some of that
dusty code might even get ported over to the much stricter model.

Overall the || dialect would still look like C/C++ but could be made as
rigorous (machine proofing) as one wants; after all, it won't get used by
the unwashed masses for another decade.

I think C|| is already taken IIRC.

my 2c

Nick Maclaren

unread,
Aug 29, 2005, 2:18:38 PM8/29/05
to
In article <1125310435.0...@g43g2000cwa.googlegroups.com>,

Robert Myers <rbmye...@gmail.com> wrote:
>>
>Apparently Nick never got to whatever remote site he had to get to to
>read the Sun paper on Scout threads, because the Sun hardware does all
>the necessary work for run-ahead mode, where it's _all_ speculative.

The only one I found was concentrated hand-waving. There may be one
with some meat on it available to the public, or even to the likes of
me, but none of my contacts have shown me one yet.

>One problem, as with any speculative optimization, is deciding whether
>and when it's worth doing, and *that* has to be done in hardware, too.

Grrk. No. You can have the compiler deciding when such activities
will help, and when they will not. That won't always fly, but is an
option worth considering.


Regards,
Nick Maclaren.

Nick Maclaren

unread,
Aug 29, 2005, 2:29:21 PM8/29/05
to
In article <s71acj0...@beryl.CS.Berkeley.EDU>,

No, not at all. Consider the following. You have a matrix A,
which is required to be positive definite at all times. A routine
R is defined that is passed A and its determinant, performs an
elementary transform of 4 elements and returns. It does NOT
recalculate the determinant, but ensures that the transform does
not make the matrix singular.

R is called in parallel on two widely separated parts of the
matrix. Both calculate the transform to make it JUST non-singular,
and their combination renders it singular!
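
Reduced to the simplest case I can make it (a diagonal matrix, so the
determinant is just the product of the diagonal; a sketch invented for
the purpose, not real code from anywhere):

#define N 8
#define EPS 1e-6

/* Invariant: det(A) >= EPS.  R scales one diagonal element, using the
   determinant value d that the caller computed earlier, and clamps the
   factor so that (it believes) the matrix stays JUST non-singular. */
void R(double A[N][N], double d, int i, double shrink)
{
    double factor = shrink;
    if (d * factor < EPS)
        factor = EPS / d;        /* keep det at or above EPS - it thinks */
    A[i][i] *= factor;           /* the only write: one element          */
}

/* The caller's shape:
 *     d = det(A);                            say d = 1.0
 *     R(A, d, 0, 1e-9);  R(A, d, 7, 1e-9);   run these two in parallel
 * Either call alone leaves det(A) = EPS.  Together, both clamp against
 * the same d = 1.0, each scales by 1e-6, and det(A) finishes at 1e-12:
 * the invariant is gone, although the two calls wrote disjoint elements
 * and shared nothing but the value of d. */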

This effect can also occur with Dirichlet tesselations, B-trees,
etc., when coded in certain ways.


Regards,
Nick Maclaren.

Dr. Adrian Wrigley

unread,
Aug 29, 2005, 3:00:57 PM8/29/05
to
On Mon, 29 Aug 2005 07:20:08 -0700, JJ wrote:

>
> Dr. Adrian Wrigley wrote:
>> On Fri, 26 Aug 2005 14:22:23 -0700, Rupert Pigott wrote:
>>
>
> snipping
>
>> I think the future will produce increasingly many threads in hardware,
>> but only very slowly at first, because of the chicken/egg problem
>> with parallel software/hardware (still only an handful by 2010). There
>> will probably be increasingly many kludges to overcome the limitations of
>> software and languages. Eventually (2015), tolerably usable kludges will
>> become widespread, and demand for more hardware threads will spur a
>> "Moores Law" of threads for general purpose computing. Or maybe
>> there'll be a "breakthough".
>>
>> Just a few thoughts.
>> --
>> Dr. Adrian Wrigley, Cambridge, UK.
>
> How many threads would you consider to be too many per processor_memory
> unit?

I think state for eight threads, saturating issue resources with
about four threads, is about right - at least for the implementations
I have looked into doing over the past five or six years.
Interesting to see you have come to similar conclusions (for a
different ISA choice).

I don't think there is a "too many", provided that the engineering effort
is directed to individual thread performance, and the extra threads
don't impact that primary goal.

> From where I see things, 4N threads matching into a threaded memory
> architecture, N follows the ratio of all ops to load/store ops, which I
> hazard to be 5-10. So around 32 threads can consume the entire issue
> bandwidth of a single threaded RLDRAM system with issues every few ns.
> No FPUs though.
>
> And if the architecture supports rendezvous style Occam, is ADA
> concurreny mostly Occam like or mostly based on the other concurrency
> features available?

Ada doesn't have 'par' and 'seq' constructs. This is a pity, since
these are relatively easy to use reliably, and can be implemented
with minimal overhead on appropriate hardware.

Ada's rendezvous are similar to Occam, I think. The Ada 83 standard
focused on these as the main form of synchronization. The current
Ada 95 added 'protected' data types, which are pretty straightforward
in use, basically locking individual records as needed.
This kind of data sharing has generally superseded rendezvous for
parallel processing in Ada. The great thing about protected data
types is that they (in conjunction with the scoping rules) completely(?)
eliminate risk of deadlock between tasks running code in program modules.
The compiler rejects source unless the implementations form a DAG.

Of course, there is a lot more to Ada concurrency than rendezvous
and protected types. The whole design is based around what
is actually needed in real-time and distributed systems, and
integrates well with things like exception handling etc.

See :
http://www.adaic.com/standards/95lrm/html/RM-9.html
for more (readable!) detail from the standard.
--
Adrian Wrigley, Cambridge, UK

Dan Koren

unread,
Aug 29, 2005, 2:58:26 PM8/29/05
to
"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:devk61$frn$1...@gemini.csx.cam.ac.uk...


More succinctly: inconsistencies (or other
undesirable effects) may occur as a result
of modifying different data (sub-)sets.

dk


David Gay

unread,
Aug 29, 2005, 3:20:15 PM8/29/05
to

Hmm. R is changing A's determinant, so running two copies in parallel with
a cached determinant value is clearly incorrect. Note that this discussion
started from some serial code, so the serial code which would produce this
behaviour would presumably look like:

d = det(A)
R(A, d, part1)
R(A, d, part2)

This is presumably incorrect, if R can change d's determinant.

In a more abstract sense, an invariant broken by threads T1 and T2
accessing and modifying separate parts of a data structure can only be
observed by a thread T3 which actually looks at at least some of the data
accessed by T1 and T2.

--
David Gay
dg...@acm.org

Nick Maclaren

unread,
Aug 29, 2005, 4:30:08 PM8/29/05
to
In article <s7164to...@beryl.CS.Berkeley.EDU>,

David Gay <dg...@beryl.CS.Berkeley.EDU> wrote:
>
>Hmm. R is changing A's determinant, so running two copies in parallel with
>a cached determinant value is clearly incorrect. Note that this discussion
>started from some serial code, so the serial code which would produce this
>behaviour would presumably look like:
>
> d = det(A)
> R(A, d, part1)
> R(A, d, part2)
>
>This is presumably incorrect, if R can change d's determinant.

A's determinant! It does not change d - that is the point. The
invariant that is broken is that d is not A's determinant on entry.

Now exactly how can hardware detect that, in general, without
solving the halting problem?

>In a more abstract sense, an invariant broken by threads T1 and T2
>accessing and modifying separate parts of a data structure can only be
>observed by a thread T3 which actually looks at at least some of the data
>accessed by T1 and T2.

Yes. But that doesn't help.


Regards,
Nick Maclaren.

David Gay

unread,
Aug 29, 2005, 6:24:00 PM8/29/05
to
nm...@cus.cam.ac.uk (Nick Maclaren) writes:

> In article <s7164to...@beryl.CS.Berkeley.EDU>,
> David Gay <dg...@beryl.CS.Berkeley.EDU> wrote:
> >
> >Hmm. R is changing A's determinant, so running two copies in parallel with
> >a cached determinant value is clearly incorrect. Note that this discussion
> >started from some serial code, so the serial code which would produce this
> >behaviour would presumably look like:
> >
> > d = det(A)
> > R(A, d, part1)
> > R(A, d, part2)
> >
> >This is presumably incorrect, if R can change d's determinant.
>
> A's determinant!

Of course. A typo (ok, ok, d has a determinant, but it really should
be obvious I meant A).

> It does not change d - that is the point. The
> invariant that is broken is that d is not A's determinant on entry.

Yes, and the programmer told the system to assume that in the (sequential)
code I wrote above.

> Now exactly how can hardware detect that, in general, without
> solving the halting problem?

I don't expect even a magical auto-parallelisation system to correct bugs.
So this is really a non-sequitur.

> >In a more abstract sense, an invariant broken by threads T1 and T2
> >accessing and modifying separate parts of a data structure can only be
> >observed by a thread T3 which actually looks at at least some of the data
> >accessed by T1 and T2.
>
> Yes. But that doesn't help.

Yes it does. It means that you can't introduce unobservable data races.

--
David Gay
dg...@acm.org

Scott A Crosby

unread,
Aug 29, 2005, 8:04:21 PM8/29/05
to
On 29 Aug 2005 18:29:21 GMT, nm...@cus.cam.ac.uk (Nick Maclaren) writes:

>>> And that is one of the things that I don't believe. Even ignoring
>>> non-memory interactions, there is a BIG problem with partial update
>>> of aggregates. If two threads update two fields in the same structure,
>>> it can invalidate the structure, but the hardware will think that they
>>> are unrelated.

IF the hardware does proper memory disambiguation and doesn't reorder
reads before writes to the same address, then it would detect this as
a data dependency and will eventually abort execution of the thread
that speculatively broke the invariant.

Concretely, threads A&B mutating some structure, B occurs after A in
program order. A&B do some work, then B does a speculative read of
location X and checks an invariant on it, some time later A writes to
location X. At that point, the speculative execution of B after that
read may be wrong and a recovery strategy would be invoked.
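
A toy C++ sketch of that rule (the per-thread read-set bookkeeping and
the names are invented; real proposals keep this state per cache line
and piggyback the check on the coherence protocol):

  #include <cstdint>
  #include <unordered_set>

  struct SpeculativeThread {
      std::unordered_set<std::uintptr_t> read_set;  // speculatively read addresses
      bool squashed = false;
  };

  // Every load performed by the speculative (later) thread is recorded.
  void on_speculative_read(SpeculativeThread& t, std::uintptr_t addr) {
      t.read_set.insert(addr);
  }

  // When the earlier thread commits a store, any later thread that has
  // already read that address may have used a stale value: squash it.
  void on_earlier_write(SpeculativeThread& later, std::uintptr_t addr) {
      if (later.read_set.count(addr))
          later.squashed = true;                     // re-execute from the read
  }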

Maybe I'm confused about your question. Can you explain how two
threads that access completely disjoint subsets of memory can break an
invariant?

> No, not at all. Consider the following. You have a matrix A,
> which is required to be positive definite at all times. A routine
> R is defined that is passed A and its determinant, performs an
> elementary transform of 4 elements and returns. It does NOT
> recalculate the determinant, but ensures that the transform does
> not make the matrix singular.

> R is called in parallel on two widely separated parts of the
> matrix. Both calculate the transform to make it JUST non-singular,
> and their combination renders it singular!
>

It looks to me that you have a data dependency, in that:

d' = R(A,d)
d'' = R(A,d')

So these tasks can't run in parallel unless you require that the
determinant doesn't change or d=d'. If you do require d=d' and the two
invocations of R access (*access* not just mutate) disjoint subsets of
locations, then this situation is buggy and will return a singular
matrix whether the two copies of R are run serially or in parallel.

Scott

Scott A Crosby

unread,
Aug 29, 2005, 8:11:07 PM8/29/05
to
On 29 Aug 2005 18:18:38 GMT, nm...@cus.cam.ac.uk (Nick Maclaren) writes:

> The only one I found was concentrated hand-waving. There may be one
> with some meat on it available to the public, or even to the likes of
> me, but none of my contacts have shown me one yet.

There is the Multiscalar research that was mentioned at the very
beginning of the thread, they have a raft of publications up at
http://www.cs.wisc.edu/~mscalar/publications.html including Manoj
Franklin's Ph.D. thesis on 'The Multiscalar Architecture'.

Scott

anonymous

unread,
Aug 29, 2005, 8:31:07 PM8/29/05
to

Mr. Moore already published his SMP MPP FORTH specification by the year
2000. I, mostly, finished mine in 1999. They merge together
beautifully. The news from IBM is slightly delayed.

---

Nick Maclaren

unread,
Aug 30, 2005, 4:30:32 AM8/30/05
to

In article <s711x4c...@beryl.CS.Berkeley.EDU>,

David Gay <dg...@beryl.CS.Berkeley.EDU> writes:
|> > >
|> > >This is presumably incorrect, if R can change d's determinant.
|> >
|> > A's determinant!
|>
|> Of course. A typo (ok, ok, d has a determinant, but it really should
|> be obvious I meant A).

Oh, yes, but the confusion is whether that means A's determinant
at the time or as stored in d!

|> > It does not change d - that is the point. The
|> > invariant that is broken is that d is not A's determinant on entry.
|>
|> Yes, and the programmer told the system to assume that in the (sequential)
|> code I wrote above.

Precisely.

|> > Now exactly how can hardware detect that, in general, without
|> > solving the halting problem?
|>
|> I don't expect even a magical auto-parallelisation system to correct bugs.
|> So this is really a non-sequitur.

But it's NOT a bug in the serial code, because the invariant is
maintained by the sequential consistency that is broken by the
parallelisation.

|> > >In a more abstract sense, an invariant broken by threads T1 and T2
|> > >accessing and modifying separate parts of a data structure can only be
|> > >observed by a thread T3 which actually looks at at least some of the data
|> > >accessed by T1 and T2.
|> >
|> > Yes. But that doesn't help.
|>
|> Yes it does. It means that you can't introduce unobservable data races.

Well, I suppose, in theory. I thought of designing such a constraint
into a language and it didn't take me long to realise that it was a
stupid idea. The problem is that almost all imperative programs have
a LOT of threadwise constant global data that is updated under
controlled circumstances. Distinguishing invalid from valid use is
what can't be done automatically.


Regards,
Nick Maclaren.

Nick Maclaren

unread,
Aug 30, 2005, 4:41:32 AM8/30/05
to

In article <oydzmr0...@bert.cs.rice.edu>,
Scott A Crosby <scr...@cs.rice.edu> writes:
|> On 29 Aug 2005 18:29:21 GMT, nm...@cus.cam.ac.uk (Nick Maclaren) writes:
|>
|> >>> And that is one of the things that I don't believe. Even ignoring
|> >>> non-memory interactions, there is a BIG problem with partial update
|> >>> of aggregates. If two threads update two fields in the same structure,
|> >>> it can invalidate the structure, but the hardware will think that they
|> >>> are unrelated.
|>
|> IF the hardware does proper memory disambiguation and doesn't reorder
|> reads before writes to the same address, then it would detect this as
|> a data dependency and will eventually abort execution of the thread
|> that speculatively broke the invariant.

No, not at all. How do two threads tell that two separate data
areas are really part of the same structure? Consider the examples
I gave of a B-tree, a Dirichlet tessellation and a matrix.

|> Concretely, threads A&B mutating some structure, B occurs after A in
|> program order. A&B do some work, then B does a speculative read of
|> location X and checks an invariant on it, some time later A writes to
|> location X. At that point, the speculative execution of B after that
|> read may be wrong and a recovery strategy would be invoked.
|>
|> Maybe I'm confused about your question. Can you explain how two
|> threads that access completely disjoint subsets of memory can break an
|> invariant?

You are making the assumption that there is a suitable, quick test
for the invariant. That is reasonable for a guideline on how to
program, but which CANNOT be done automatically by either
hardware or software, and sometimes cannot be done at all. In
the cases I gave as examples, there is no such test.

|> It looks to me that you have a data dependency, in that:
|>
|> d' = R(A,d)
|> d'' = R(A,d')
|>
|> So these tasks can't run in parallel unless you require that the
|> determinate doesn't change or d=d'. If you do require d=d' and the two
|> invocations of R access (*access* not just mutate) disjoint subsets of
|> locations, then this situation is buggy and will return a singular
|> matrix whether the two copies of R are run serially or in parallel.

Yes, but remember that the original program has a SINGLE copy, and
the invariant is maintained by the sequencing. The bug is caused
by the attempt to split the program in order to parallelise it.

This is my point about the inability of any such technique to
handle programs written using existing paradigms. It is perfectly
possible to autoparallelise languages where independence of
functional units (including order independence) is already a
feature and where such splitting has already been done.


Regards,
Nick Maclaren.

Nick Maclaren

unread,
Aug 30, 2005, 8:15:18 AM8/30/05
to

In article <43107EC3...@web.de>,
Alexander Terekhov <tere...@web.de> writes:
|> Nick Maclaren wrote:
|> [...]

|> > When the relevant person produces a draft,
|>

Well, yes, but that seems to be the usual weebling about threads.
I have very little direct interest in such things, and do not
have time to follow every debate on everything.

The point about the memory model is that it is a prerequisite
for ANY form of parallelism in C++ - and it is specifically a
draft of that that I am waiting for.


Regards,
Nick Maclaren.

Jan Vorbrüggen

unread,
Aug 30, 2005, 8:26:00 AM8/30/05
to
> But it's NOT a bug in the serial code, because the invariant is
> maintained by the sequential consistency that is broken by the
> parallelisation.

In that case, d (the current value of the determinant) is an INOUT
formal parameter, and (for instance) occam2's usage/alias checking
will detect this and complain about running two instances of R in
parallel (this is assuming it can detect that the parts of A being
modified are disjoint, which a sub-array syntax such as that of F90
could easily specify).

Jan

factory

unread,
Aug 30, 2005, 8:34:23 AM8/30/05
to
In article <1124982035.647275.201420@z14g2000cwz.googlegroups.com>,
ana...@earthlink.net says...
> Torben Ægidius Mogensen wrote:
> > IMO we need either to use
> > languages where parallelism is either explicit (and verifiably
> > independent)
>
> It isn't much of an exaggeration to say that programmers can't
> correctly use explicit parallelism.
>
> Yes, they think that they can, but most parallel programs with
> non-trivial interaction shows that they're wrong. (Yes, those
> programs seem to work much of the time, but they fail often
> enough to demonstrate that we're mostly lucky.)

We'll see; with the next generation of consoles,
programmers will be exposed to a machine on which they
will have to use some form of parallelism to stay
competitive.

- Factory

Nick Maclaren

unread,
Aug 30, 2005, 8:44:11 AM8/30/05
to

In article <3nj1igF...@individual.net>,

No, it ISN'T!!!!!

I did NOT, repeat NOT, specify that the operation recalculated
the determinant, nor that the requirement was any more than that the
variable contained the calculated value of the determinant ON
ENTRY.

The point about a huge number of typical programs (perhaps the
majority) is that there is a lot of information held, implicitly,
in the location in the code. Because the serial von Neumann
model states that there is a single position at all times, this
is reliable. In particular, in this case, the serial von Neumann
model guarantees the sequencing necessary for the assumption of
the program to be reliable, and the serial program to be correct.

However, as soon as you introduce ANY form of parallelisation or
even reordering, that guarantee fails. My point is that you
CANNOT take a program written to the serial von Neumann paradigm
and split it up, without analysing it for the use of such hidden
(implicit) information. And that such analysis is halting problem
equivalent.

And THAT is why I say that NONE of these approaches to executing
programs help with parallelising arbitrary existing code. There
is NO option (in general) to rewriting them in another paradigm.


Regards,
Nick Maclaren.

Torben Ægidius Mogensen

unread,
Aug 30, 2005, 9:08:15 AM8/30/05
to
nm...@cus.cam.ac.uk (Nick Maclaren) writes:


> Consider the following. You have a matrix A,
> which is required to be positive definite at all times. A routine
> R is defined that is passed A and its determinant, performs an
> elementary transform of 4 elements and returns. It does NOT
> recalculate the determinant, but ensures that the transform does
> not make the matrix singular.
>
> R is called in parallel on two widely separated parts of the
> matrix. Both calculate the transform to make it JUST non-singular,
> and their combination renders it singular!

The solution to this is to not allow explicit destructive updates -- R
returns a new matrix but does not overwrite the old. If one call to R
uses the result of the other, they can't be run in parallel (unless
you synchronize on the value of the first call).

Now, you are not likely to actually want to make copies of large
matrices, so you declare the matrix as single-threaded and make the
compiler check that the old version of the matrix will always be dead
when a new is produced. This is fairly old stuff in type theory
(google for "uniqueness types" and "linear types"). See for example
http://citeseer.ist.psu.edu/ennals04linear.html, which describes a use
of linear types for parallel processing.

Torben

Nick Maclaren

unread,
Aug 30, 2005, 9:15:44 AM8/30/05
to

In article <7zu0h7y...@app-6.diku.dk>,

tor...@app-6.diku.dk (Torben Ægidius Mogensen) writes:
|>
|> The solution to this is to not allow explicit destructive updates -- R
|> returns a new matrix but does not overwrite the old. If one call to R
|> uses the result of the other, they can't be run in parallel (unless
|> you synchronize on the value of the first call).

No. ONE solution to that is .... There are lots of others! But,
yes, it is a solution, and was hammered as the One True Solution
back in the 1960s and 1970s by the functional programming brigade.
Actually, it is a very GOOD solution, if its cost isn't too high,
but that still doesn't make it the only one!

|> Now, you are not likely to actually want to make copies of large
|> matrices, so you declare the matrix as single-threaded and make the
|> compiler check that the old version of the matrix will always be dead
|> when a new is produced. This is fairly old stuff in type theory
|> (google for "uniqueness types" and "linear types"). See for example
|> http://citeseer.ist.psu.edu/ennals04linear.html, which describes a use
|> of linear types for parallel processing.

Yes. It isn't very practical in HPC, but would work pretty well
in more normal cases. However, it only confirms my point:

If you want to introduce significant amounts of parallelism,
you have GOT to get away from the serial von Neumann paradigm.
And NO solution that claims to parallelise arbitrary programs
written in that paradigm will EVER work.

There are DOZENS of better paradigms, but the one that dominates
the world of IT is the ancient serial von Neumann one.


Regards,
Nick Maclaren.

Ken Hagan

unread,
Aug 30, 2005, 9:46:53 AM8/30/05
to
Nick Maclaren wrote:
>
> The point about a huge number of typical programs (perhaps the
> majority) is that there is a lot of information held, implicitly,
> in the location in the code.

I can't imagine a program that doesn't store *some* information
in the location in the code. If such a thing existed, it wouldn't
matter which instruction it executed next.

Nick Maclaren

unread,
Aug 30, 2005, 10:14:48 AM8/30/05
to

In article <df1o0t$k4f$1$8302...@news.demon.co.uk>,

Indeed. Prolog and Basic, anyone?

However, there is a difference between information that is stored
in the location, and is obviously so, and that which is hidden
except to a thorough analysis. And it is the latter that causes
the trouble.


Regards,
Nick Maclaren.
