[erlang-questions] Classification of concurrency patterns

24 views
Skip to first unread message

Joe Armstrong

unread,
May 12, 2010, 3:31:00 AM5/12/10
to Erlang
I'm interested in trying to classify and name a number of different
parallelization strategies.

We often use expressions like MPC "message passing concurrency" and
SPC "shared state concurrency" this is not what I'm interested in.

I'm interested in classifying and naming the strategies we use to
write parallel code. We could *implement* these strategies using MPC
or SSC.

I've chosen my own names for these below, I have no idea if there are
standard names for these classifications, this is what I've called them:

1) Divide and conquer concurrency
2) Pipeline concurrency
3) Map reduce concurrency
4) Identical Job concurrency
5) Grid concurrency
6) Job queue concurrency

They mean the following:

1) Divide and conquer concurrency

Divide the problem into K jobs which can be performed in parallel.
Perform the jobs
Recombine the results

2) Pipeline concurrency

Divide the problem into K jobs that can be performed in a pipeline.
The output of the first job must be the input of the second and so on.
Run all the steps in parallel

3) Map reduce concurrency

This is similar to 1). Only the recombination step involves
merging results that
have the same Key.

4) Identical Job concurrency

Here the problem does not have to be split into K parts, we
already have N identical
jobs to do (identical means jobs that will consume similar resources)

There is no recombination step. There are N results.

5) Grid concurrency

Divide the problem into N*M identical jobs that can be solved on a
NxM grid of processors.
Each processor can only talk to its direct neighbors

6) Job queue concurrency

Divide the problem into a set of named workers (who do jobs). Each
worker has an input
job queue. Workers consume jobs from their job queues and send the
result to other workers.

Note: [1] Paul Morrison calls this flow based programming
[2] This is AMQP


----


That was my first attempt at naming and classifying these patterns.

I'd be interested to know alternative names for the same things, and
names of systems that employ
the above strategies, and of course missing strategies.

If we could name and classify these things then we could start writing
gen_divide_and_conquer
gen_pipe etc.

/Joe

________________________________________________________________
erlang-questions (at) erlang.org mailing list.
See http://www.erlang.org/faq.html
To unsubscribe; mailto:erlang-questio...@erlang.org

--
You received this message because you are subscribed to the Google Groups "Erlang Programming" group.
To post to this group, send email to erlang-pr...@googlegroups.com.
To unsubscribe from this group, send email to erlang-programm...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/erlang-programming?hl=en.

Vlad Dumitrescu

unread,
May 12, 2010, 6:35:09 AM5/12/10
to Joe Armstrong, Erlang
Hello Joe,

On Wed, May 12, 2010 at 09:31, Joe Armstrong <erl...@gmail.com> wrote:
> I'm interested in trying to classify and name a number of different
> parallelization strategies.

> 1) Divide and conquer concurrency
> 2) Pipeline concurrency
> 3) Map reduce concurrency
> 4) Identical Job concurrency
> 5) Grid concurrency
> 6) Job queue concurrency

IMHO, #1, #3 and #4 are the same thing, differently parametrized. Do
you think it is important to get them separate identities?

Regarding #5, a more general description would involve a grid that is
not restricted to two dimensions. I think I remember seeing papers
about 3D grids.

#6 can be seen as a generalization of #2, where in the pipeline the
"next" worker is hardcoded and the structure is linear. A pipeline may
or may not use buffering queues between stages.

The way I see it, there are two main categories:
* dividing the data set into smaller chunks and processing them in
parallel (#1,3,4)
* dividing the work into specialized chunks and letting the data flow
to next chunk (#2,6)

These two can also be combined in several ways, giving for example #5
(usually each processor has its own chunk of input data, but data can
flow to neighboring processors). Or each processing unit in #1 could
involve a pipeline.

It depends on the purpose of defining these categories whether a
shorter, more general list or a more detailed one is required.

best regards,
Vlad

AJ Heller

unread,
May 12, 2010, 3:34:17 PM5/12/10
to Vlad Dumitrescu, Joe Armstrong, Erlang
I would certainly appreciate a single reference on high-level
distributed/parallel design patterns. I can find some of the patterns
you list in a handful of books and papers, but not all of them in one
place. Over the past year or so, I've searched for a compiled list
like you're suggesting, and I haven't turned up much. So even if such
a thing does exist, I'm probably not the only one who hasn't found it
yet.

While having implementations of these patterns would be great, I'd
just like to add that if a list of patterns must be compiled to begin
with, that list would be of great benefit to a broader audience than
just the erlang community. I'm thinking in particular of a GoF-style
of pattern analysis wherein you offer lists of common uses, strengths,
and drawbacks for each pattern. A work like that on high-level
parallel/distributed design patterns would be invaluable.

@Vlad: While I empathize with you, I do think the distinctions are
important. Taking myself as an example, I can still look at most of
the GoF patterns and argue that they're each a special case of the
Strategy Pattern (though I've learned to stop arguing about this). To
people that work with these patterns on a daily basis, the nuances
between the patterns are significant. So even though I don't yet grasp
many of the nuances myself, I can appreciate that other people do. So
long as there are significant differences between appropriate uses,
strengths, and pitfalls for a pattern and its variation, I'd argue
that the variation probably deserves its own entry in the list. Even
if they all still look like divide and conquer to me :) .

Vlad Dumitrescu

unread,
May 12, 2010, 4:32:32 PM5/12/10
to AJ Heller, Joe Armstrong, Erlang
Hi AJ,

On Wed, May 12, 2010 at 21:34, AJ Heller <a...@drfloob.com> wrote:
> @Vlad: While I empathize with you, I do think the distinctions are
> important. Taking myself as an example, I can still look at most of
> the GoF patterns and argue that they're each a special case of the
> Strategy Pattern (though I've learned to stop arguing about this). To
> people that work with these patterns on a daily basis, the nuances
> between the patterns are significant. So even though I don't yet grasp
> many of the nuances myself, I can appreciate that other people do. So
> long as there are significant differences between appropriate uses,
> strengths, and pitfalls for a pattern and its variation, I'd argue
> that the variation probably deserves its own entry in the list. Even
> if they all still look like divide and conquer to me :) .

I don't disagree. The only item that really isn't a separate one is #4
(identical jobs) which is identical to #1 - it only happens that the
number of jobs is the same as the number of processing units that I
have available here and now. On most other machines it would be
classified as #1.

best regards,
Vlad

Mariano Guerra

unread,
May 12, 2010, 8:32:51 PM5/12/10
to Vlad Dumitrescu, AJ Heller, Joe Armstrong, Erlang
On Wed, May 12, 2010 at 5:32 PM, Vlad Dumitrescu <vlad...@gmail.com> wrote:
> Hi AJ,
>
> On Wed, May 12, 2010 at 21:34, AJ Heller <a...@drfloob.com> wrote:
>> @Vlad: While I empathize with you, I do think the distinctions are
>> important. Taking myself as an example, I can still look at most of
>> the GoF patterns and argue that they're each a special case of the
>> Strategy Pattern (though I've learned to stop arguing about this). To
>> people that work with these patterns on a daily basis, the nuances
>> between the patterns are significant. So even though I don't yet grasp
>> many of the nuances myself, I can appreciate that other people do. So
>> long as there are significant differences between appropriate uses,
>> strengths, and pitfalls for a pattern and its variation, I'd argue
>> that the variation probably deserves its own entry in the list. Even
>> if they all still look like divide and conquer to me :) .
>
> I don't disagree. The only item that really isn't a separate one is #4
> (identical jobs) which is identical to #1 - it only happens that the
> number of jobs is the same as the number of processing units that I
> have available here and now. On most other machines it would be
> classified as #1.

I think that the first is a reference to a job where there is only one
result and the job is divided to handle different parts and then
recombine them.

for example you are rendering one frame with ray tracing and divide
the rendering in N horizontal parts and then recombine the result to
get the full frame.

the number 4 is more like having to render 24 frames to make 1 second
of an animation, you divide the rendering, but you get 24 different
results, you don't have to combine them later*

* except if you want to make a video, maybe this is not a good example
after all :D

Jay Nelson

unread,
May 14, 2010, 12:15:47 PM5/14/10
to Erlang Questions
Joe's list seems to me to address only one general category: parallel
computation. The distinctions in the list may or may not overlap,
but they are all computational parallelisms. Other terms for
computational concurrency that come to mind include PubSub and
Scatter / Gather (which are variants of some that were already listed).

The supervisor pattern only works with concurrency. It is an
Organizational Pattern rather than a Computational Pattern.

I would add the following types of concurrency as a separate category:

Organizational Concurrency
- Supervisor hierarchies
- Serial servers (ala gen_server)
- Event-driven reactors (spawn on each event)
- Process-striped buffers (gen_stream**)

Supervisors detect failure of other processes, therefore they must be
implemented using concurrency. The intelligence of system management
is distributed among the supervisors and defines the behavior of the
system under stress.

Serial servers are used to isolate state and provide distributed
asynchronous access to state (or a computational service) for a
series of unrelated clients. Servers also can be used to impose a
serial order on events, or provide distributed rendezvous control.

Event-driven concurrency is used in web servers so that there is no
contagion of errors from one client to another. It also ensures
memory leaks are not likely to affect the system since spawned
requests are short-lived. Any system which has variable response
time in reaction to events can benefit from concurrency by
overlapping computations in time.

**This weekend I will be checking in gen_stream which uses process-
striped buffering as a new example of organizational concurrency to
provide efficient access to a slow data stream by prefetching the
data. A large piece of data is segmented across processes in a round-
robin fashion much like RAID uses striping of disks.

jay

Eric Newhuis

unread,
May 14, 2010, 1:22:20 PM5/14/10
to Jay Nelson, Erlang Questions
Brilliant.
> round-robin fashion much like RAID uses striping of disks.

Guy Wiener

unread,
May 15, 2010, 4:49:18 AM5/15/10
to Joe Armstrong, Erlang Questions
Hello Joe, and everyone on the Erlang list,
A colleague of mine commented that this list contains only transactional
patterns (i.e, take data, process it and output it), but no reactive
patterns (i.e, a system that responds to events over time without a single
output or a predefined point of termination).

One can observe several patterns for parallel reactive architectures, namely
by the way the process collaborate:

1) Clique - Every process communicate with almost any other process. Each
message includes the sender, to distinguish from other messages.
2) Broadcast - Processes do not collaborate with other specific processes,
but broadcast messages without expecting a specific reply.
3) Subscribe/Notify - Processes broadcast messages to other pre-subscribed
processes.
4) Star - "Peripheral" processes collaborate through a small group
(potentially just one) of "central" processes, who decide for the rest what
to do.

One can also observe between two kinds of parallel reactive behavioral
protocols:

1) Independent - Each process decides for itself what is the next step,
regardless of other processes.
2) Consensus - All processes try to reach some agreement on the next step
(often using a broadcast or star pattern).

Would you consider these to be parallel patterns?

Best,
Guy Wiener.

Henning Diedrich

unread,
May 15, 2010, 10:29:52 AM5/15/10
to Erlang Questions
Joe's patterns had one, #5/Grid, with any communication between jobs.

The general train of thought starts out purely functional there I guess
with communication between jobs on the other end of the scale.

Maybe noting the difference between functional and non-functional "jobs"
as underlying building blocks of a pattern can clarify thinking.

You could have functional jobs, broadcasting jobs, receiving jobs,
implicitly and overtly blocking jobs. The latter may be a good early
indicator of anti-patterns were they appear in Joe's computing patterns,
while being potentially the norm in Guy's scenarios. They'd wait for
input to react to.

I can think of patterns where all processes are connected to
out-of-system resources and receiving signals from there, to then act.
Others, e.g. simulations, will receive no input on the part of the jobs
but still be non-functional in that they communicate among each other as
prescribed in the Grid or in the list below.

From Guy's list, all of Joe's, except the Grid, seem to be Stars. The
Grid itself seems to belong on Guy's list because of inter-job
communication but then again not because it is not meant to be
outside-world-reactional as far as I understand it.

Jobs
- functional
- broadcasting
- receiving (and possibly broadcasting)
- blocking
- - - - - - - -
- pure computation (all input state is available before start)
- receiving from other jobs
- receiving (even indirectly) from out of system

Henning

Jay Nelson

unread,
May 15, 2010, 1:46:49 PM5/15/10
to Erlang Questions
In an earlier thread I alluded to not understanding how to express
concurrency patterns. My belief was that they are more complicated
to categorize and differentiate than design patterns of OO because
they cross the architecture rather than just a single software
module. I wasn't sure what elements were necessary to describe them
and think that is starting to show in the discussion on this thread.
I think it will take a little analysis and negotiation to arrive at a
reasonable description.

Thus far, this thread has mainly enumerated specific instances of
architectures without looking at the attributes and features of the
entire problem space with a goal of identifying the measures which
would differentiate approaches. A proper analysis should result in
some as yet uncommon or undiscovered concurrency patterns by filling
in missing examples on a diagram of possibilities.

To make progress we need to look at the range of features needed to
describe the domain of concurrency patterns. Here are some key
attributes of a concurrent architecture that I can think of offhand,
along with measures of each that could be used to classify them
(considering only process-based concurrency), the main point being
different ways of describing the processes that participate in a
pattern:

Number of processes
- One => all OO design patterns fall in here
- Few => typically task oriented processes with dedicated
functionality
- Dozens => typically partitioned systems with worker pools
- Many => event driven spawning or grid-like computations
- Dynamic => distributed hashes, communal participatory
computation (SETI)

Lifetime of processes
- Static/Infinite => dedicated services essential to working system
- Enduring => self-recovering stable functions (reconnect on
disconnect),
resource pool workers
- Temporary => services needed on demand, caching
- Ephemeral => reactive workers or task-oriented computation
- Dormant => resource conserving (hibernation)

Scheduling of processes
- Once => static service
- Periodic => cron-like services, Comet / AJAX polling
- On Demand => event-driven services

Static Organization of processes
- Network computation topologies => hypercube, star, mesh, ring,
etc.
- Functional relationships => supervisors, serial servers, pub/
sub, pool
- Partitioning => isolation of failure, allocation of resources

Dynamic Organization of processes
- Load adaptive => efficient resource allocating services
- Demand driven => swarm computing
- Migratory => mobile agents, resource discovery, continuation
patterns
- Coordinated => failover / takeover, command and control, leader
election
- Participatory => distributed hash, SETI

Data flow through processes
- Static => networked dataflow, traditional partitioned
- Affinity-based => data flocking for efficient computation
- Routed / Queue distributed => pub/sub, worker pools
- Activation propagation => traditional dataflow, pipeline
- Central distributed => serial server, rendezvous
- Scatter / Gather => map reduce, grid
- Network overlay => SIMD grid

Location of processes
- Static => traditional concurrency
- Dynamic => mobile agents, adaptive load systems
- Admin directed => command and control concurrency
- Resource associated => adaptive load, efficient data computation

Process-Mapped Resource Access
- Continuous => traditional
- Repeated static => pool workers
- Repeated dynamic => event driven, load adaptive
- One-shot => caching, dataflow, tcp proxy

As you can see, the OO patterns eliminate a few dimensions of the
range of possibilities and therefore are presumably easier to describe.

Not all of these features are necessary simultaneously to classify
the entire domain (as seen by some of my overlapping descriptions),
arguments exist for selective sets of them or combinations. I think
that because a concurrent system is more complex and dynamic, it is
necessary to describe an architecture with both static attributes
(typically called "the architecture") and dynamic attributes (process
lifecycle, dataflow) over time.

I think this is a highly relevant thread, and an important topic for
the Concurrency Oriented Programming Language (COPL) community to
come to agreement on. Doing so will advance software and training at
the pace of the multi-core options that are becoming available. I
think this outline could form the basis of a collaborative wiki to
discover a better classification hierarchy.

jay


________________________________________________________________
erlang-questions (at) erlang.org mailing list.
See http://www.erlang.org/faq.html
To unsubscribe; mailto:erlang-questio...@erlang.org

Raoul Duke

unread,
May 16, 2010, 4:43:27 AM5/16/10
to Erlang Questions
On Sat, May 15, 2010 at 10:46 AM, Jay Nelson <j...@duomark.com> wrote:
> To make progress we need to look at the range of features needed to describe
> the domain of concurrency patterns.

+ a bazillion.

in other words, i sometimes think that if you can't draw it up as a
matrix/grid that shows the possible options, you don't understand the
space well enough. it is especially useful/important because - as one
of the Sutherland brothers said - once you can render it thus, you can
see which parts of the grid are empty, and ponder why.

sincerely.

Joe Armstrong

unread,
May 18, 2010, 5:46:41 AM5/18/10
to Jay Nelson, Erlang Questions
There's another problem.

What abstractions do we use to express concurrency in. We can use
spawn send, receive and so on
to model any of the concurrency patterns I mentioned. But these are
rather low level.

pmap and pipe are much higher level.

pmap works just like map only the arguments are evaluated in parallel.

now pmap(fun(X) -> 2*X end, List) doesn't make much sense on a multicore
the overhead of splitting and joining the arguments outweighs the benefit.

pmap(fun(X) -> compile(X) end, List) might make sense - since the work
done is large
compared to the setup overhead.

But wait a moment, do we want to pmap over all elements in a list.
pmap on a thousand element list
will create 1000 parallel processes. On a quad core it might would
make sence to break the list into 4 chunks
(not 1000) - or 8 chunks on a 8-core.

There's no point in expressing the concurrency in an algorithm if the
hardware can't do it anyway
(or perhaps there is?)

Seems like we also need a topology layer te separate the pmap
abstraction from the characteristics
of the underlying hardware, taking into account physical
characteristics of the system. We might
even need a micro-scheduler, just for managing a pmap operation.

Tricky stuff

/Joe
>   - Functional relationships => supervisors, serial servers, pub/sub, pool

Geoffrey Biggs

unread,
May 18, 2010, 5:54:41 AM5/18/10
to erlang-q...@erlang.org
On 18/05/10 18:46, Joe Armstrong wrote:
> There's no point in expressing the concurrency in an algorithm if the
> hardware can't do it anyway
> (or perhaps there is?)

I believe there is. It may not be directly applicable to hardware now,
but it has a high chance of being applicable at some point in the
future. Just make sure it's *implementable* now, even if we don't get
all the benefits that having native hardware support would give, like
true concurrency.

The catch here is that your example of pmap breaking into different
sizes depending on the number of cores is the anti-thesis of what a
pattern should be. It's specifying implementation details. I think that
the patterns should be more abstract than "the task is split up 4 times"
- a better description is "the task is split up n times," with n
determined by the pattern user for their specific needs - hopefully
based on some benchmarking.

Geoff

Guy Wiener

unread,
May 18, 2010, 8:29:07 AM5/18/10
to Erlang Questions
IMHO we need to distinguish between two things: Dependencies and Resources.

Dependencies in a pattern describe the process topology under optimal
circumstances. For example, in pmap, there is no dependency between the
processes. In a pipe, each process depends on the previous. In a star,
processes depend on a central, long-running process, etc.

Resources, as Joe and Geoff mentioned, describe what is available to the
application - Namely, the number of processors and their network topology.

The actual instance of the application is the combination: Assigning
available resources according to the dependencies. Sometime there are
un-used processes (for example, if the pipe is shorter then the available
processors). Sometimes the actual assignment is less then the optimal (for
example, if there are more independent processes then available processors).

Guy.

On Tue, May 18, 2010 at 12:54 PM, Geoffrey Biggs
<geoffre...@aist.go.jp>wrote:

Sean Hernandez

unread,
May 18, 2010, 12:11:45 PM5/18/10
to Guy Wiener, Erlang Questions
I believe more comunications patterns can be found in the following
technical report,

The Landscape of Parallel Computing Research: A View from Berkeley
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html

Robert Virding

unread,
May 18, 2010, 8:39:42 PM5/18/10
to Joe Armstrong, Jay Nelson, Erlang Questions
On 18 May 2010 11:46, Joe Armstrong <erl...@gmail.com> wrote:
> ...
>
> There's no point in expressing the concurrency in an algorithm if the
> hardware can't do it anyway
> (or perhaps there is?)

Yes, there is. If the algorithm is best expressed in a concurrent way
then you should definitely do so, even if the hardware can't do it.
You are saying what you mean. Or it can be a hint to the
implementation, for example using pmap instead of map means that you
feel that the arguments can be evaluated in parallel, it is then up to
the implementation/hardware to do it if possible.

Robert

Joe Armstrong

unread,
May 19, 2010, 3:51:54 AM5/19/10
to Robert Virding, Jay Nelson, Erlang Questions
On Wed, May 19, 2010 at 2:39 AM, Robert Virding <rvir...@gmail.com> wrote:
> On 18 May 2010 11:46, Joe Armstrong <erl...@gmail.com> wrote:
>> ...
>>
>> There's no point in expressing the concurrency in an algorithm if the
>> hardware can't do it anyway
>> (or perhaps there is?)
>
> Yes, there is. If the algorithm is best expressed in a concurrent way
> then you should definitely do so, even if the hardware can't do it.
> You are saying what you mean. Or it can be a hint to the
> implementation, for example using pmap instead of map means that you
> feel that the arguments can be evaluated in parallel, it is then up to
> the implementation/hardware to do it if possible.
>
> Robert
>

Well of course you're right, in a way and wrong. It's actually the old
argument about
top-down programming in a new guise. In top-down programming you're
not supposed to know
what the hardware can actually do. But a quick peep under the covers
isn't a bad idea
if you want to write efficient code.

My current problems are about efficiency and not elegance. The elegant
solution is not
fast enough. There is no apparent natural concurrency to exploit.
Sequential code
that is not fast enough must be speeded up - so we have to find bits
of concurrency in
the middle of the sequential code and map it onto exiting hardware
(warts and all).

Now I guess the high road to do this would be to describe the problem
in a super high level
and then use correctness preserving transformations with hints as to
the pragmatics of
the underlying hardware. The problem, here is that we want the
solution a year ago, and
nobody knows how to walk the high road (well not quite, but this road
is also long and winding,
more of a trail in the forest than a road)

So far explicit reasoning "we have 4 cores, let's special case it like
this" is the only approach
that yields faster systems - we don't have enough experience to
generalise the results and
which is worse, each new generation of multi-cores has different
properties (in terms of inter-core
message passing times etc. ...)

We want to make the stuff go faster on the hardware we have, not write
elegant SW for
hardware that doesn't exists.

(( Do both then - Yea ))

/Joe

Vlad Dumitrescu

unread,
May 19, 2010, 4:12:50 AM5/19/10
to Joe Armstrong, Robert Virding, Jay Nelson, Erlang Questions
On Wed, May 19, 2010 at 09:51, Joe Armstrong <erl...@gmail.com> wrote:
> On Wed, May 19, 2010 at 2:39 AM, Robert Virding <rvir...@gmail.com> wrote:
>> On 18 May 2010 11:46, Joe Armstrong <erl...@gmail.com> wrote:
>>> ...
>>>
>>> There's no point in expressing the concurrency in an algorithm if the
>>> hardware can't do it anyway
>>> (or perhaps there is?)
>>
>> Yes, there is. If the algorithm is best expressed in a concurrent way
>> then you should definitely do so, even if the hardware can't do it.
>> You are saying what you mean. Or it can be a hint to the
>> implementation, for example using pmap instead of map means that you
>> feel that the arguments can be evaluated in parallel, it is then up to
>> the implementation/hardware to do it if possible.
>
> My current problems are about efficiency and not elegance. The elegant
> solution is not
> fast enough. There is no apparent natural concurrency to exploit.
> Sequential code
> that is not fast enough must be speeded up - so we have to find bits
> of concurrency in
> the middle of the sequential code and map it onto exiting hardware
> (warts and all).
> ...
> So far explicit reasoning "we have 4 cores, let's special case it like
> this" is the only approach
> that yields faster systems - we don't have enough experience to
> generalise the results and
> which is worse, each new generation of multi-cores has different
> properties (in terms of inter-core
> message passing times etc. ...)
>
> We want to make the stuff go faster on the hardware we have, not write
> elegant SW for
> hardware that doesn't exists.

Please correct me if i misunderstand, but either you have some
concurrency to exploit or you don't. It doesn't matter if the hardware
has 4 or 44 cores.

IMHO, the first step is (as you started to do) to identify
concurrency. There might be several overlapping and incompatible
alternatives - this is where the current hardware becomes interesting
because different solutions will work on it with different efficiency
grades. In other words, to "special case for 4 cores" implies that
there's something more general available.

best regards,
Vlad

Angel Alvarez

unread,
May 19, 2010, 5:02:06 AM5/19/10
to erlang-q...@erlang.org
El Miércoles, 19 de Mayo de 2010 02:39:42 Robert Virding escribió:
> On 18 May 2010 11:46, Joe Armstrong <erl...@gmail.com> wrote:
> > ...
> >
> > There's no point in expressing the concurrency in an algorithm if the
> > hardware can't do it anyway
> > (or perhaps there is?)
>
> Yes, there is. If the algorithm is best expressed in a concurrent way
> then you should definitely do so, even if the hardware can't do it.
> You are saying what you mean. Or it can be a hint to the
> implementation, for example using pmap instead of map means that you
> feel that the arguments can be evaluated in parallel, it is then up to
> the implementation/hardware to do it if possible.

Again the subttle diference "concurrent vs parallel" where people should think about.

Always you should able to express concurrency (tasks that begin and end while other tasks are still running)
without regard whether the hardware will serialize them, intermix them or run them in parallel.

Where the ability to run things in parallel is the key to performance the ability to express things concurrently
is the key to design complex systems.

I think exposing tasks to the programming enviroment is more important that having constructs that can (not always)
run in parallel.



>
> Robert
>
> ________________________________________________________________
> erlang-questions (at) erlang.org mailing list.
> See http://www.erlang.org/faq.html
> To unsubscribe; mailto:erlang-questio...@erlang.org
>
>



--
No imprima este correo si no es necesario. El medio ambiente está en nuestras manos.
__________________________________________

Clist UAH a.k.a Angel
__________________________________________
Nunca pude estudiar derecho... (El jorobado de Notredame).

Nicolai Waniek

unread,
May 19, 2010, 7:53:09 AM5/19/10
to erlang-q...@erlang.org
[ forwarding my mail here, as it happened that I did not reply to the
list but to Joe direclty :) ]




On 05/19/2010 09:51 AM, Joe Armstrong wrote:
> Well of course you're right, in a way and wrong. It's actually the old
> argument about
> top-down programming in a new guise. In top-down programming you're
> not supposed to know
> what the hardware can actually do. But a quick peep under the covers
> isn't a bad idea
> if you want to write efficient code.

Quite possibly the correct 'solution' (if there is any) of providing a
list of different parallelism pattern is in describing the different
algorithms or patterns used first in a rather mathematical way, looking
at the theoretically possible throughputs or problems that arise and in
a second step looking at the drawbacks that arise with the hardware.

Though this is a bit unrelated as it is not erlang-centered, here's an
example: Though there are lots of parallel algorithms described for
shared memory machines [1], the descriptions are mostly theoretical and
when it comes down to implementing the algorithm, say for GPGPU, you
have severe drawbacks by the concrete underlying hardware. So what
happened to me when searching for a GPGPU algorithm to a rather hard
problem was that the theoretical view helped me to figure out which
already available algorithms would fit and I had an idea on where to
look for, but the lack of description of the problems that arise on
different hardware solutions put me back some time researching how to do
stuff.

In conclusion: Think about describing the patterns in two parts (for
each pattern):

1. Theoretical Pattern
When it comes down to this, every pattern is usually either divide and
conquer or task parallelism. It is quite often possible to determine the
worst case time the algorithm would require for different theoretical
setups.

2. Practical Limitations
This part should mention current (as in 'our hardware at the moment is
this, we don't know what it will be in 10 years') hardware limitations
and quite possible statistics of different problem sizes so that a
reader could decide if the parallelism will be the best way for the
problem at hand.


Describing a parallel pattern should always mention that one has to
figure out if the problem will be best solved in parallel.


regards,
Nicolai


[1] http://techreports.lib.berkeley.edu/accessPages/CSD-88-408.html

Håkan Mattsson

unread,
May 19, 2010, 8:23:01 AM5/19/10
to Joe Armstrong, Robert Virding, Jay Nelson, Erlang Questions
On Wed, May 19, 2010 at 9:51 AM, Joe Armstrong <erl...@gmail.com> wrote:

> We want to make the stuff go faster on the hardware we have,
> not write elegant SW for hardware that doesn't exists.

Right now it feels rather good that you did not have these
thoughts in the ancient single core days when the concurrency
primitives in Erlang was invented. If the main driver for Erlang
concepts would have been speed and not elegance, I suppose
that the Erlang language would look quite different. ;-)

/Håkan
---
Håkan Mattsson
Tail-f Systems

Joe Armstrong

unread,
May 19, 2010, 9:21:17 AM5/19/10
to Vlad Dumitrescu, Robert Virding, Jay Nelson, Erlang Questions
In my case the problem is given, and the hardware is given.
It matters very much how many cores there are and their physical
characteristics,
since I have to map the problem onto a set of concurrent processes that
run on the given hardware.


>
> IMHO, the first step is (as you started to do) to identify
> concurrency.

Done that - the #concurrent processes is far greater than the number of cores

> There might be several overlapping and incompatible
> alternatives - this is where the current hardware becomes interesting
> because different solutions will work on it with different efficiency
> grades. In other words, to "special case for 4 cores" implies that
> there's something more general available.

In my case it's because I have a 4 core computer.

Joe Armstrong

unread,
May 19, 2010, 9:35:22 AM5/19/10
to Håkan Mattsson, Robert Virding, Jay Nelson, Erlang Questions
No no no ....

Thus spake the great progenitors

1) first make it right
2) then make it fast
3) keep it right while making it fast

I've done (1) and it wasn't fast enough - now I'm into 2 and 3.

If you point me at a formal and provably correct way of doing 2 and 3
I'll start using it.
Right now we're doing 2 and 3 by hand. When we have enough experience of 2 and 3
we might be able to generalise the results.

This thread started since I wanted to classify the kind of concurrency problems
and put them into tight sub-groups so I could see how they mapped onto the
hardware we have today.

I have a specific problem in mind, at a top level I can use pipeline
concurrency,
but that does not solve the problem since some of the bits in the pipeline are
bottlenecks - and these bits are not naturally concurrent - they use maps etc.
which I could change to pmaps, but this has to be done with great
care, since it could
easily make matters worse.

And as for the good old days I was "obsessed by efficiency" as I am today.

I still strive to make everything "as inefficient as possible" (ie as
beautiful and clear as possible)
subject to the condition that it is "fast enough" - it must of course
be "fast enough and no faster".

'trouble is project managers do not know the meaning of the expression
"fast enough"

/Joe


2010/5/19 Håkan Mattsson <h...@tail-f.com>:

Joe Armstrong

unread,
May 19, 2010, 9:50:20 AM5/19/10
to Erlang

Eric Newhuis (personal)

unread,
May 19, 2010, 9:57:14 AM5/19/10
to Joe Armstrong, Erlang
What about leased state concurrency? I think I made that phrase up. I'm thinking Linda and tuplespaces and things like async versions of Javaspaces. ...all having a reduced instruction set. Are these covered in your current list?

Vlad Dumitrescu

unread,
May 19, 2010, 10:07:58 AM5/19/10
to Joe Armstrong, Robert Virding, Jay Nelson, Erlang Questions
On Wed, May 19, 2010 at 15:21, Joe Armstrong <erl...@gmail.com> wrote:
> In my case the problem is given, and the hardware is given.
> It matters very much how many cores there are and their physical
> characteristics,
> since I have to map the problem onto a set of concurrent processes that
> run on the given hardware.

I see, I think there are two issues that got mixed up (at least in my
mind they did). You started the discussion asking about identifying
patterns which are abstract and general. This is different than the
latter issue about pmap and number of cores, I think.

Jay Nelson

unread,
May 19, 2010, 11:29:33 AM5/19/10
to Erlang Questions, Joe Armstrong, Vlad Dumitrescu, Robert Virding

On May 19, 2010, at 7:07 AM, Vlad Dumitrescu wrote:

> I see, I think there are two issues that got mixed up (at least in my
> mind they did). You started the discussion asking about identifying
> patterns which are abstract and general. This is different than the
> latter issue about pmap and number of cores, I think.

Indeed!

I thought the thread was about the theoretical and practical task of
identifying and describing concurrent patterns. I had wanted to
start a thread on that topic, but wasn't ready to put the thought
into it so it was a fortunate occurrence when Joe asked about it.
The issues related to the classification scheme are:

1) Identifying and organizing the types of concurrency
2) Expressing the patterns in a standard way so they can be referenced
3) Coming up with illustrative problems and code solutions
4) Identifying the factors which affect the performance of each

A second topic cropped up as Joe thought about his problem in
relation to all the abstract patterns:

> There's no point in expressing the concurrency in an algorithm if the
> hardware can't do it anyway
> (or perhaps there is?)

This question has both theoretical and practical considerations. If
you are interested in defining and describing concurrency, then it
should be expressed as several models with the tradeoffs in
performance that each imply (can it scale up or down, how does it
deal with failure, partial success and recovery, is it manageable as
a system...); however, if you are interested in raw, practical
performance concurrency is not the goal of the problem, but rather
one possible solution.

There is another side of this and it relates to how one develops
software. In the past, we expected compilers and serial performance
to improve over time. Code is written and just automatically runs
faster with newer technology. Just because we are writing concurrent
software, we shouldn't abandon this strategy. Software written for 4
cores today should run fast enough, but I would hope for it to run
faster on 8, 16 or more cores if the problem can scale up (it may run
faster for the same size problem if the granularity is high, but
should definitely run faster for larger problems if you can easily
increase the granularity with problem size, which is often done by
replicating tasks across nodes as map/reduce does).

[By granularity I mean the number of concurrent subtasks, or as the
inverse size of computation for each subtask. Think grains of sand
getting through an hour glass, rather than pebbles.]

Finally, Joe's real motivation is a problem that he can't solve fast
enough right now on 4 cores:

> I have to map the problem onto a set of concurrent processes that
> run on the given hardware.

There are several issues tied up in this statement: the need for a
solution now, the problem of what to do in a future case of the same
dilemma and whether theory helps at all.

Practical:

I would find as much concurrency as is possible, implement that and
measure. Coalesce concurrency where it introduces bottlenecks (too
many messages in queue, send as a block or merge two processes to
eliminate message queue; parallel map => chunked map => serial map).
Iterate measuring and trying alternatives.

If all coalesces to a serial algorithm, get a faster serial computer
or redefine the problem so that results may emerge from several
places in the system rather than after all computation. Another
alternative is to pay very special care to the CPU cache usage and
reformulate the problem to eliminate cache-line faults in low-level
computation. This can have a dramatic effect on performance. It may
require using NIFs to drop into C for critical sections where you can
improve things 100-fold with proper cache usage.

Another approach is to use core-oblivious algorithms which attempt to
use as much concurrency as they can discover. There is a similar
approach to CPU Cache usage http://en.wikipedia.org/wiki/
Cache_oblivious which essentially does recursive dataset splitting
until a size is obtained which fits inside your cache. The
reassembly of the data will make maximal use of the cache on various
platforms without specifying specific parameters (you make an
assumption that, for example, 64 bytes of data will definitely fit in
the cache and bottom out with an optimal implementation of the 64-
byte case).

It sounds like you could use a core-oblivious algorithm with cache-
oblivious algorithms in each process, with autodetection (hopefully
performance observation rather than poking the hardware, since there
may be hyper-threading, etc) causing you to adjust the number of core-
processes.

Future:

Compilers / runtimes should do the work of above. Java Hotspot
analyzes execution and reorders instructions. We should be able to
detect message queue backups or parallel maps or pipelines that have
slow spots and rearrange the code to do something equivalent in a
more efficient way. The code should be written once and the runtime
should adapt to the hardware as best it can through observation,
measurement and tuning in real time.

Theory:

All this practical work, plus a proper classification reference (even
if not truly hierarchical), could provide the basis for a new
approach to the VM, new methods of expressing data or processes, and
new techniques for implementing concurrency vs. serial execution (I
imagine something like an accordion that runs serial when compressed,
runs concurrently when stretched and feedback measurements press the
keys to change the tune).


Nicolai Waniek wrote:

> Describing a parallel pattern should always mention that one has to
> figure out if the problem will be best solved in parallel.

I think a guide to concurrency should be just that. The reader is
expected to reference other guides to consider alternatives. The OO
Patterns book doesn't ask if the problem can be expressed as OO, it
assumes the path chosen and then provides the alternatives.

jay

Sean Hernandez

unread,
May 19, 2010, 3:34:29 PM5/19/10
to Jay Nelson, Erlang Questions, Joe Armstrong, Vlad Dumitrescu, Robert Virding
> Another approach is to use core-oblivious algorithms which attempt to use as much concurrency as they can discover.

> Future:

> Compilers / runtimes should do the work of above. Java Hotspot analyzes execution and reorders instructions. We should be able to detect message queue backups or parallel maps or pipelines that have slow spots and rearrange the code to do something equivalent in a more efficient way. The code should be written once and the runtime should adapt to the hardware as best it can through observation, measurement and tuning in real time.

I think there really separate things: (1) discoverying a machine's
processor, core and memory topology and (2) scheduling instructions
within a core.

Discovering a machine's processor, core and memory topology is
important because of the asymmetry of latencies when accessing memory
that is local to a core or processor vs. memory that is remote (i.e.
NUMA). When you've minimized the probability of accessing remote
memory. Now you can minimize instruction cache misses (can't do
anything for the data cache) using an opportunistic scheduler that
"reuses" instructions that has previously executed (resulting in
out-of-order execution).

These are really about (1) discovering the core topology of the
machine the runtime is executing on taking into account latencies, (2)
affinitizing schedulers for
> Another approach is to use core-oblivious algorithms which attempt to use as much concurrency as they can discover.  There is a similar approach to CPU Cache usage http://en.wikipedia.org/wiki/Cache_oblivious which essentially does recursive dataset splitting until a size is obtained which fits inside your cache.  The reassembly of the data will make maximal use of the cache on various platforms without specifying specific parameters (you make an assumption that, for example, 64 bytes of data will definitely fit in the cache and bottom out with an optimal implementation of the 64-byte case).
>
> It sounds like you could use a core-oblivious algorithm with cache-oblivious algorithms in each process, with autodetection (hopefully performance observation rather than poking the hardware, since there may be hyper-threading, etc) causing you to adjust the number of core-processes.

Sean Hernandez

unread,
May 19, 2010, 4:17:22 PM5/19/10
to Jay Nelson, Erlang Questions, Joe Armstrong, Vlad Dumitrescu, Robert Virding
But in order to do adaptive execution a model of your execution
environment needs to be built (e.g. processor, core, memory (this
includes cache sizes too btw), execution latencies, queue lengths,
similarity of instructions (instruction caches misses, etc) up so that
adaptation algorithms can figure out what to tweak.

Maybe these are the elements of the provably correct model that Joe is
looking for?

On Wed, May 19, 2010 at 2:56 PM, Jay Nelson <j...@duomark.com> wrote:
>> These are really about (1) discovering the core topology of the
>> machine the runtime is executing on taking into account latencies, (2)
>> affinitizing schedulers for
>
> An alternative is to watch the execution, reducing the data set size, until
> a speed up is seen.  In this way you don't optimize based on how it _should_
> work, but on how the system is actually performing.  Once the optimal data
> set size is determined, the data can be automatically chopped up
> accordingly.
>
> Better to detect cache (or who knows what future technology) speed up so
> that the code is automatically adaptive on alternative hardware.  Likewise,
> you could build a large queue of tasks, then spawn new workers periodically
> measuring number of tasks / unit of time.  At some threshold the overhead
> will show up and slow down progress, so throttle back the number of open
> processes in the worker pool, dynamically deciding based performance rather
> than number of cores, or schedulers or what have you.
>
> jay
>
>

________________________________________________________________
erlang-questions (at) erlang.org mailing list.
See http://www.erlang.org/faq.html
To unsubscribe; mailto:erlang-questio...@erlang.org

Jay Nelson

unread,
May 19, 2010, 3:56:06 PM5/19/10
to Erlang Questions, Joe Armstrong, Sean Hernandez, Vlad Dumitrescu, Robert Virding
> These are really about (1) discovering the core topology of the
> machine the runtime is executing on taking into account latencies, (2)
> affinitizing schedulers for

An alternative is to watch the execution, reducing the data set size,
until a speed up is seen. In this way you don't optimize based on
how it _should_ work, but on how the system is actually performing.
Once the optimal data set size is determined, the data can be
automatically chopped up accordingly.

Better to detect cache (or who knows what future technology) speed up
so that the code is automatically adaptive on alternative hardware.
Likewise, you could build a large queue of tasks, then spawn new
workers periodically measuring number of tasks / unit of time. At
some threshold the overhead will show up and slow down progress, so
throttle back the number of open processes in the worker pool,
dynamically deciding based performance rather than number of cores,
or schedulers or what have you.

jay


________________________________________________________________
erlang-questions (at) erlang.org mailing list.
See http://www.erlang.org/faq.html
To unsubscribe; mailto:erlang-questio...@erlang.org

Jay Nelson

unread,
May 19, 2010, 6:07:42 PM5/19/10
to Erlang Questions, Sean Hernandez, Joe Armstrong, Vlad Dumitrescu, Robert Virding

On May 19, 2010, at 1:17 PM, Sean Hernandez wrote:

> But in order to do adaptive execution a model of your execution
> environment needs to be built (e.g. processor, core, memory (this
> includes cache sizes too btw), execution latencies, queue lengths,
> similarity of instructions (instruction caches misses, etc) up so that
> adaptation algorithms can figure out what to tweak.

If you are just copying data from one array to another, you can
tell when a cache boundary is tripped by the slowdown in the
performance. If you write progressively more data, measuring
the time lapse, you can find the array size that first causes the
slowdown. You don't need to do any complicated modeling.
Then you can make your divide and conquer algorithm choose
the maximum size for each of the divide elements to get maximal
performance.

As I said earlier, the same thing can be done with a task queue
and a worker pool. If there are 1M tasks over an hour, it is easy
to adjust the number of workers every 10 seconds for the first 10
minutes to get a feel for performance. Then choose the optimal
number of workers for the rest of the time.

You just need to have elements of your algorithm that can be
tuned (e.g., number of processes, data sequence size in divide
and conquer, message batch size, etc).

French, Mike

unread,
May 20, 2010, 6:06:13 AM5/20/10
to Joe Armstrong, Erlang

Also see this work on nested data parallelism for Haskell:

http://haskell.org/haskellwiki/GHC/Data_Parallel_Haskell

there are two good presentations linked in the Further Reading section.

Mike


> -----Original Message-----
> From: erlang-q...@erlang.org
> [mailto:erlang-q...@erlang.org]On
> Behalf Of Joe Armstrong
> Sent: 19 May 2010 14:50
> To: Erlang
> Subject: [erlang-questions] Re: Classification of concurrency patterns
>
>
> Found these they are looking at a similar problem
>
> http://www.drdobbs.com/architecture-and-design/224900184
> http://parlab.eecs.berkeley.edu/wiki/patterns/patterns
>
> /Joe

Thales UK Ltd (Wells) DISCLAIMER: The information contained in this e-mail
is confidential. It may also be legally privileged. It is intended only for
the stated addressee(s) and access to it by any other person is
unauthorised. If you are not an addressee, you must not disclose, copy,
circulate or in any other way use or rely on the information contained in
this e-mail. Such unauthorised use may be unlawful. We may monitor all
e-mail communications through our networks. If you have received this e-mail
in error, please inform us immediately on sender's telephone number above
and delete it and all copies from your system. We accept no responsibility
for changes to any e-mail which occur after it has been sent. Attachments
to this e-mail may contain software viruses which could damage your system.
We therefore recommend you virus-check all attachments before opening.
Thales UK Ltd. Registered Office: 2 Dashwood Lang Road, The Bourne Business
Park, Addlestone, Weybridge, Surrey KT15 2NX Registered in England No.
868273
Reply all
Reply to author
Forward
0 new messages