
Concurrency in the functional world


Arne Hanssen

Nov 16, 2007, 5:18:48 PM
One popular claim found in the literature about functional programming is
that programs written in this class of languages are very suitable for
concurrent execution or interpretation, due to factors like
referential transparency and the Church-Rosser theorem.

As an outsider I am wondering how concurrency is treated in the
functional world. Is it a major research field, do people think about
it when programming, and do the languages offer concurrent features? I
guess this last part excludes Erlang, since it's so heavily geared
towards concurrency at its core.

And what impact will the, let's call it "multi-core trend", have on
functional programming and programming languages?


Cheers
-arne

Jon Harrop

Nov 16, 2007, 7:51:49 PM
Arne Hanssen wrote:
> As an outsider I am wondering how concurrency is treated in the
> functional world. Is it a major research field, do people think about
> it when programming, and do the languages offer concurrent features? I
> guess this last part excludes Erlang, since it's so heavily geared
> towards concurrency at its core.

I've worked on concurrent OCaml and F# applications with great success. The
approaches are quite different but the ability to write in a purely
functional style is hugely beneficial.

> And what impact will the, let's call it "multi-core trend", have on
> functional programming and programming languages?

Unlike many other languages, functional languages will survive the
transition to multicore.

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u

Paul Rubin

Nov 16, 2007, 8:06:44 PM
Jon Harrop <use...@jdh30.plus.com> writes:
> Unlike many other languages, functional languages will survive the
> transition to multicore.

Cool. Still waiting to find out how. The traditional stuff with
threads and exceptions seems doomed, even using STM.

Jon Harrop

Nov 16, 2007, 8:12:30 PM

Well, depends what you're doing, of course, but for IO-bound concurrency
everyone is basically copying Erlang sans the ability to do massive
concurrency. For CPU-intensive concurrency, a few basic thread-based HOFs
are doing me proud.
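
A minimal Haskell sketch of what such a thread-based HOF can look like
(the name parMapIO is invented here; this is not Jon's actual code):

import Control.Concurrent
import Control.Exception (evaluate)

-- A naive parallel map: evaluate f on each element in its own thread,
-- then collect the results in order.
parMapIO :: (a -> b) -> [a] -> IO [b]
parMapIO f xs = mapM spawn xs >>= mapM takeMVar
  where
    spawn x = do
      slot <- newEmptyMVar
      _ <- forkIO (evaluate (f x) >>= putMVar slot)
      return slot

main :: IO ()
main = parMapIO (* 2) [1 .. 10 :: Int] >>= print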

Arne Hanssen

Nov 17, 2007, 3:39:36 AM
In article <13jsfa3...@corp.supernews.com>,

Jon Harrop <use...@jdh30.plus.com> wrote:

> I've worked on concurrent OCaml and F# applications with great success. The
> approaches are quite different but the ability to write in a purely
> functional style is hugely beneficial.

Can you outline or explain what this difference you encountered is? My
initial thought is that, given the properties of functional programming,
concurrency is something that the compiler or the interpreter should be
able to deal with to some extent.

> Unlike many other languages, functional languages will survive the
> transition to multicore.

Because?
-arne

Jon Harrop

Nov 17, 2007, 9:39:14 AM
Arne Hanssen wrote:
> In article <13jsfa3...@corp.supernews.com>,
> Jon Harrop <use...@jdh30.plus.com> wrote:
>> I've worked on concurrent OCaml and F# applications with great success.
>> The approaches are quite different but the ability to write in a purely
>> functional style is hugely beneficial.
>
> Can you outline or explain what this difference you encountered is?

I already blogged it:

http://flyingfrogblog.blogspot.com/2007/07/concurrency.html

> My
> initial thought is that, given the properties of functional programming,
> concurrency is something that the compiler or the interpreter should be
> able to deal with to some extent.

I don't think direct compiler support is necessary.

>> Unlike many other languages, functional languages will survive the
>> transition to multicore.
>
> Because?

Functional programs are very easy to parallelize.

For example, we've been working on a high-performance hardware-accelerated
vector graphics library for many years:

http://www.ffconsultancy.com/products/smoke_vector_graphics/

The first version was written in C++. Then we rewrote the whole product in
OCaml. We recently ported it to F# and added support for concurrency.
Adding concurrency required less than a dozen lines of code (this is a
250kLOC project) and the results scale extremely well.

Marshall

Nov 17, 2007, 10:56:59 AM

You have to have *some* way of handling mutation, or
else you are limiting the classes of applications you can
write. Any data management application needs some
way to *manage* the data. In that area, transactions*
are the clear winner.


Marshall

PS. Why does it suddenly have to have a three letter
acronym? It's been "transactions" for decades, and now
it's all trendy to call it "STM." Very annoying.</rant>

Florian Weimer

Nov 17, 2007, 11:29:15 AM
* Marshall:

> PS. Why does it suddenly have to have a three letter
> acronym? It's been "transactions" for decades, and now
> it's all trendy to call it "STM."

STM typically has two facilities that traditional transaction processing
lacks, called "retry" and "orElse" in the GHC implementation. "retry"
is a hack which probably doesn't scale that well. "orElse" adds
non-determinism which could be a good idea--or not.

The main criticism of STM is that it promotes the wrong model: shared
state which is concurrently updated.
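
For readers who have not seen GHC's combinators, here is a minimal
Haskell sketch of how "retry" and "orElse" compose (the stack example
is invented for illustration):

import Control.Concurrent.STM

-- Pop from a TVar-held stack, blocking via retry while it is empty.
pop :: TVar [a] -> STM a
pop v = do
  xs <- readTVar v
  case xs of
    []      -> retry                  -- abort; re-run when v changes
    (x : t) -> writeTVar v t >> return x

-- Take from whichever of two stacks has data, preferring the first.
popEither :: TVar [a] -> TVar [a] -> STM a
popEither a b = pop a `orElse` pop b

main :: IO ()
main = do
  a <- newTVarIO []
  b <- newTVarIO "hello"
  atomically (popEither a b) >>= print   -- 'h', taken from b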

Paul Rubin

Nov 17, 2007, 11:34:16 AM
Florian Weimer <f...@deneb.enyo.de> writes:
> The main criticism of STM is that it promotes the wrong model: shared
> state which is concurrently updated.

I keep hearing that, but I still don't hear what the alternative is
that still manages to use the capabilities of the hardware.

Florian Weimer

Nov 17, 2007, 11:41:27 AM
* Paul Rubin:

Copy-based message-passing among different processes (= threads with
separate address spaces), for CPU-bound tasks. For mostly-blocking I/O
dispatching, lightweight threads.

I don't know what to do in the case that is bound by per-CPU I/O
bandwidth; this is difficult to parallelize using any method.

Vesa Karvonen

Nov 17, 2007, 11:59:41 AM
Marshall <marshal...@gmail.com> wrote:
[...]

> You have to have *some* way of handling mutation, or else you are
> limiting the classes of applications you can write. Any data
> management application needs some way to *manage* the data. In that
> area, transactions* are the clear winner.

Yeah, that is the current hype. Note that you can implement mutable
ref cells using threads and message passing. In general, if you need
to share a particular data structure, you can make it so that a single
thread owns that data structure and other threads access/update it
through messages sent to that thread. No mutation needed.

-Vesa Karvonen
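
For concreteness, here is a minimal sketch of that owner-thread pattern
in Haskell (the request type and the Int-set payload are invented for
the example):

import Control.Concurrent (forkIO)
import Control.Concurrent.Chan
import Control.Concurrent.MVar
import qualified Data.Set as Set

-- Requests that other threads may send to the set's owner.
data Req = Insert Int | Member Int (MVar Bool)

-- The owner thread: the set lives only in this loop's argument, so an
-- "update" is just a tail call with a new immutable set. No mutation.
owner :: Chan Req -> Set.Set Int -> IO ()
owner chan s = do
  req <- readChan chan
  case req of
    Insert x      -> owner chan (Set.insert x s)
    Member x resp -> putMVar resp (Set.member x s) >> owner chan s

main :: IO ()
main = do
  chan <- newChan
  _ <- forkIO (owner chan Set.empty)
  writeChan chan (Insert 42)
  resp <- newEmptyMVar
  writeChan chan (Member 42 resp)   -- requests are served in order
  takeMVar resp >>= print           -- True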

Paul Rubin

Nov 17, 2007, 12:14:57 PM
Vesa Karvonen <vesa.k...@cs.helsinki.fi> writes:
> Yeah, that is the current hype. Note that you can implement mutable
> ref cells using threads and message passing. In general, if you need
> to share a particular data structure, you can make it so that a single
> thread owns that data structure and other threads access/update it
> through messages sent to that thread. No mutation needed.

Right, but now every memory access is potentially replaced by
kernel-mediated message passing with context switches, etc. There is
a reason we keep buying ram instead of putting the data on much
cheaper disk. It's because disk is orders of magnitude slower.
There's a similar situation with message passing.

Stephen J. Bevan

Nov 17, 2007, 1:53:45 PM
Paul Rubin <http://phr...@NOSPAM.invalid> writes:
> Vesa Karvonen <vesa.k...@cs.helsinki.fi> writes:
>> Yeah, that is the current hype. Note that you can implement mutable
>> ref cells using threads and message passing. In general, if you need
>> to share a particular data structure, you can make it so that a single
>> thread owns that data structure and other threads access/update it
>> through messages sent to that thread. No mutation needed.
>
> Right, but now every memory access is potentially replaced by
> kernel-mediated message passing with context switches, etc.

Under Linux you can implement interprocess message passing using a
futex to protect the message queue and shared memory to hold the
messages. In this case context switching only occurs if there is
contention on the futex or you don't have enough physical CPUs to
support the processes you want to run. Both issues can occur in a
concurrent shared memory program with application-specific locks. So,
the difference would come down to lock contention rates and some
additional memory and time to handle the queues.

> There is a reason we keep buying ram instead of putting the data on
> much cheaper disk. It's because disk is orders of magnitude slower.
> There's a similar situation with message passing.

IMHO writing shared memory concurrent programs, like writing
assembler, requires more skill than writing sequential high-level
programs. Many think they have the skill, but most don't. The result
can be programs that fail or lock up in hard-to-reproduce situations,
though often the customer can reproduce it regularly, to their great
annoyance. Faced with that kind of problem, message passing can look
a lot more attractive.

Vesa Karvonen

Nov 17, 2007, 2:33:49 PM
Paul Rubin <http://phr...@nospam.invalid> wrote:
> Vesa Karvonen <vesa.k...@cs.helsinki.fi> writes:
> > Yeah, that is the current hype. Note that you can implement mutable
> > ref cells using threads and message passing. In general, if you need
> > to share a particular data structure, you can make it so that a single
> > thread owns that data structure and other threads access/update it
> > through messages sent to that thread. No mutation needed.

> Right, but now every memory access is potentially replaced by
> kernel-mediated message passing with context switches, etc.

You are totally wrong. First of all, the use of message passing
concurrency does not require changing the way memory accesses within a
thread are performed (which is almost all of them). More importantly,
there is no fundamental reason why passing a message from one thread
to another would have to be inefficient. What you are describing
corresponds to a naive heavy-weight implementation.

> There is a reason we keep buying ram instead of putting the data on
> much cheaper disk. It's because disk is orders of magnitude slower.
> There's a similar situation with message passing.

Your analogy is bogus. Message passing between threads that have
shared memory can be made very efficient (i.e. no extra copying, just
update message queue with pointer to immutable message and signal the
recipient) --- orders of magnitude faster than passing messages
between separate processes (without shared memory) or machines.

-Vesa Karvonen
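
As a small illustration of that claim, a Haskell MVar is essentially
such a mailbox: sending is one pointer write plus a wakeup, and the
immutable payload is never copied (the strings are placeholders):

import Control.Concurrent

main :: IO ()
main = do
  box  <- newEmptyMVar
  done <- newEmptyMVar
  _ <- forkIO $ do
         msg <- takeMVar box           -- blocks until signalled
         putStrLn ("got: " ++ msg)     -- payload shared, not copied
         putMVar done ()
  putMVar box "hello"                  -- pointer write plus wakeup
  takeMVar done                        -- wait for the receiver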

David B. Benson

Nov 17, 2007, 2:44:08 PM
On Nov 16, 2:18 pm, Arne Hanssen <arne.hansenNOS...@runbox.com>
wrote:
> ...

> As an outsider I am wondering how concurrency is treated in the
> functional world. Is it a major research field, do people think about
> it when programming, and do the languages offer concurrent features?

Some research: the 2007 ICFP seems to have one paper (which addresses
multi-core parallelism with an ML-like language).

I program in SML/NJ using John Reppy's Concurrent ML (on a single-core
machine) for problems which could conceivably be solved in another
manner. The threads are light-weight and message passing provides
both design convenience and reliability at a small expense of run-time
'inefficiency'.

SML/NJ offers no concurrent features, per se. Reppy simply coded his
Concurrent ML in SML/NJ, making considerable use of the callcc
available in SML/NJ (and not strictly available in a compiler
conforming to the Standard ML specification).

Paul Rubin

Nov 17, 2007, 6:03:23 PM
Vesa Karvonen <vesa.k...@cs.helsinki.fi> writes:
> Your analogy is bogus. Message passing between threads that have
> shared memory can be made very efficient (i.e. no extra copying, just
> update message queue with pointer to immutable message and signal the
> recipient) --- orders of magnitude faster than passing messages
> between separate processes (without shared memory) or machines.

If it's that simple there must be lots of systems that actually do it
that way. So I'd be interested in seeing some examples, especially
examples with measured performance numbers. I'm concerned not only
with the overhead of actual message passing, but with the latency of
waiting for a response from the other process.

I see the problem of concurrency as being like storage allocation, in
that doing it manually (even with transactions) leads to intricate and
bug-prone code. The manual storage allocation problem is solved by
garbage collection at a performance cost of X percent, where in the
functional programming context we mostly agree that X is tolerable in
real applications (let's say it's observably in general no worse than
a factor of 2 compared with manual allocation), so we all use GC.

What I'm skeptical of is a claim that message passing can solve
concurrency without imposing a much higher number for X, compared with
using transactions or fine grained locks. A 2x penalty might well be
tolerable but 20x or 100x might well not be. If we're talking about
heavyweight IPC with system calls, it's probably more like 1000x. So
measured numbers from the lightweight approach would be very
interesting.

Paul Rubin

Nov 17, 2007, 6:16:24 PM
ste...@dino.dnsalias.com (Stephen J. Bevan) writes:
> IMHO writing shared memory concurrent programs, like writing
> assembler, requires more skill than writing sequential high-level
> programs. Many think they have the skill, but most don't. The result
> can be programs that fail or lock up in hard-to-reproduce situations,
> though often the customer can reproduce it regularly, to their great
> annoyance. Faced with that kind of problem, message passing can look
> a lot more attractive.

I agree with this, but the pure performance cost of writing in
high-level languages instead of assembler has been quantified and
found to not be too bad.
was found to be no worse than, say, a factor of 2. The same thing
happened with manual storage allocation vs. garbage collection. Maybe
something similar happens with functional vs. imperative languages.
I'll invent a rule of thumb: a new solution is good when it cleans up
the mess and bugginess of the old approach and costs no worse than a
factor of 2 in performance. Are you saying that message passing with
futexes comes within a factor of 2 of shared memory concurrency,
including on memory-access-intensive classes of problems?

Of course I'm not saying that the bug-prone, shared memory approach is
the solution. Rather, we may still be in a situation where no good
solution exists.

Jon Harrop

Nov 17, 2007, 7:37:40 PM
Vesa Karvonen wrote:
> Yeah, that is the current hype. Note that you can implement mutable
> ref cells using threads and message passing. In general, if you need
> to share a particular data structure, you can make it so that a single
> thread owns that data structure and other threads access/update it
> through messages sent to that thread. No mutation needed.

How is updating a data structure not mutation?

Vesa Karvonen

Nov 17, 2007, 7:48:18 PM
Paul Rubin <http://phr...@nospam.invalid> wrote:
[...]

> I'm concerned not only with the overhead of actual message passing,
> but with the latency of waiting for a response from the other
> process.
[...]

> What I'm skeptical of is a claim that message passing can solve
> concurrency without imposing a much higher number for X, compared
> with using transactions or fine grained locks. A 2x penalty might
> well be tolerable but 20x or 100x might well not be. If we're
> talking about heavyweight IPC with system calls, it's probably more
> like 1000x. So measured numbers from the lightweight approach would
> be very interesting.

Just a quick note before I go to sleep. I think that you have in your
head a wrong picture of what message passing concurrency means. I
would recommend that you take a look at Concurrent ML (CML), for
example. In general (I'm not talking specifically about CML), a
message can be a value of (roughly) any type (an int, the unit value,
an arbitrarily complex data structure). Messages can be sent, for
example, through "synchronous variables" (e.g. m-vars). In terms of
fine grained locks, sending a message doesn't have to cost any more
(and can even be made cheaper when lower level operations are
available) than roughly acquiring a mutex, writing a pointer to the
message value (or just writing the message value if it is roughly the
same size as a pointer and unboxed), signaling a condition variable,
and releasing the mutex.

-Vesa Karvonen

David B. Benson

Nov 17, 2007, 7:48:31 PM
On Nov 17, 3:03 pm, Paul Rubin ...

>
> What I'm skeptical of is a claim that message passing can solve
> concurrency without imposing a much higher number for X, compared with
> using transactions or fine grained locks.

Here is a benchmark of sorts ("Sending values between many threads"):

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=message&lang=all

Although this doesn't directly answer your question.

Vesa Karvonen

Nov 17, 2007, 7:55:42 PM
Jon Harrop <use...@jdh30.plus.com> wrote:
> Vesa Karvonen wrote:
> > Yeah, that is the current hype. Note that you can implement mutable
> > ref cells using threads and message passing. In general, if you need
> > to share a particular data structure, you can make it so that a single
> > thread owns that data structure and other threads access/update it
> > through messages sent to that thread. No mutation needed.

> How is updating a data structure not mutation?

How is a coroutine that adds a key to a set implemented as a
persistent AVL-tree not using mutation?

Why don't you go and educate yourself by reading Reppy's Concurrent
Programming in ML [1], for example. You might just save yourself from
the hubris of reinventing the wheel.

-Vesa Karvonen

[1] http://lambda-the-ultimate.org/node/2499#comment-38108

Paul Rubin

Nov 17, 2007, 8:06:15 PM
"David B. Benson" <dbe...@eecs.wsu.edu> writes:
> Here is a benchmark of sorts.
>
> <a href="http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?
> test=message&lang=all"> Sending values between many threads</a>

Thanks, that's a start, though I don't see anything there about
multiple cpu's and I don't see any comparison with directly accessing
shared memory.

Jon Harrop

Nov 17, 2007, 8:02:44 PM
Paul Rubin wrote:
> Right, but now every memory access is potentially replaced by
> kernel-mediated message passing with context switches, etc. There is
> a reason we keep buying ram instead of putting the data on much
> cheaper disk. It's because disk is orders of magnitude slower.
> There's a similar situation with message passing.

Yes. For example, OCaml and F# provide different approaches to concurrency
that typically boil down to equivalent message passing. The underlying
technology used to implement this is likely to produce huge differences in
real-time performance for applications requiring fine-grained concurrency
but I have not yet made any measurements, except that .NET locks cost
around 1µs, threads around 1ms, and Unix fork around 1cs (10ms).

Until someone presents quantitative benchmark results there is no point in
speculating about the size of the trade-offs involved.

Lauri Alanko

Nov 17, 2007, 8:12:21 PM
In article <13jv2rf...@corp.supernews.com>,

Jon Harrop <use...@jdh30.plus.com> wrote:
> How is updating a data structure not mutation?

Here's a ref cell in Erlang:

1> import(ref).
ok
2> R = ref:new(foo).
<0.33.0>
3> ref:get(R).
foo
4> ref:set(R, bar).
{set,bar}
5> ref:get(R).
bar

So it acts just as if it were a mutable data structure. Here's the
implementation:


-module(ref).
-export([new/1, set/2, get/1, run/1]).

run(X) ->
    receive
        {From, get} ->
            From ! {self(), X},
            run(X);
        {set, Y} ->
            run(Y)
    end.

new(X) ->
    spawn(ref, run, [X]).

set(R, X) ->
    R ! {set, X}.

get(R) ->
    R ! {self(), get},
    receive
        {R, V} -> V
    end.


This means that a ref cell is a process that runs in a loop
continuously, handling get- and set-requests. Upon a set-request, it
just tail-calls itself with a new value. The "state" of the cell is in
the environment of the function. So there's no more mutation than in
an ordinary tail-recursive loop.


Lauri

Jon Harrop

Nov 17, 2007, 8:17:06 PM
Vesa Karvonen wrote:
> Jon Harrop <use...@jdh30.plus.com> wrote:
>> Vesa Karvonen wrote:
>> > Yeah, that is the current hype. Note that you can implement mutable
>> > ref cells using threads and message passing. In general, if you need
>> > to share a particular data structure, you can make it so that a single
>> > thread owns that data structure and other threads access/update it
>> > through messages sent to that thread. No mutation needed.
>
>> How is updating a data structure not mutation?
>
> How is a coroutine that adds a key to a set implemented as a
> persistent AVL-tree not using mutation?

Does the implementation require a concurrent GC? If so, has anyone managed
to make one without sacrificing a lot of performance?

Are there any implementations and examples that I can run on x86-64 to
obtain quantitative performance results and compare with other systems?

> Why don't you go and educate yourself by reading Reppy's Concurrent
> Programming in ML [1], for example. You might just save yourself from
> the hubris of reinventing the wheel.

I'll check it out but I'm just following the big three (OCaml, F# and
Haskell) rather than inventing anything.

Jon Harrop

Nov 17, 2007, 8:22:23 PM
Lauri Alanko wrote:
> This means that a ref cell is a process that runs in a loop
> continuously, handling get- and set-requests. Upon a set-request, it
> just tail-calls itself with a new value. The "state" of the cell is in
> the environment of the function. So there's no more mutation than in
> an ordinary tail-recursive loop.

For all intents and purposes that looks exactly like mutation to me. In
particular, it is just as dangerous because reliance upon state will be
implicitly littered throughout the code.

Lauri Alanko

Nov 17, 2007, 8:57:39 PM
In article <13jv5f9...@corp.supernews.com>,

Jon Harrop <use...@jdh30.plus.com> wrote:
> For all intents and purposes that looks exactly like mutation to me. In
> particular, it is just as dangerous because reliance upon state will be
> implicitly littered throughout the code.

That's a bizarre remark, coming out of the blue. "Will be littered"?
When we are trying to do what with what?

You'd better make clear exactly what kinds of requirements you're
envisioning. Presumably you're seeing message-passing concurrency as
one way to fulfill those requirements, and also presumably your
requirements do not logically imply reliance on state, or otherwise
your comment would have been vacuous. But it's impossible to say
anything more without knowing what you're talking about.


Lauri

Stephen J. Bevan

Nov 17, 2007, 9:14:52 PM
Paul Rubin <http://phr...@NOSPAM.invalid> writes:
> [snip]

> I'll invent a rule of thumb: a new solution is good when it cleans up
> the mess and bugginess of the old approach and costs no worse than a
> factor of 2 in performance.

Does Python pass the factor of 2 test compared to C? It doesn't in
the Shootout results. Does that mean Python is not good as defined
above? If so, why are people writing Python programs if the rule of
thumb is relevant?

> Are you saying that message passing with futexes comes within a
> factor of 2 of shared memory concurrency, including on
> memory-access-intensive classes of problems?

It may or it may not depending on the problem, just as it may or may
not when comparing say Python vs C or C vs assembler. Whether that
matters depends on what weight you give to this rule of thumb
compared with other measures such as productivity (real or perceived).

Jon Harrop

Nov 17, 2007, 9:06:15 PM
Lauri Alanko wrote:
> In article <13jv5f9...@corp.supernews.com>,
> Jon Harrop <use...@jdh30.plus.com> wrote:
>> For all intents and purposes that looks exactly like mutation to me. In
>> particular, it is just as dangerous because reliance upon state will be
>> implicitly littered throughout the code.
>
> That's a bizarre remark, coming out of the blue.

Take your code:

1> import(ref).
ok
2> R = ref:new(foo).
<0.33.0>
3> ref:get(R).
foo
4> ref:set(R, bar).
{set,bar}
5> ref:get(R).
bar

and translate it into the completely naive low-level nasty mutating style of
coding that we are trying to avoid:

let r = ref "foo"
!r
r := "bar"
!r

Notice how it is exactly the same.

> When we are trying to do what with what?

Write robust code using concurrent constructs. The code you gave is a
sequence of side-effecting statements. I'd like to see if the same
functionality can be obtained efficiently by composing purely functional
expressions because that will make it much easier to write correct
programs.

I also have other reservations about this design. In particular, it seems to
be asymmetric. For example, what happens when I want to compute the union
of two sets?

Lauri Alanko

Nov 17, 2007, 9:56:39 PM
In article <13jv81h...@corp.supernews.com>,

Jon Harrop <use...@jdh30.plus.com> wrote:
> and translate it into the completely naive low-level nasty mutating style of
> coding that we are trying to avoid:
>
> let r = ref "foo"
> !r
> r := "bar"
> !r
>
> Notice how it is exactly the same.

Yes, that was my purpose: to show how a traditional ref cell can be
implemented in Erlang if you really want to. Though of course the
Erlang ref is thread-safe.

> > When we are trying to do what with what?
>
> Write robust code using concurrent constructs. The code you gave is a
> sequence of side-effecting statements.

The only side effects in Erlang are the message passing and receiving
operations. Even whether they count as side effects depends on the
definition: they certainly depend on a global state (the state or
continuation of other processes at the time of calling), but not on a
traditional "store" like ref cells do.

> I'd like to see if the same
> functionality can be obtained efficiently by composing purely functional
> expressions because that will make it much easier to write correct
> programs.

Your requirements are contradictory. I cannot imagine what
"concurrency" would mean in a purely functional setting. Concurrency
implies the existence of independent processes/threads/whatever that
can somehow affect each other's behavior. Any operation by one process
that depends on what another process has done cannot be referentially
transparent.

Perhaps you're after _parallelism_, i.e. distributing workload? This
can certainly be done in a purely functional setting:

http://haskell.org/haskellwiki/GHC/Data_Parallel_Haskell

But still, it has nothing to do with concurrency.

> I also have other reservations about this design. In particular, it seems to
> be asymmetric. For example, what happens when I want to compute the union
> of two sets?

Then you compute the union of two sets. Perhaps you again left out
some requirements?

If you mean that there are two processes, both of which can be
requested for set data, then you can just ask a (functional) set value
from each of them and do the actual union purely functionally.

Or perhaps you were thinking of some transactional requirements? These
are very real, of course, and to my knowledge there are no magic
bullets for managing transactions in a truly distributed environment.
STM is neat, but it doesn't scale beyond a single address space.
That's why I think it's only an interim solution.


Lauri

Paul Rubin

Nov 17, 2007, 10:41:54 PM
ste...@dino.dnsalias.com (Stephen J. Bevan) writes:
> Does Python pass the factor of 2 test compared to C? It doesn't in
> the Shootout results. Does that mean Python is not good as defined
> above? If so, why are people writing Python programs if the rule of
> thumb is relevant?

At the types of problems C was intended to solve (writing low level
programs with performance comparable to assembler), Python is
absolutely terrible, not even on the scale. C then started being used
for other types of purposes--often by programmers who weren't familiar
with alternatives like Lisp--at applications for which expressiveness
mattered more than performance. C was lousy for those applications
and Python turned out to be pretty good for them.

Once we're talking about performance, we're back in C's playing field.
And I think the subject currently at hand is multiprocessor
parallelism, i.e. making your program run some reasonable fraction of
4x as fast if you have 4 cpu's. If your program is in Python you can
make it run at least 10x as fast by choosing a different but
comparably expressive language in the first place, I'd say Python is a
non-starter in this area. C++ seems to be the current mainstream
choice for complex parallel non-numeric apps (dunno about numerics: I
know about Jon but he's non-mainstream). Functional language
aficionados point to shootouts that say comparable results can be
achieved with the ML family and/or Haskell or even some Lisps, so this
is mainly what I'm asking about. Python is currently not in the running.

Stephen J. Bevan

Nov 18, 2007, 3:03:01 AM
Paul Rubin <http://phr...@NOSPAM.invalid> writes:
> [snip]
> And I think the subject currently at hand is multiprocessor
> parallelism, i.e. making your program run some reasonable fraction of
> 4x as fast if you have 4 cpu's.
> [snip]

Indeed, so that leads us to what is a reasonable fraction and what
does it cost to achieve? There is no single answer to those
questions. However, I would argue that anyone who chooses pure shared
memory for performance reasons without benchmarking a (shared memory)
message passing solution is guilty of premature optimization.

Paul Rubin

Nov 18, 2007, 4:17:38 AM
ste...@dino.dnsalias.com (Stephen J. Bevan) writes:
> Indeed, so that leads us to what is a reasonable fraction and what
> does it cost to achieve? There is no single answer to those
> questions. However, I would argue that anyone who chooses pure shared
> memory for performance reasons without benchmarking a (shared memory)
> message passing solution is guilty of premature optimization.

Well, I haven't seen any benchmarks yet comparing the two. Anyway the
best outcome is to find good abstractions that make pure shared memory
easier to manage. Transactions are certainly an improvement over ad
hoc locking schemes, but they apparently are still not enough.

Vesa Karvonen

Nov 18, 2007, 6:51:17 AM
Paul Rubin <http://phr...@nospam.invalid> wrote:
[...]
> Anyway the best outcome is to find good abstractions that make pure
> shared memory easier to manage. [...]

Why on earth would that be the "best outcome"? Seriously, you
essentially want to shoot yourself in the foot by being able to
"manage" sharing even more complicated data structures. Frankly, STM
gives me the shivers. Not because it wouldn't make shared memory
concurrency more manageable (it does, and can be useful in special
circumstances), but because the whole idea of mutable shared memory
concurrency is absolutely horrible. I mean, if mutable variables are
bad even in a single-threaded context (indeed, doesn't functional
programming matter anymore?), then how can anyone in their right mind
claim that the best way to program concurrently is to share mutable
state across threads? That is hubris-flavored crazy talk. Even with
the guarantees given by transactions, the programming model is
absolutely worse than single-threaded imperative programming --- a
model of programming whose (over)use in this group is rightly frowned
upon.

-Vesa Karvonen

Jon Harrop

Nov 18, 2007, 7:08:10 AM
Lauri Alanko wrote:
> In article <13jv81h...@corp.supernews.com>,
> Jon Harrop <use...@jdh30.plus.com> wrote:
>> I'd like to see if the same
>> functionality can be obtained efficiently by composing purely functional
>> expressions because that will make it much easier to write correct
>> programs.
>
> Your requirements are contradictory. I cannot imagine what
> "concurrency" would mean in a purely functional setting. Concurrency
> implies the existence of independent processes/threads/whatever that
> can somehow affect each other's behavior. Any operation by one process
> that depends on what another process has done cannot be referentially
> transparent.

F#'s asynchronous workflows appear to provide what I described. In the
simplest case, you can just take the let binding of an expression:

let x = ...

and have it evaluated concurrently:

let! x = ...

The code can still be purely functional. Any interaction between
processes/threads/whatever is completely abstracted away.

> Perhaps you're after _parallelism_, i.e. distributing workload? This
> can certainly be done in a purely functional setting:
>
> http://haskell.org/haskellwiki/GHC/Data_Parallel_Haskell
>
> But still, it has nothing to do with concurrency.

Well, it provides a "concurrent, high-performance library of strict,
segmented arrays, which we call package ndp".

>> I also have other reservations about this design. In particular, it seems
>> to be asymmetric. For example, what happens when I want to compute the
>> union of two sets?
>
> Then you compute the union of two sets. Perhaps you again left out
> some requirements?
>
> If you mean that there are two processes, both of which can be
> requested for set data, then you can just ask a (functional) set value
> from each of them and do the actual union purely functionally.

Let me rephrase: how would you implement an efficient and concurrent set
union?

> Or perhaps you were thinking of some transactional requirements?

I was thinking: if each data structure is owned by a thread, how do you
combine data structures to create new ones? Set union is one example of
doing that: I don't want to just fetch elements sequentially (that is
inefficient).

> These
> are very real, of course, and to my knowledge there are no magic
> bullets for managing transactions in a truly distributed environment.
> STM is neat, but it doesn't scale beyond a single address space.
> That's why I think it's only an interim solution.

Ok.

Marshall

Nov 18, 2007, 11:40:57 AM
On Nov 18, 1:17 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
>
> Well, I haven't seen any benchmarks yet comparing the two. Anyway the
> best outcome is to find good abstractions that make pure shared memory
> easier to manage. Transactions are certainly an improvement over ad
> hoc locking schemes, but they apparently are still not enough.

Why do you say they're not enough?

My view:

1) Shared mutable memory should be used the least amount possible
2) The least amount possible is nonzero, but much smaller than is
currently standard practice
3) Transactions provide the simplest possible way to do this

I really do mean the simplest *possible* way, too.


Marshall

Stephen J. Bevan

Nov 18, 2007, 12:09:47 PM
Paul Rubin <http://phr...@NOSPAM.invalid> writes:
> ste...@dino.dnsalias.com (Stephen J. Bevan) writes:
>> Indeed, so that leads us to what is a reasonable fraction and what
>> does it cost to achieve? There is no single answer to those
>> questions. However, I would argue that anyone who chooses pure shared
>> memory for performance reasons without benchmarking a (shared memory)
>> message passing solution is guilty of premature optimization.
>
> Well, I haven't seen any benchmarks yet comparing the two.

And in the absence of third-party benchmarks results what do you do?
The best thing is to do the benchmark yourself for your application.
If you can't/won't do that then what's the next logical choice? To me
it is obviously to use message passing, just as I'd obviously use a
higher level language over assembly unless there was a compelling case
for assembly.

> Anyway the best outcome is to find good abstractions that make pure
> shared memory easier to manage.

Message passing implemented on shared memory does that.

ross...@ps.uni-sb.de

Nov 18, 2007, 12:32:14 PM
On 18 Nov., 03:56, Lauri Alanko <l...@iki.fi> wrote:
>
> Yes, that was my purpose: to show how a traditional ref cell can be
> implemented in Erlang if you really want to. Though of course the
> Erlang ref is thread-safe.

Mh, I don't understand. In which sense is a traditional ML ref less
thread-safe?

- Andreas

Lauri Alanko

Nov 18, 2007, 12:41:04 PM
In article <13k0ba4...@corp.supernews.com>,

Jon Harrop <use...@jdh30.plus.com> wrote:
> F#'s asynchronous workflows appear to provide what I described. In the
> simplest case, you can just take the let binding of an expression:
>
> let x = ...
>
> and have it evaluated concurrently:
>
> let! x = ...

Presumably x returns a future that will be forced only when x is
actually inspected?

> The code can still be purely functional. Any interaction between
> processes/threads/whatever is completely abstracted away.

Right. Similar constructs can be found in e.g. GHC and Alice.

But the reason these kinds of computations can be parallelized
automatically is that there really is no "interaction" as such: the
spawned process gets all its input upon startup, and it produces just
a single value. This is vastly different from the situations where
concurrency is required.
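
Such a future is easy to sketch with an MVar in Haskell (the helper
name is invented):

import Control.Concurrent
import Control.Exception (evaluate)

-- Spawn a computation and hand back a "future": an action that
-- blocks until the spawned thread has filled the result slot.
future :: IO a -> IO (IO a)
future act = do
  slot <- newEmptyMVar
  _ <- forkIO (act >>= putMVar slot)
  return (readMVar slot)

main :: IO ()
main = do
  force <- future (evaluate (sum [1 .. 1000000 :: Int]))
  putStrLn "running concurrently with the sum"
  force >>= print                     -- forcing waits for the result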

> > Perhaps you're after _parallelism_, i.e. distributing workload? This
> > can certainly be done in a purely functional setting:
> >
> > http://haskell.org/haskellwiki/GHC/Data_Parallel_Haskell
> >
> > But still, it has nothing to do with concurrency.
>
> Well, it provides a "concurrent, high-performance library of strict,
> segmented arrays, which we call package ndp".

All right, granted, perhaps the terminology isn't followed as
rigorously as I'd like. :) I'm using the terms as defined here:

http://www.haskell.org/ghc/docs/latest/html/users_guide/lang-parallel.html

And here:

http://www.haskell.org/ghc/docs/latest/html/users_guide/using-smp.html

    There's a fine distinction between concurrency and
    parallelism: parallelism is all about making your program run
    faster by making use of multiple processors simultaneously.
    Concurrency, on the other hand, is a means of abstraction: it
    is a convenient way to structure a program that must respond
    to multiple asynchronous events.

These are orthogonal concepts, since a multi-threaded program can be
run on a single processor with no parallelism, and a threadless
program can be run transparently on multiple processors (like in your
example).

It seems clear that, by this definition, you are talking about
parallelism, whereas this thread has (until now) been about
concurrency: if your problem is naturally represented as multiple
simultaneously running interacting processes, what's the best
discipline for programming them?

> Let me rephrase: how would you implement an efficient and concurrent set
> union?

This depends quite a bit both on the set representation used and on the
number of nodes: are all processes in a single node so an immutable
data structure can be passed by reference, or are there multiple nodes
between which messages must be passed by value?

In any case, I can't think of any obvious answer offhand. It's a
difficult problem.

> I was thinking: if each data structure is owned by a thread, how do you
> combine data structures to create new ones? Set union is one example of
> doing that: I don't want to just fetch elements sequentially (that is
> inefficient).

If processes are located at the same node (like they are in
traditional multithreaded programming), then data structures aren't
"owned" by anyone: since they are immutable, references to them can be
cheaply passed to other processes.

The problem is that such usage of shared memory, even under the hood,
becomes relatively more and more expensive all the time. Processors
and cores get more and more complicated memory hierarchies, and access
to the shared RAM requires locks and cache flushes and whatnot. This
trend is likely to continue: processors aren't getting much faster any
more, there are just going to be more and more of them.

This is why message passing is now, more than ever, becoming the Right
Way to model concurrency: not only is it semantically far easier to
grasp than shared memory (especially when there are no transactions),
but it also presents the costs more faithfully: communicating between
processes _is_ becoming a very expensive operation. The
message-passing style makes it explicit when you are doing it and
encourages you to design your program to minimize inter-process
communication.

But designing efficient algorithms under such constraints is a new
challenge, and I certainly am no expert on that.


Lauri

Lauri Alanko

Nov 18, 2007, 12:55:14 PM
In article <fa409b76-6c2d-47ad...@w34g2000hsg.googlegroups.com>,

I simply meant that ML as a language doesn't say anything about
concurrency. If we forget GC issues for the moment and assume that
someone has taken a correct SML compiler for single-threaded code and
simply slapped pthreads bindings on top of that to make several
threads run SML code in the same address space simultaneously, then
all bets are off about what happens when several threads access the
same ref at the same time, since all the code is compiled with the
assumption that no one else is accessing or modifying the contents of
the heap.

Effectively I just meant that "threads cannot be implemented as a
library": http://www.hpl.hp.com/techreports/2004/HPL-2004-209.html

Of course a reasonable SML implementation, if it is going to support
concurrency, should provide explicit semantics for concurrent accesses
to refs and other mutable objects, since undefined behavior would be
really un-SMLish.

But still, a _plain_ ML ref is by itself not threadsafe.


Lauri

Isaac Gouy

Nov 18, 2007, 1:01:35 PM
On Nov 17, 5:06 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:

It's a puzzle, but that couple of 32-core machines don't seem to have
been donated to the benchmarks game yet - maybe we need to register as
a charity :-)

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=threadring&lang=all

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=chameneos&lang=all

ross...@ps.uni-sb.de

Nov 18, 2007, 1:34:23 PM
On 18 Nov., 18:55, Lauri Alanko <l...@iki.fi> wrote:
>
> Of course a reasonable SML implementation, if it is going to support
> concurrency, should provide explicit semantics for concurrent accesses
> to refs and other mutable objects, since undefined behavior would be
> really un-SMLish.
>
> But still, a _plain_ ML ref is by itself not threadsafe.

I still don't understand. How could it ever provide a "wrong"
semantics, or one that is not thread-safe? It certainly has to make
sure that access to refs is atomic, otherwise it would actually be
memory-unsafe (because accessing a partially updated ref is likely to
crash your process if it is a pointer).

So, access certainly has to be atomic. That is about all the Erlang
encoding gives you either. I don't see a difference.

Of course, a properly designed concurrent dialect of ML will also have
an atomic /exchange/ operation on references, in order to be able to
build thread-safe abstractions /on top/ of refs. But that's rather a
different issue (and your Erlang snippet didn't provide it either).

- Andreas

Neelakantan Krishnaswami

Nov 18, 2007, 1:31:17 PM
In article <<7xzlxc3...@ruckus.brouhaha.com>>,
Paul Rubin <> wrote:
> Vesa Karvonen <vesa.k...@cs.helsinki.fi> writes:
>> Your analogy is bogus. Message passing between threads that have
>> shared memory can be made very efficient (i.e. no extra copying, just
>> update message queue with pointer to immutable message and signal the
>> recipient) --- orders of magnitude faster than passing messages
>> between separate processes (without shared memory) or machines.
>
> If it's that simple there must be lots of systems that actually do it
> that way. So I'd be interested in seeing some examples, especially
> examples with measured performance numbers. I'm concerned not only
> with the overhead of actual message passing, but with the latency of
> waiting for a response from the other process.

Transactional memory is the *most* expensive of the proposed
concurrency primitives, and the one that's *least* compatible with the
underlying machine model.

In order of most compatible with how computers actually work to least:

1. Message-passing with ownership transfer. Here, a message is
constructed as a data structure, and the ownership of the memory
associated with it is transferred from one process to another
when the message is sent.

Example: MSR's Singularity research OS

2. Message-passing with explicit copying. As above, except that
the data in the message is copied on message send.

Example: Erlang

3. Message-passing with purely functional data and a shared heap.
Message passing is just as cheap as it is in 1, but you need
parallel garbage collection to maintain the fiction of a
global heap. GC then becomes the (only) concurrency bottleneck
in your program.

Example: Concurrent ML

4. Multiple threads with shared heap, and access control via locks,
semaphores, monitors, and the usual zoo.

The expense is as #3, except that mutation means that you create
lots of extra concurrency bottlenecks for yourself on top of the
one imposed by global heap management.

Examples: C, Java, etc.

5. Software transactional memory. You have shared memory and threads,
and optimistic concurrency control to detect interference and
rollback and restart ruined computations. The expense is the
highest here, since the primitives have to do so much work, and
since their efficient behavior relies on very particular
assumptions about the workload.

Example: the STM libraries for Haskell and C#

Basically, NUMA is real. Global shared memory isn't, and every
generation the CPU vendors dramatically increase the complexity of
their cache coherence circuitry, and weaken their memory consistency
guarantees a bit more. Since this can't go on, it won't.

#1 is the slickest, because the cost of message passing is basically
the cost of transferring a single word between processes. There are
*no* costs here that don't arise from the underlying hardware.

#2 is what you do when you don't have the type structure to enforce
safe ownership transfer, and if you're sending large messages back and
forth it can be painful to copy a message.

But: it shares with #1 a key, critical advantage -- strong process
isolation means that each process can get its own garbage collector,
which runs independently of all the other processes' garbage
collectors. So you can scale out to arbitrary numbers of processors
without having to radically re-architect your language implementation.

#3 isn't that bad -- the Cheng-Blelloch garbage collector scales
near-linearly out to about 64 processors -- especially since in a
purely functional code there aren't going to be any *other*
concurrency bottlenecks, and you don't have to pay for message
copying. IMO, it makes sense to have something like the Cheng-Blelloch
collector for each of the isolated processes, because that will let
you play nicely with highly parallelizable codes like NESL-style
programs.

IMO, 4 and 5 are basically hopeless. They provide an abstraction that
is out of touch with the underlying hardware, AND is harder to
program. Why would anyone possibly want this?

--
Neel R. Krishnaswami
ne...@cs.cmu.edu

David B. Benson

Nov 18, 2007, 1:52:50 PM
On Nov 18, 9:55 am, Lauri Alanko <l...@iki.fi> wrote:
...

>
> Of course a reasonable SML implementation, if it is going to support
> concurrency, should provide explicit semantics for concurrent accesses
> to refs and other mutable objects, since undefined behavior would be
> really un-SMLish.
>

John Reppy's book has a chapter on the semantics of Concurrent ML for
a uni-core machine.

============
But back to the question of how costly message passing is, Joe
Armstrong's book simply states that it is very fast. Looking at
Reppy's implementation of CML in SML/NJ makes it clear that only a few
dozen (maybe twelve to twenty) machine instructions are required. So
there is essentially the same cost as for mutex locks, because that is
what the CML implementation does. This way of implementing message
passing concurrency clearly could take advantage of a multi-core
design.

The right way to think about concurrent threads is that each is
designed to accomplish an intellectually manageable portion of the
entire task, sending messages to other threads to acquire some
information and finally sending a reply to whichever thread asked for
some information. This approach makes designing each thread fairly
easy. The run-time cost is minor for applications which are not
primarily computationally intensive.

As for the set union example, not knowing the purpose of computing the
union, I'd design thusly: two separate threads each hold a
(dynamically changing) set; a third thread sends a message to each of
the other two requesting a set and, when the replies are received,
computes the union by merging (or whatever functional algorithm you
prefer).
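
A minimal Haskell rendering of that design, assuming a single address
space so the replies are shared pointers to immutable sets:

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
import qualified Data.Set as Set

-- An owner thread that holds a set and answers snapshot requests;
-- a request is simply the MVar on which the reply should be posted.
holder :: Set.Set Int -> MVar (MVar (Set.Set Int)) -> IO ()
holder s inbox = do
  reply <- takeMVar inbox
  putMVar reply s                 -- sharing the immutable set is O(1)
  holder s inbox

main :: IO ()
main = do
  in1 <- newEmptyMVar
  in2 <- newEmptyMVar
  _ <- forkIO (holder (Set.fromList [1, 2]) in1)
  _ <- forkIO (holder (Set.fromList [2, 3]) in2)
  r1 <- newEmptyMVar
  r2 <- newEmptyMVar
  putMVar in1 r1                  -- ask both holders for their sets
  putMVar in2 r2
  s1 <- takeMVar r1
  s2 <- takeMVar r2
  print (Set.union s1 s2)         -- fromList [1,2,3]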

Lauri Alanko

Nov 18, 2007, 2:14:47 PM
In article <df5747cc-e7f0-4fbc...@o6g2000hsd.googlegroups.com>,

<ross...@ps.uni-sb.de> wrote:
> I still don't understand. How could it ever provide a "wrong"
> semantics, or one that is not thread-safe? It certainly has to make
> sure that access to refs is atomic, otherwise it would actually be
> memory-unsafe (because accessing a partially updated ref is likely to
> crash your process if it is a pointer).

Right, that's the simplest requirement. But atomicity alone doesn't
yet buy much: it guarantees that when you access the ref you get
_some_ valid value, but usually you also want to be able to predict
something about the value you get.

For instance, here's Boehm's example from Section 4.1 of his paper,
translated to SML:

val x = ref 0
val y = ref 0

fun f () =
    if !x = 1
    then y := !y + 1
    else ()

fun g () =
    if !y = 1
    then x := !x + 1
    else ()

One could assume that if f and g are run, whether sequentially or
concurrently in separate threads, then afterwards !x = 0 and !y = 0,
since both the if-tests must fail.

However, an SML compiler could quite legitimately (for bizarre
optimization purposes, or just for fun) transform the functions into:

fun f' () =
    (y := !y + 1;
     if !x <> 1
     then y := !y - 1
     else ())

fun g' () =
    (x := !x + 1;
     if !y <> 1
     then x := !x - 1
     else ())

In a single-threaded environment this is a valid transformation, but
if we run both the functions concurrently we may end up with !x = 1
and !y = 1, hardly something that we would have expected with the
original code.

So, a multithreaded SML implementation must not only ensure that
accesses to refs are atomic, but also that ordering of those
accesses is preserved during translation. Even then, there are quite a
number of subtleties that are not at all obvious. Just look at all the
work that the Java memory model required.

The Erlang version has none of these problems, because each process is
the sole mutator of its own state, and code may be optimized with this
assumption in mind. (Not that Erlang processes even have much state in
the first place.) The interactions between the processes are done by
explicit sequence passing, and those are guaranteed not to be
reordered in a way that would change their semantics.


Lauri

Lauri Alanko

Nov 18, 2007, 2:44:51 PM
In article <slrnfk117l...@gs3106.sp.cs.cmu.edu>,

Neelakantan Krishnaswami <ne...@cs.cmu.edu> wrote:
> 5. Software transactional memory. You have shared memory and threads,
> and optimistic concurrency control to detect interference and
> rollback and restart ruined computations. The expense is the
> highest here, since the primitives have to do so much work, and
> since their efficient behavior relies on very particular
> assumptions about the workload.
>
> Example: the STM libraries for Haskell and C#

> IMO, 4 and 5 are basically hopeless. They provide an abstraction that
> is out of touch with the underlying hardware, AND is harder to
> program. Why would anyone possibly want this?

STM is certainly expensive, but I'm not convinced that it's _that_
hard to program in. STM is effectively a way to get semantically
serialized (i.e. exclusive) access to shared state without actually
blocking all the other threads. Serialization is still somewhat
understandable: it's the same level of consistency you get with
cooperative uniprocessor threads. (Granted, that's probably the _only_
shared-memory model that I consider understandable.)

And though STM's expensiveness stems from the fact that it manages
global state, this also makes it powerful: message-passing doesn't
have direct support for transactions like STM does. If you want to do
the bank transfer example (atomically withdraw some amount from one
account and deposit it to another), then without transactions it's
non-trivial to ensure both atomicity and lack of deadlocks if the
accounts are managed by different processes. If there's some standard
solution I'd like to hear it.


Lauri
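
For reference, the transfer really is a short transactional
composition in GHC's STM; a minimal sketch with hypothetical balances:

import Control.Concurrent.STM

type Account = TVar Integer

-- The composed transaction is atomic as a whole: there is no lock
-- ordering to design and no deadlock, because a conflicting
-- transaction is simply rolled back and re-run.
transfer :: Account -> Account -> Integer -> STM ()
transfer from to amount = do
  balance <- readTVar from
  check (balance >= amount)     -- check b = if b then return () else retry
  writeTVar from (balance - amount)
  modifyTVar' to (+ amount)

main :: IO ()
main = do
  a <- newTVarIO 100
  b <- newTVarIO 0
  atomically (transfer a b 60)
  mapM_ (\v -> readTVarIO v >>= print) [a, b]   -- 40 then 60

Between accounts owned by different processes, though, no such simple
composition exists, which is exactly Lauri's point.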

Joachim Durchholz

Nov 18, 2007, 3:20:11 PM
Florian Weimer wrote:
> I don't know what to do in the case that is bound by per-CPU I/O
> bandwidth, this is difficult to parallelize using any method.

Is that a relevant case?

Regards,
Jo

Joachim Durchholz

Nov 18, 2007, 3:24:55 PM
Paul Rubin wrote:

> Are you saying that message passing with
> futexes comes within a factor of 2 of shared memory concurrency,
> including on memory-access-intensive classes of problems?

If you use shared memory, you need a semaphore somewhere.
On Linux, the fastest available variant of a semaphore is a futex.
In other words, there's no real difference.

Regards,
Jo

Paul Rubin

Nov 18, 2007, 3:25:58 PM
ste...@dino.dnsalias.com (Stephen J. Bevan) writes:
> > Well, I haven't seen any benchmarks yet comparing the two.
> And in the absence of third-party benchmarks results what do you do?

Make the best estimates I can based on how the underlying mechanisms
work. I.e. a shared memory transaction involves a single
compare-and-swap instruction plus a few other instructions that don't
lock the bus. The message passing approach involves waiting for a
response from the other process. That either means context switches
or dedicating an entire cpu to the other process, which may then have
to spin away on multiple message queues.

> The best thing is to do the benchmark yourself for your application.

This is a language implementation discussion, it's not about a
specific application, it's about how to deal with this problem in
general, including for the difficult situations whatever those turn
out to be. For that matter, benchmarks aren't really possible, though
simulations might be. This is because we're trying to figure out the
best way to handle hardware designs that are still coming down the
pike and aren't yet available to benchmark.

> If you can't/won't do that then what's the next logical choice? To me
> it is obviously to use message passing, just as I'd obviously use a
> higher level language over assembly unless there was a compelling case
> for assembly.

There is rarely a case for assembly, based on decades of experimental
measurements showing that good compilers can do almost as well as
assembly, even for the demanding cases. The non-demanding cases
aren't interesting (those are the cases that can use languages like
Python). I mean, really, your advice is sort of vacuous. If my
application isn't demanding then I should just code it in Python in a
single thread on one processor. That's sufficient a heck of a lot of
the time. Once I'm interested in parallelism at all, I've already
departed from the space of applications where I can stand big
performance losses due to the concurrency mechanism.

> > Anyway the best outcome is to find good abstractions that make pure
> > shared memory easier to manage.
> Message passing implemented on shared memory does that.

It's not established whether the performance hit is tolerable in the
demanding cases. That means comparable to going from assembler to C,
not going from assembler to Python.

Thomas Lindgren

unread,
Nov 18, 2007, 3:40:40 PM11/18/07
to

Neelakantan Krishnaswami <ne...@cs.cmu.edu> writes:

> 2. Message-passing with explicit copying. As above, except that
> the data in the message is copied on message send.
>
> Example: Erlang

Well, to be precise, Erlang does copying on the conceptual
level. There are non-copying and partially-copying Erlang
implementations, and even the standard implementation shares some data
between same-VM processes. Such tradeoffs are really not language
issues, but implementation issues.

Best,
Thomas
--
Thomas Lindgren "Everybody wants to be a god. Who is
willing to be the plaything of others?"
-- Gao Xinjiang

Paul Rubin

unread,
Nov 18, 2007, 3:43:04 PM11/18/07
to
Joachim Durchholz <j...@durchholz.org> writes:
> > I don't know what to do in the case that is bound by per-CPU I/O
> > bandwidth, this is difficult to parallelize using any method.
> Is that a relevant case?

I always heard this is the situation with the number field sieve, for
example.

Paul Rubin

unread,
Nov 18, 2007, 3:46:14 PM11/18/07
to
Joachim Durchholz <j...@durchholz.org> writes:
> If you use shared memory, you need a semaphore somewhere.
> On Linux, the fastest available variant of a semaphore is a futex.
> In other words, there's no real difference.

The issue is what happens at the other end of the semaphore. I can
understand the actual message passing to be very fast, but how do you
organize the remote process to respond to the message instantaneously?
And don't you now need another message to send the response? What if
you want to do several operations on the shared structure--do you need
a message pair for each operation?

Lauri Alanko

unread,
Nov 18, 2007, 4:06:45 PM11/18/07
to
In article <7x8x4vq...@ruckus.brouhaha.com>,

Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> The issue is what happens at the other end of the semaphore. I can
> understand the actual message passing to be very fast, but how do you
> organize the remote process to respond to the message instantaneously?

You don't. You leave your message in the remote process's mailbox, and
it will read it when it has a spare moment.

> And don't you now need another message to send the response?

Only if you require a response. Often it is enough just to send the
message and rest assured with the knowledge that the other process
will eventually read it.

Message-passing is _asynchronous_. You can implement a synchronous
protocol on top of it, but that's not very efficient, as you point
out.

> What if
> you want to do several operations on the shared structure--do you need
> a message pair for each operation?

Optimally you structure your program so that the combination of
several operations can be done by the remote process with a single call.
For further flexibility, you can send to the remote process a function
that operates on its data structure. In distributed systems this
requires code marshalling, but that's not unheard of.
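
As a minimal sketch of that style (in Haskell, with a Chan playing the
role of the mailbox; the counter process and its message type are made
up for illustration):

    import Control.Concurrent (forkIO)
    import Control.Concurrent.Chan
    import Control.Concurrent.MVar

    -- Messages either update the owner's state or ask for a reply.
    data Msg = Update (Int -> Int)    -- fire-and-forget
             | Get (MVar Int)         -- reply requested via an MVar

    counter :: Chan Msg -> Int -> IO ()
    counter mailbox state = do
      msg <- readChan mailbox         -- read when there's a spare moment
      case msg of
        Update f  -> counter mailbox (f state)
        Get reply -> putMVar reply state >> counter mailbox state

    main :: IO ()
    main = do
      mailbox <- newChan
      _ <- forkIO (counter mailbox 0)
      writeChan mailbox (Update (+ 1))   -- asynchronous: returns at once
      writeChan mailbox (Update (* 10))
      reply <- newEmptyMVar
      writeChan mailbox (Get reply)      -- a response only when needed
      takeMVar reply >>= print           -- prints 10

The Update messages carry a function for the owning process to apply,
which is exactly the "send it an operation" idea above; only the final
Get pays for a reply.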


Lauri

Paul Rubin

unread,
Nov 18, 2007, 4:09:37 PM11/18/07
to
Neelakantan Krishnaswami <ne...@cs.cmu.edu> writes:
> 1. Message-passing with ownership transfer. Here, a message is
> constructed as a data structure, and the ownership of the memory
> associated with it is transferred from one process to another
> when the message is sent.

Hmm, you're saying the message is some big data structure consed out
of nothingness? If yes, is that better than copying? If not, what
happens if it shares structure with something else in the local
process?

> 3. Message-passing with purely functional data and a shared heap.
> Message passing is just as cheap as it is in 1, but you need
> parallel garbage collection to maintain the fiction of a
> global heap. GC then becomes the (only) concurrency bottleneck
> in your program.

I agree this is very clean conceptually. But, I think you either need
a lot more messages than shared-memory approaches would need, or else
you have to push what would be local computation into the remote
process, sacrificing parallelism.

> 5. Software transactional memory. You have shared memory and threads,
> and optimistic concurrency control to detect interference and
> rollback and restart ruined computations. The expense is the
> highest here, since the primitives have to do so much work, and
> since their efficient behavior relies on very particular
> assumptions about the workload.

In the uncontended case the primitives don't do much, and measurements
in the original papers show STM beats the traditional locking approach
on some big SMP machines. Maybe there are specific workloads that can
make STM lose to message passing.

> #2 is what you do when you don't have the type structure to enforce
> safe ownership transfer, and if you're sending large messages back and
> forth it can be painful to copy a message.

This is really interesting, are you saying #1 uses something like
linear types to keep track of which data belongs to which process?

Vesa Karvonen

unread,
Nov 18, 2007, 4:25:59 PM11/18/07
to
Lauri Alanko <l...@iki.fi> wrote:
> In article <7x8x4vq...@ruckus.brouhaha.com>,
> Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> > The issue is what happens at the other end of the semaphore. I can
> > understand the actual message passing to be very fast, but how do you
> > organize the remote process to respond to the message instantaneously?

> You don't. You leave your message in the remote process's mailbox, and
> it will read it when it has a spare moment.

That would be the typical situation.

> > And don't you now need another message to send the response?

Not necessarily --- although that would depend on your definitions.
It is easy to imagine and implement a "rendezvous channel", where
there are no separate send/recv operations, but a single rendezvous
operation that sends and receives at the same time. When one party
arrives before the other, it is blocked until the other party arrives
to exchange messages.

> Message-passing is _asynchronous_.

Not necessarily. For example, Concurrent ML provides synchronous
message passing
(http://cml.cs.uchicago.edu/pages/cml.html#SIG:CML.chan:TY). This
means that the sender is blocked until the recipient is ready to
receive the message. Of course, synchronous channels are not the only
primitive that CML offers.

> You can implement a synchronous protocol on top of it, but that's
> not very efficient, as you point out.

It is trivial to implement asynchronous message passing on top of
synchronous message passing: just spawn a thread to send the message.
(In CML, spawning a thread is cheap.) However, CML also provides
asynchronous message passing
(http://cml.cs.uchicago.edu/pages/mailbox.html).
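
For illustration, here is a toy rendezvous channel built from Haskell
MVars, with asynchronous send recovered by spawning a thread. This is
a sketch only: a real CML channel is fair under contention and also
supports selective communication over events.

    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar

    -- 'send' blocks until a matching 'recv' has taken the value,
    -- so the two parties rendezvous.
    data SyncChan a = SyncChan (MVar a) (MVar ())

    newSyncChan :: IO (SyncChan a)
    newSyncChan = SyncChan <$> newEmptyMVar <*> newEmptyMVar

    send :: SyncChan a -> a -> IO ()
    send (SyncChan slot ack) x = putMVar slot x >> takeMVar ack

    recv :: SyncChan a -> IO a
    recv (SyncChan slot ack) = do
      x <- takeMVar slot
      putMVar ack ()
      return x

    -- Asynchrony on top of synchrony: just spawn a cheap thread.
    sendAsync :: SyncChan a -> a -> IO ()
    sendAsync ch x = () <$ forkIO (send ch x)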

-Vesa Karvonen

Stephen J. Bevan

unread,
Nov 18, 2007, 4:47:35 PM11/18/07
to
Paul Rubin <http://phr...@NOSPAM.invalid> writes:
> ste...@dino.dnsalias.com (Stephen J. Bevan) writes:
>> > Well, I haven't seen any benchmarks yet comparing the two.
>> And in the absence of third-party benchmarks results what do you do?
>
> Make the best estimates I can based on how the underlying mechanisms
> work. I.e. a shared memory transaction involves a single
> compare-and-swap instruction plus a few other instructions that don't
> lock the bus. The message passing approach involves waiting for a
> response from the other process. That either means context switches
> or dedicating an entire cpu to the other process, which may then have
> to spin away on multiple message queues.

Your "i.e." and all that follows is making the same claim that brought
me into this thread, which I addressed in my first response
(<87abpcg...@dnsalias.com>). Your followup didn't challenge
anything I wrote on the particulars of how the locking implementations
would work; instead, you've apparently ignored it.

If you are convinced that message passing does not have tolerable
performance then pick an application that you think will exhibit this.
If you are willing to write the shared memory version I'll write a
message passing version -- I'll write it first if the application
description is sufficiently detailed. I suggest using C since it has
no runtime to speak of and so all the details are transparent. Once
done you can benchmark them on whatever SMP systems you like.

Florian Weimer

unread,
Nov 18, 2007, 4:47:49 PM11/18/07
to
* Joachim Durchholz:

I think garbage collection falls into this category, at least in some
cases.

Paul Rubin

unread,
Nov 18, 2007, 5:26:54 PM11/18/07
to
ste...@dino.dnsalias.com (Stephen J. Bevan) writes:
> Your "i.e." and all that follows is making the same claim that brought
> me into this thread, which I addressed in my first response
> (<87abpcg...@dnsalias.com>). Your followup didn't challenge
> anything I wrote on the particulars of how the locking implementations
> would work instead you've apparently ignored it.

In that response, you wrote:

Under Linux you can implement interprocess message passing using a
futex to protect the message queue and shared memory to hold the
messages. In this case context switching only occurs if there is
contention on the futex or you don't have enough physical CPUs to
support the processes you want to run. Both issues can occur in a
concurrent shared memory program with application specific locks. So,
the difference would come down to lock contention rates and some
additional memory&time to handle the queues.

I thought I made it clear, I'm concerned about the latency of waiting
for an answer to the message. That would be "you don't have enough
physical CPUs to support the process you want to run". The idea of
using shared memory is to have fewer processes, i.e. one process per
CPU, instead of dedicating processes to basically act as gatekeepers
for data structures. Either you're keeping the gatekeeper processes
busy all the time, in which case you're going to have to wait for them
to get around to particular messages; or you're not keeping them busy,
in which case you're not utilizing your cpu hardware completely.

> If you are convinced that message passing does not have tolerable
> performance then pick an application that you think will exhibit this.
> If you are willing to write the shared memory versionn I'll write a
> message passing version -- I'll write it first if the application
> description is sufficiently detailed. I suggest using C since it has
> no runtime to speak of and so all the details are transparent. Once
> done you can benchmark them on whatever SMP systems you like.

I'll think about this. Maybe there's already some code around?
What about the measurements that Keir Fraser did on STM vs locking?
That code is around, I think. A message passing comparison would be
interesting.

Where will the SMP systems come from? I can get at plenty of dual
core Athlon boxes and a few quad core, but nothing bigger than that.

Lauri Alanko

unread,
Nov 18, 2007, 5:56:23 PM11/18/07
to
In article <7xmytby...@ruckus.brouhaha.com>,

Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> I thought I made it clear, I'm concerned about the latency of waiting
> for an answer to the message.

Why is latency such a concern for you? Are you thinking of real-time
applications?

> Either you're keeping the gatekeeper processes
> busy all the time, in which case you're going to have to wait for them
> to get around to particular messages; or you're not keeping them busy,
> in which case you're not utilizing your cpu hardware completely.

That is a truism: if you have a full load, then some processes must
wait for their turn to run. If all your processes are churning away at
full speed, then you don't have a full load. But this is true no
matter what your concurrency model is. So what does this have to do
with message passing?

You somehow seem to be implying that with lots of small processes
scheduling would be somehow worse. But I don't see what this is based
on: the smaller your processes, the finer the granularity with which
you can control the scheduling.


Lauri

Paul Rubin

unread,
Nov 18, 2007, 6:27:35 PM11/18/07
to
Lauri Alanko <l...@iki.fi> writes:
> Why is latency such a concern for you? Are you thinking of real-time
> applications?

No I'm thinking of more like algorithms involving traversing data
structures, where you look at some value, use it to determine which
value you want to look at next, etc.

> That is a truism: if you have a full load, then some processes must
> wait for their turn to run. If all your processes are churning away at
> full speed, then you have don't have a full load. But this is true no
> matter what your concurrency model is. So what does this have to do
> with message passing?

You want a full load with all processes churning away. Are we using
the word process to mean the same thing? See for example

http://www.kegel.com/c10k.html

for how to write highly concurrent servers like this.

> You somehow seem to be implying that with lots of small processes
> scheduling would be somehow worse. But I don't see what this is based
> on: the smaller your processes, the finer the granularity with which
> you can control the scheduling.

If you run more processes than you have cpu's, then you're burning
cycles in context switching. Ideally you want one process per cpu so
you rarely context switch.

Marshall

unread,
Nov 18, 2007, 7:01:10 PM11/18/07
to
On Nov 18, 10:31 am, Neelakantan Krishnaswami <ne...@cs.cmu.edu>
wrote:

>
> IMO, 4 and 5 are basically hopeless. They provide an abstraction that
> is out of touch with the underlying hardware, AND is harder to
> program. Why would anyone possibly want this?

Your analysis confuses me. The claim that #5 is harder to program
seems to me exactly the opposite of what is true. In fact it
seems to me demonstrably the case that transactions are
the simplest, easiest way to handle concurrency--this strikes
me as its most important advantage.

In your 1, 2, and 3, every mutation operation requires the
creation of an ad hoc application-specific mutation message.
This is simply a huge expense in terms of programmer
productivity that is unnecessary with 4 and 5. The problems
with 4 are well-known; I do not champion it, however
5 does not have any of those problems. The *only*
problems that 5 has, as far as the complexity of the
software construction process, are the fundamental,
irreducible ones. Which is why I say it is not merely
the simplest model extant, but the simplest model
possible.


Marshall

Stephen J. Bevan

unread,
Nov 18, 2007, 7:02:02 PM11/18/07
to
Paul Rubin <http://phr...@NOSPAM.invalid> writes:
> I thought I made it clear, I'm concerned about the latency of waiting
> for an answer to the message.

That's a reasonable concern if :-

1. you have real-time requirements.

2. you write in a synchronous style and there is no easy way to do
other useful work while you are blocked waiting for a reply.

3. You write in any style but only have a few processes and they work
in a very synchronous manner.

The first case takes special attention no matter what model you use.
Regarding the second case there are two alternatives :-

1. Write your code using asynchronous message passing so that you
don't wait. The assumption here is that there is always other
useful work to do i.e. something else in the message queue to do or
where the runtime will run another "process" without any kernel
intervention. Call this the Erlang model.

2. Stick with synchronous message passing but couple it with the cheap
process approach of above. Call this the CML (or Ada) model.

Regarding the third case then there is not much hope.


> I'll think about this. Maybe there's already some code around?

Maybe, I'm giving you the choice so that it can't be said that the
test is rigged in favour of message passing.


> What about the measurements that Keir Fraser did on STM vs locking?
> That code is around, I think. A message passing comparison would be
> interesting.

Do you mean the measurements in "Language Support for Lightweight
Transactions" which involve a hash table or passing tokens around? If
so, then the results are only interesting in this context if the
applications you are interested in are such that the amount of CPU you
consume processing the data is of the same order as it costs to lock
and unlock that data. If those are the types of application you are
intested in and high CPU utilization is a concern then then message
passing. If on the other hand you spend say 10x as much CPU
processing the data you receive as it takes to lock/unlock it then the
lock/unlock cost is irrelevant.


> Where will the SMP systems come from? I can get at plenty of dual
> core Athlon boxes and a few quad core, but nothing bigger than that.

That's fine unless you need 8, 16, 32-way boxes to make your case for
pure shared memory.

ross...@ps.uni-sb.de

unread,
Nov 19, 2007, 3:48:19 AM11/19/07
to
On 18 Nov., 20:14, Lauri Alanko <l...@iki.fi> wrote:
> In article <df5747cc-e7f0-4fbc-aba4-1835545d7...@o6g2000hsd.googlegroups.com>,

>
> So, a multithreaded SML implementation must not only ensure that
> accesses to refs are atomic, but also that ordering of those
> accesses is preserved during translation. Even then, there are quite a
> number of subtleties that are not at all obvious. Just look at all the
> work that the Java memory model required.

The concurrent semantics of state has been formally specified for at
least some concurrent extensions of ML (e.g. CML, Alice ML). Does such
a spec even exist for Erlang? There are several multi-threaded ML
implementations out there, and as far as I know, none has been plagued
by the problems you describe.

Even lacking a specific spec for concurrent state, the big-step
semantics of the SML Definition makes the ordering of state changes
very explicit. And while it is true that technically it does not
actually impose that order as long as it is not observable in the
outcome, any extension with concurrency /makes/ the order observable,
and thus an implementation that does not implement it would simply be
broken IMO.

In other words, I would certainly assume the same guarantees about
global state changes in ML as you do about message passing in Erlang.
Everything else seems to be splitting hairs.

Of course, the issues you bring up are more serious in less rigorously
defined languages.

- Andreas

Paul Rubin

unread,
Nov 19, 2007, 4:11:33 AM11/19/07
to
ste...@dino.dnsalias.com (Stephen J. Bevan) writes:
> That's a reasonable concern if :-
> 1. you have real-time requirements.

I wasn't thinking of hard real time (e.g. motor control), just normal
interactive programs, so most approaches should be ok.

> 2. you write in a synchronous style and there is no easy way to do
> other useful work while you are blocked waiting for a reply.

Yes I think this is the situation I'm concerned with.

>> 3. You write in any style but only have a few processes and they work
> in a very synchronous manner.

Right, the idea is to use as many processes as there are physical
cpu's, to cut down on context switching.

> > I'll think about this. Maybe there's already some code around?
> Maybe, I'm giving you the choice so that it can't be said that the
> test is rigged in favour of message passing.

But I'd look for one rigged in favor of shared memory, so it's better
to look for real-world cases.

>
> > What about the measurements that Keir Fraser did on STM vs locking?

> Do you mean the measurements in "Language Support for Lightweight
> Transactions" which involve a hash table or passing tokens around? If
> so, then the results are only interesting in this context if the
> applications you are interested in are such that the amount of CPU you
> consume processing the data is of the same order as it costs to lock
> and unlock that data. If those are the types of application you are
> interested in and high CPU utilization is a concern then message
> passing.

I can't quite parse your last sentence, but yes, that's the cpu
profile I'm imagining.

> That's fine unless you need 8, 16, 32-way boxes to make your case for
> pure shared memory.

Fraser tested on something like a 100-way box so it's of concern.

Vesa Karvonen

unread,
Nov 19, 2007, 6:08:58 AM11/19/07
to
Paul Rubin <http://phr...@nospam.invalid> wrote:
> ste...@dino.dnsalias.com (Stephen J. Bevan) writes:
[...]

> > 2. you write in a synchronous style and there is no easy way to do
> > other useful work while you are blocked waiting for a reply.

> Yes I think this is the situation I'm concerned with.

> > 3. You write in any style but only have a few processes and they work
> > in a very synchronous manner.

> Right, the idea is to use as many processes as there are physical
> cpu's, to cut down on context switching.

Let me clarify a bit. Most higher-level concurrent languages provide
you with very cheap threads. These are essentially user-level threads
that are managed by the language implementation rather than the
kernel. With a good implementation, a user-level thread takes very
little memory and you can have hundreds of thousands of such threads
on current hardware without problems. The language implementation
uses a few (e.g. one per core) kernel threads to run those user-level
threads. At any moment, each of those kernel threads is running a
user-level thread. When a user-level thread blocks, the corresponding
kernel thread does not usually block, but just starts running another
user-level thread. Switching between user-level threads happens
entirely in the user program and is typically much faster than
switching between kernel threads (and processes). The point here is
that even if you would use only synchronous message passing, you can
add as much asynchrony as you want by spawning more threads.
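
As a rough illustration of the scale involved, the following GHC
Haskell program spawns 100,000 user-level threads, each of which sends
one message; on current hardware it completes in a fraction of a
second and in a modest amount of memory:

    import Control.Concurrent (forkIO)
    import Control.Concurrent.Chan
    import Control.Monad (forM_, replicateM_)

    main :: IO ()
    main = do
      ch <- newChan
      -- each forkIO thread costs on the order of a kilobyte of stack
      forM_ [1 .. 100000 :: Int] $ \i -> forkIO (writeChan ch i)
      replicateM_ 100000 (readChan ch)   -- wait for all of them
      putStrLn "100000 threads ran"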

-Vesa Karvonen

Marshall

unread,
Nov 19, 2007, 10:47:41 AM11/19/07
to
On Nov 19, 3:08 am, Vesa Karvonen <vesa.karvo...@cs.helsinki.fi>
wrote:

> Paul Rubin <http://phr...@nospam.invalid> wrote:

I've long thought that was a pretty interesting model, but the
thing that always worries me is runaway threads. In practice,
how much of an issue is that?


Marshall

David B. Benson

unread,
Nov 19, 2007, 1:35:10 PM11/19/07
to


If you mean threads which are computationally intense, there is little
problem, at least in CML. Once each 10 or 20 milliseconds (typical
choices, but programmer adjustable), the ready queue is inspected and
if non-empty, the top thread is enabled instead. So it is a non-issue.

Neelakantan Krishnaswami

unread,
Nov 19, 2007, 6:05:03 PM11/19/07
to
In article <<7x4pfjq...@ruckus.brouhaha.com>>,

Paul Rubin <> wrote:
> Neelakantan Krishnaswami <ne...@cs.cmu.edu> writes:
>> 1. Message-passing with ownership transfer. Here, a message is
>> constructed as a data structure, and the ownership of the memory
>> associated with it is transferred from one process to another
>> when the message is sent.
>>
>> #2 is what you do when you don't have the type structure to enforce
>> safe ownership transfer, and if you're sending large messages back and
>> forth it can be painful to copy a message.
>
> Hmm, you're saying the message is some big data structure consed out
> of nothingness? If yes, is that better than copying? If not, what
> happens if it shares structure with something else in the local
> process?
>
> This is really interesting, are you saying #1 uses something like
> linear types to keep track of which data belongs to which process?

Yes, that's right. That's what maintains strong process isolation.

To understand why isolation is important, it's good to look at an
idealized model of a multicore machine:

CPU 1 . . . CPU n
| |
Cache 1 Cache n
| |
... global memory ...

So we have a global memory, which n CPUs act on. The CPUs each have a
cache to speed access to the global memory. The existence of a cache
means that whenever a CPU does a write to a memory location, all of
the other cores need to be notified, in case they also have that
location in their cache. Otherwise the CPUs might not see a consistent
view of the heap.

To make this communication fast, each CPU needs to be connected to all
the others, which is an O(n^2) dependency. This is an awful lot of
communication to require on every write, which is why hardware vendors
have been steadily weakening the memory consistency guarantees they
offer us programmers. Another way of describing this is that the
abstraction of global shared memory requires a lot of message passing
under the hood -- communication that a message passing architecture a)
makes explicit, and b) potentially lets you avoid doing when you don't
need to.

Now, note that if each CPU has exclusive access to part of the global
heap, then there are no cache consistency problems, since we know that
other CPUs will not make use of that part of the heap. In Singularity,
when processes communicate, they transfer ownership of the message
data when they send a message. This means that even though some data
may be in a process's cache, it won't look at it again in the future.

This is all important, because at the level of hardware, even
functional languages do a lot of mutation. Whenever you use a copying
garbage collector, the pointers representing values change, which
looks like an update from the point of view of a cache!

Erlang doesn't have linearly typed messages, so at a conceptual level
it copies the whole message from one process's heap into another
process's. This also maintains the same process isolation invariant,
at a slightly higher cost.

Now, in the short term, our hardware does have a lot of cache
consistency circuitry, and it's foolish to ignore it. For example,
Thomas Lindgren mentioned in another post that some implementations of
Erlang use a global heap under the covers, and don't actually copy
messages sent between processes on the same node.

This state of affairs won't last, though, and it's unwise to base a
language design on the idea it will always be there. Erlang is future
proof in this regard -- the more I learn about it, the more its design
impresses me. It's hard to make a language that can survive profound
architectural changes with its performance model intact!

>> 3. Message-passing with purely functional data and a shared heap.
>> Message passing is just as cheap as it is in 1, but you need
>> parallel garbage collection to maintain the fiction of a
>> global heap. GC then becomes the (only) concurrency bottleneck
>> in your program.
>
> I agree this is very clean conceptually. But, I think you either need
> a lot more messages than shared-memory approaches would need, or else
> you have to push what would be local computation into the remote
> process, sacrificing parallelism.

Message passing only makes explicit the communication that shared
memory is already doing.

Note that when I say "bottleneck", I mean that there's a global,
implementation-imposed limit on the parallel speedup a program can
achieve. So even if you have a ridiculously parallel problem, you
might not be able to get full utilization.

Any shared-memory design will face this as a fundamental problem.
Even languages that are purely functional at the user level may face
this problem if they need garbage collection.

>> 5. Software transactional memory. You have shared memory and threads,
>> and optimistic concurrency control to detect interference and
>> rollback and restart ruined computations. The expense is the
>> highest here, since the primitives have to do so much work, and
>> since their efficient behavior relies on very particular
>> assumptions about the workload.
>
> In the uncontended case the primitives don't do much, and measurements
> in the original papers show STM beats the traditional locking approach
> on some big SMP machines. Maybe there are specific workloads that can
> make STM lose to message passing.

Any time the assumptions underlying optimistic concurrency control
break, STM loses big time. The easiest way to lose is if you have any
long-lived transactions. This makes it quite dangerous to actually
*use* the compositionality of STM. :(

Stephen J. Bevan

unread,
Nov 19, 2007, 9:38:35 PM11/19/07
to
Paul Rubin <http://phr...@NOSPAM.invalid> writes:
> ste...@dino.dnsalias.com (Stephen J. Bevan) writes:
> I wasn't thinking of hard real time (e.g. motor control), just normal
> interactive programs, so most approaches should be ok.
>
>> 2. you write in a synchronous style and there is no easy way to do
>> other useful work while you are blocked waiting for a reply.
>
> Yes I think this is the situation I'm concerned with.

Concerned enough to describe an actual problem that can be coded up
and measured?


>> 3. You write in any syle but only have a few processes and they work
>> in a very synchronous manner.
>
> Right, the idea is to use as many processes as there are physical
> cpu's, to cut down on context switching.

My use of "process" here was ambigous. I'm using "process" to to mean
either a kernel-level process or a user-level process. The former are
what are scheduled by the kernel and will require a context switch and
a TLB flush on crappy hardware. The latter only exist at the
user-level and are totally under control of the application
(framework) as to when one is run. You want at least one kernel-level
process per CPU and these processes don't necessarily correspond to
anything in the application domain. If you design your application as
communicating sequential processes then you may have hundreds or even
thousands of user-level processes spread across the kernel level
processes. With that many user-level processes, there should always
be more than one ready with some work so the kernel process that hosts
them is never idle/blocked.

>> > I'll think about this. Maybe there's already some code around?
>> Maybe, I'm giving you the choice so that it can't be said that the
>> test is rigged in favour of message passing.
>
> But I'd look for one rigged in favor of shared memory, so it's better
> to look for real-world cases.

By all means offer one rigged in favour of shared memory, I'd like to
see it.


>> applications you are interested in are such that the amount of CPU you
>> consume processing the data is of the same order as it costs to lock
>> and unlock that data. If those are the types of application you are
>> interested in and high CPU utilization is a concern then message
>> passing [ is not for you -sjb ].

>
> I can't quite parse your last sentence, but yes, that's the cpu
> profile I'm imagining.

If you are spending the same amount of time locking/unlocking the data
as you actually spend working on it, then 50% of the time is taken in
locking and not doing real work. That's certainly full CPU utilization
but with only 50% being application work it doesn't seem like a good
tradeoff to me.


>> That's fine unless you need 8, 16, 32-way boxes to make your case for
>> pure shared memory.
>
> Fraser tested on something like a 100-way box so it's of concern.

Sure, it is a concern to people who own 100-way boxes, but it is going
to be quite a while before generic server or desktop code has to worry
about how to program them effectively. Thus, unless that's the
only domain you are interested in, it doesn't seem worth worrying
about yet.

Stephen J. Bevan

unread,
Nov 19, 2007, 9:54:52 PM11/19/07
to
Marshall <marshal...@gmail.com> writes:
> I've long thought that was a pretty interesting model, but the
> thing that always worries me is runaway threads. In practice,
> how much of an issue is that?

If you man "runaway" in that it uses too much CPU so that it blocks
the ability to service other messages then there are three obvious
solutions :-

1. adopt the real-time mindset when designing the application and
either don't write code that takes too much CPU or partition it so
that it can run on another CPU. The latter is what I do to handle
software based RSA/Diffie-Hellman in IKE/SSL. Hardware crypto can
be handled by the same API, the only difference is that the answer
comes back a lot quicker.

2. use a timer to interrupt the kernel-process and have it re-schedule
the user-level process. This can be done on an ad hoc basis in the
application but it is obviously simpler for the application
programmer if it is handled by the language runtime.

3. have the compiler generate checks on function entry to see if too
much real or virtual time has been used and if so, re-schedule.

I believe at least one implementation of Erlang uses 3.

Ulf Wiger

unread,
Nov 20, 2007, 4:42:26 AM11/20/07
to
ross...@ps.uni-sb.de wrote:
> On 18 Nov., 20:14, Lauri Alanko <l...@iki.fi> wrote:
>> In article <df5747cc-e7f0-4fbc-aba4-1835545d7...@o6g2000hsd.googlegroups.com>,
>>
>> So, a multithreaded SML implementation must not only ensure that
>> accesses to refs are atomic, but also that ordering of those
>> accesses is preserved during translation. Even then, there are quite a
>> number of subtleties that are not at all obvious. Just look at all the
>> work that the Java memory model required.
>
> The concurrent semantics of state has been formally specified for at
> least some concurrent extensions of ML (e.g. CML, Alice ML). Does such
> a spec even exist for Erlang? There are several multi-threaded ML
> implementations out there, and as far as I know, none has been plagued
> by the problems you describe.

I guess this would be a good starting point to investigate
what semantics have been specified for Erlang:

http://portal.acm.org/citation.cfm?id=1292520.1292528

In short, Lars-Åke Fredlund's PhD thesis from 2001 presented a formal
semantics for Erlang, sans the distributed messaging part.
This was complemented with distributed message passing semantics in 2005
by Koen Claessen and Hans Svensson, and then improved in 2007 by
Svensson and Fredlund. The improvements were mainly identification of
nasty failure cases, e.g. due to misbehaving communication channels.
(For the animated slide version, see
http://www.erlang.se/workshop/2007/proceedings/06svenss.ppt)

In my limited understanding of CML, focus was generally on coming
up with a concurrency model that would not violate the type semantics
of ML. Erlang's focus was decidedly different, as building commercial
products with in-service code change, heterogeneous distributed
environments and control of low-level devices were important initial
requirements. The functional aspects of Erlang were introduced as a
result of practical experimentation designing telephony control software
using a variety of different languages, representing different
programming paradigms.

There are of course drawbacks with such a pragmatic approach, but it did
have the great advantage of enabling the use of Erlang (such as it was
then) in commercial prototypes already in 1988. Perhaps noteworthy is
that "functional programming" was not really mentioned much in the
beginning - the focus was more on "declarative programming".

In our experience, which is much colored by the fact that we encounter
just about all the different concurrency challenges in our product
development, there are some different areas of concurrency that each
pose very different challenges to the language designer:

1) Hard real-time semantics. This is about meeting deadlines, and
failing to do so is considered a catastrophic event. The concurrency
model must ensure state reachability at all cost, and the price paid is
usually modeling complexity, as the full state-event matrix must be
considered. This is not my field, but I've observed that many of the
more innovative languages (e.g. Esterel) are quite declarative.

2) Soft real-time semantics. Here, statistically good behaviour is
sufficient, and it is not considered catastrophic if some jobs fail to
complete. On the other hand, the complexity of the system can often be
such that it defies strict modeling due to the enormous state space.
This is Erlang's domain, and I imagine that many interesting web service
applications will exhibit similar characteristics.

3) High-performance computing. This is about making the most of parallel
hardware (as far as concurrency is concerned), and techniques to enable
the compiler to exploit inherent parallelism are of course interesting.
As far as I understand, research in automatically exploiting parallelism
in traditional languages has produced bleak results, but perhaps modern,
high-level languages like Haskell and ML can offer abstractions that let
the programmer give the compiler just the right information to squeeze
the most performance possible out of the HW architecture.

Discussing concurrency without clearly describing the problem domain
becomes rather confusing, IMO. My area of expertise is (2), and I've
grown accustomed to people criticizing constructs in Erlang on the
grounds that they do not offer sufficient guarantees, but failing to
recognize that techniques that /do/ propose to offer guarantees do not
scale to the level of complexity at hand, or fail to consider real-world
failure scenarios in heterogeneous environments. The designers of Erlang
have had to find abstractions that offer decent trade-offs, and which
feel intuitive to programmers with little or no formal Comp Sci
training. I'm sure that better abstractions will come along, that
perhaps force the designer to fewer trade-offs.

Bringing the post back on topic (concurrency in the functional world),
we've found that functional programming and share-nothing
message-passing concurrency are a very good combination. The conceptual
model is basically CSP, for both Erlang and CML. One of the main
advantages of this model is that many concurrency algorithms found in
human organizations (where concurrency is dealt with continuously) map
readily onto a CSP model. This means that just about all designers can
draw on a wealth of personal experience dealing with tricky concurrency.
This pays off especially in (2) above, where such pragmatic solutions
are just the thing. This advantage has more to do with CSP than
functional programming. FP contributes by removing a lot of cruft in the
sequential programming parts, allowing the programmer to concentrate on
what is usually by far the most difficult problem in these systems.

BR,
Ulf W

Paul Rubin

unread,
Nov 20, 2007, 4:54:18 AM11/20/07
to
ste...@dino.dnsalias.com (Stephen J. Bevan) writes:
> > Yes I think this is the situation I'm concerned with.
> Concerned enough to describe an actual problem that can be coded up
> and measured?

Originally I'd been thinking of the relation gathering phase of the
number field sieve, but I don't understand that algorithm at all,
beyond having read some high level descriptions of how demanding it is
on a computer's memory system, so I won't attempt to describe it.

Per Florian Weimer's suggestion: how about a marking garbage
collector, maybe even one with pointer reversals. The main structure
in memory being traced is a tree with a few very long branches and a
number of shorter ones. It was allocated from mostly-but-not-entirely
contiguous chunks of memory which means that objects close together in
the link graph tend to be near each other in memory (i.e. good cache
locality if you don't jump around too much). I don't think that is
terribly unrealistic depending on the application. You have to
examine each node to see which one to visit next, so you need a quick
response to each query.
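
In (hypothetical) code, the access pattern is the classic pointer
chase, where the contents of each node decide which node to visit
next:

    -- Sketch: every step depends on the value just read, so memory
    -- latency sits squarely on the critical path.
    data Tree = Leaf Int | Node Int Tree Tree

    walk :: Tree -> Int
    walk (Leaf k)     = k
    walk (Node k l r) = walk (if even k then l else r)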

> My use of "process" here was ambigous. I'm using "process" to to mean
> either a kernel-level process or a user-level process.

OK.

> By all means offer one rigged in favour of shared memory, I'd like to
> see it.

See above.

> > Fraser tested on something like a 100-way box so it's of concern.
>
> Sure, it is a concern to people who own 100-way boxes, but it is going
> to be quite a while before generic server or desktop code has to worry
> about how to program them effectively.

I don't believe that. I'm typing this message with emacs, a program
still consisting mostly of code written over 20 years ago. I think
we'll have 100 core cpu's in the coming decades, running the code that
we're writing today. We already have 8-way boxes on desktops and
16-way in single chips in servers. Dates on calendar are closer than
they appear.

Vesa Karvonen

unread,
Nov 20, 2007, 5:22:10 AM11/20/07
to
Paul Rubin <http://phr...@nospam.invalid> wrote:
[...]

> > Sure, it is a concern to people who own 100-way boxes, but it is going
> > to be quite a while before generic server or desktop code has to worry
> > about how to program them effectively.

> I don't believe that. I'm typing this message with emacs, a program
> still consisting mostly of code written over 20 years ago.

Are you saying that Emacs runs too slowly for you? Which parts of
Emacs do you think would significantly benefit from massive
parallelism? Seriously, I think that adding parallelism to programs
that do not benefit from it as adding negative value (e.g. more
opportunities for really nasty bugs, reduced portability of code,
...). As shiny as the new hammer is, not everything is a nail.

-Vesa Karvonen

Paul Rubin

unread,
Nov 20, 2007, 5:25:26 AM11/20/07
to
Vesa Karvonen <vesa.k...@cs.helsinki.fi> writes:
> Are you saying that Emacs runs too slowly for you? Which parts of
> Emacs do you think would significantly benefit from massive
> parallelism?

No Emacs is fine. I'm saying programs we write for 4- and 8-way
systems today will be running on 100-way systems in a few years.

Paul Rubin

unread,
Nov 20, 2007, 5:41:56 AM11/20/07
to
Neelakantan Krishnaswami <ne...@cs.cmu.edu> writes:
> > This is really interesting, are you saying #1 uses something like
> > linear types to keep track of which data belongs to which process?
> Yes, that's right. That's what maintains strong process isolation.

Coooooooooooool!

> To make this communication fast, each CPU needs to be connected to all
> the others, which is an O(n^2) dependency. This is an awful lot of
> communication to require on every write, which is why hardware vendors
> have been steadily weakening the memory consistency guarantees they
> offer us programmers.

You've mentioned this before, and it's interesting, can you say more
about it?

> Another way of describing this is that the abstraction of global
> shared memory requires a lot of message passing under the hood --
> communication that a message passing architecture a) makes explicit,
> and b) potentially lets you avoid doing when you don't need to.

Right, it's that part b (about avoiding message passing) that seems
most promising to me. Message passing with a mutex is just a special
case of locking; it's operation of a lock, plus a transfer of control.
The question is if you can avoid the transfer of control (and latency,
loss of locality, etc) while keeping the code clean. I like to think
it's possible. If CSP is the best abstraction for programmers, that's
fine, but maybe compilers can turn the CSP code into mostly-serial
code (I don't mean with threadlets) the way they turn chained
recursive HOF's into straightforward imperative loops through
deforestation, CPS, etc.

> This state of affairs won't last, though, and it's unwise to base a
> language design on the idea it will always be there. Erlang is future
> proof in this regard -- the more I learn about it, the more its design
> impresses me. It's hard to make a language that can survive profound
> architectural changes with its performance model intact!

Well, if it starts out accepting a potentially huge performance hit...

> Message passing only makes explicit the communication that shared
> memory is already doing.

Right but hardware can do stuff a lot faster than software. I mean
taken to an extreme, it sounds almost like comparing software floating
point arithmetic with hardware floating point. Sure, the hardware is
doing the same thing under the hood, but it's orders of magnitude
faster. Even that n^2 dependency, I'm not sure how bad that is with
real workloads until n gets really large. There's a lot of
parallelism possible with simple hardware.

> Note that when I say "bottleneck", I mean that there's a global,
> implementation-imposed limit on the parallel speedup a program can
> achieve. So even if you have a ridiculously parallel problem, you
> might not be able to get full utilization.

DJ Bernstein likes to go on about how even the asymptotic performance
(e.g. the size of the exponent in O(n**k)) of a parallel algorithm can
beat a serial one using the same amount of hardware. So you can get
MORE than full utilization. This gets more interesting once we start
having large FPGA's in our general purpose desktop cpu's...

Vesa Karvonen

unread,
Nov 20, 2007, 6:24:32 AM11/20/07
to
Ulf Wiger, from the company that sells Erlang products:
[...]

> In my limited understanding of CML, focus was generally on coming
> up with a concurrency model that would not violate the type semantics
> of ML. [...]

To know for sure, you'd have to ask about this from John Reppy. But
let me just quote a sentence from the back of Reppy's book (Concurrent
Programming in ML [1]):

"The main focus of this book is on the practical use of concurrency
to implement naturally concurrent applications."

If you are just suggesting that the semantics of CML were designed
with the goal that they would be, as far as possible, compatible with
the semantics of SML, then that would be rather obvious, wouldn't it,
because CML was intentionally designed as an extension of SML. If you
actually think about it, that is a rather practical approach.

-Vesa Karvonen

[1] http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=9780521714723

ross...@ps.uni-sb.de

unread,
Nov 20, 2007, 6:29:52 AM11/20/07
to
On 20 Nov., 10:42, Ulf Wiger <ulf.wi...@e-r-i-c-s-s-o-n.com> wrote:

> rossb...@ps.uni-sb.de wrote:
> > On 18 Nov., 20:14, Lauri Alanko <l...@iki.fi> wrote:
>
> > The concurrent semantics of state has been formally specified for at
> > least some concurrent extensions of ML (e.g. CML, Alice ML). Does such
> > a spec even exist for Erlang? There are several multi-threaded ML
> > implementations out there, and as far as I know, none has been plagued
> > by the problems you describe.
>
> I guess this would be a good starting point to investigate
> what semantics have been specified for Erlang:
>
> http://portal.acm.org/citation.cfm?id=1292520.1292528
>
> In short, Lars-Åke Fredlund's PhD thesis from 2001 presented a formal
> semantics for Erlang, sans the distributed messaging part.

Ah, good to know. Thanks for the pointer.

With respect to the rest of your post: just to clarify, my remark
wasn't meant as a criticism of Erlang, or even suggest that something
like CML could compete with it on a larger scale of things (which I
don't believe). I only jumped on Lauri's specific claim about Erlang
providing more thread-safety for references than a concurrent ML,
which I think is false.

- Andreas

Jon Harrop

unread,
Nov 20, 2007, 6:55:29 AM11/20/07
to
Lauri Alanko wrote:
> In article
> <fa409b76-6c2d-47ad...@w34g2000hsg.googlegroups.com>,
> <ross...@ps.uni-sb.de> wrote:
>> On 18 Nov., 03:56, Lauri Alanko <l...@iki.fi> wrote:
>> Mh, I don't understand. In which sense is a traditional ML ref less
>> thread-safe?
>
> I simply meant that ML as a language...

There is no such thing: ML is a family of languages including Concurrent ML,
Alice ML, JoCaml and F# (all of which support concurrency).

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u

Jon Harrop

unread,
Nov 20, 2007, 7:17:50 AM11/20/07
to
Lauri Alanko wrote:
> In article <13k0ba4...@corp.supernews.com>,
> Jon Harrop <use...@jdh30.plus.com> wrote:
>> F#'s asynchronous workflows appear to provide what I described. In the
>> simplest case, you can just take the let binding of an expression:
>>
>> let x = ...
>>
>> and have it evaluated concurrently:
>>
>> let! x = ...
>
> Presumably x returns a future that will be forced only when x is
> actually inspected?

Exactly, yes.
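
A rough analogue in GHC Haskell, using the async library (a sketch of
the same idea, not F#'s actual mechanism):

    import Control.Concurrent.Async (async, wait)
    import Control.Exception (evaluate)

    main :: IO ()
    main = do
      x <- async (evaluate (sum [1 .. 1000000 :: Int]))  -- like: let! x = ...
      -- ... other useful work happens here ...
      result <- wait x   -- the future is forced only where it's used
      print result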

>> The code can still be purely functional. Any interaction between
>> processes/threads/whatever is completely abstracted away.
>
> Right. Similar constructs can be found in e.g. GHC and Alice.

Yes.

> But the reason these kinds of computations can be parallelized
> automatically is that there really is no "interaction" as such: the
> spawned process gets all its input upon startup, and it produces just
> a single value.

Exactly.

> This is vastly different from the situations where concurrency is
> required.

Then we have different definitions of "concurrency".

>> Well, it provides a "concurrent, high-performance library of strict,
>> segmented arrays, which we call package ndp".
>
> All right, granted, perhaps the terminology isn't followed as
> rigorously as I'd like. :) I'm using the terms as defined here:
>
> http://www.haskell.org/ghc/docs/latest/html/users_guide/lang-parallel.html
>
> And here:
>
> http://www.haskell.org/ghc/docs/latest/html/users_guide/using-smp.html
>
> There's a fine distinction between concurrency and
> parallelism: parallelism is all about making your program run
> faster by making use of multiple processors simultaneously.
> Concurrency, on the other hand, is a means of abstraction: it
> is a convenient way to structure a program that must respond
> to multiple asynchronous events.
>
> These are orthogonal concepts, since a multi-threaded program can be
> run on a single processor with no parallelism, and a threadless
> program can be run transparently on multiple processors (like in your
> example).
>
> It seems clear that you by this definition, you are talking about
> parallelism...

How are futures about parallelism and not concurrency?

>> I was thinking: if each data structure is owned by a thread, how do you
>> combine data structures to create new ones? Set union is one example of
>> doing that: I don't want to just fetch elements sequentially (that is
>> inefficient).
>
> If processes are located at the same node (like they are in
> traditional multithreaded programming), then data structures aren't
> "owned" by anyone: since they are immutable, references to them can be
> cheaply passed to other processes.
>
> The problem is that such usage of shared memory, even under the hood,
> becomes relatively more and more expensive all the time. Processors
> and cores get more and more complicated memory hierarchies, and access
> to the shared RAM requires locks and cache flushes and whatnot. This
> trend is likely to continue: processors aren't getting much faster any
> more, there are just going to be more and more of them.
>
> This is why message passing is now, more than ever, becoming the Right
> Way to model concurrency: not only is it semantically far easier to
> grasp than shared memory (especially when there are no transactions),
> but it also presents the costs more faithfully: communicating between
> processes _is_ becoming a very expensive operation. The
> message-passing style makes it explicit when you are doing it and
> encourages you to design your program to minimize inter-process
> communication.

That is one side of a double edged sword: message passing implementations
are slower by a constant factor but scale well to massive concurrency. That
constant factor is so large that it will remain significant for many years
to come. So it is premature to advise people to move to message passing
style concurrency in the interests of performance.

Vesa Karvonen

unread,
Nov 20, 2007, 7:46:21 AM11/20/07
to
ross...@ps.uni-sb.de wrote:
[...]

> With respect to the rest of your post: just to clarify, my remark
> wasn't meant as a criticism of Erlang, or even suggest that something
> like CML could compete with it on a larger scale of things (which I
> don't believe). [...]

An obvious technical difference between Erlang and CML is that CML is
primarily based on synchronous message passing which is not suitable
for (transparent) distributed computing as such, and this intentional
design choice and its implications are discussed in Reppy's book. So,
there is a difference in intention behind Erlang and CML, and I don't
think that they would be competing for precisely the same domains.

OTOH, Erlang has a big company behind it, so it is clear that a lot
more resources have been put into making Erlang a practical tool.
However, I don't see anything in the foundations of CML and Erlang
that would make CML fundamentally incapable of competing with Erlang
on a wide range of applications (particularly non-distributed
applications). CML's concurrency constructs (including first-class
events) are more than comparable with the expressiveness of Erlang's
concurrency constructs. Given more resources for libraries and
improving the implementations, I think that CML could give Erlang more
than a run for its money on many application domains. In particular,
some SML compilers seem to beat Erlang on sequential processing
benchmarks (factors between 1.3 and 74/102 for time/space). The two
current CML implementations (SML/NJ's and MLton's) also scale quite
well (SML/NJ's very well) in terms of supporting large numbers of threads
on a single processor machine. This would suggest that given a solid
multi-core CML implementation, it would be unsurprising to see such a
CML implementation beating Erlang (implementations) on a wide range of
concurrent and parallel programming benchmarks, whether contrived or
meaningful.

Of course, Andreas has been working on Alice ML, which is also an
extension of SML with support for concurrent programming among other
things and is very different from CML. So, could you elaborate on why
you believe that CML couldn't compete with Erlang? Is it because of
fundamental difference or mainly just about the marketing muscle?

-Vesa Karvonen

ross...@ps.uni-sb.de

unread,
Nov 20, 2007, 8:04:40 AM11/20/07
to
On Nov 20, 1:46 pm, Vesa Karvonen <vesa.karvo...@cs.helsinki.fi>
wrote:

>
> Of course, Andreas has been working on Alice ML, which is also an
> extensions of SML with support for concurrent programming among other
> things and is very different from CML. So, could you elaborate on why
> you believe that CML couldn't compete with Erlang? Is it because of
> fundamental difference or mainly just about the marketing muscle?

There is a variety of reasons, resources and (to a minor extent)
marketing certainly being at the top of the list. But there are
technical/psychological reasons, too. For example, Erlang is designed
such that there is a much more obvious way to design concurrent
programs, in particular, avoding state. With ML I fear that most
mediocre programmers (which make up the majority for markets like the
one Erlang is used for) would be too tempted to fall back to
conventional imperative-style programming with lots of spurious state.
Also, I never found CML's primitives particularly straightforward to
use, so that I would expect programmers having more difficulties
wrapping their heads around them, in comparison to Erlang's rather
conventional message passing / actor model with its somewhat OOish
(and thus "familiar") feel to it.

And just to be clear: I certainly wouldn't think that Alice ML could
compete. ;-)

- Andreas

Ulf Wiger

unread,
Nov 20, 2007, 8:02:30 AM11/20/07
to

Indeed, I didn't mean to suggest that it wasn't a good approach, and
even, as you suggest, a practical one. I just wanted to highlight the
different motivations between Erlang and CML, all the more interesting as
they evolved more or less in parallel, and with many similarities.

BR,
Ulf W

Jon Harrop

unread,
Nov 20, 2007, 8:14:15 AM11/20/07
to
Vesa Karvonen wrote:
> Let me clarify a bit. Most higher-level concurrent languages provide
> you with very cheap threads...

I can't think of any mainstream languages that provide that. What languages
are you thinking of?

Jon Harrop

unread,
Nov 20, 2007, 8:15:58 AM11/20/07
to
David B. Benson wrote:
> If you mean threads which are computationally intense, there is little
> problem, at least in CML. Once each 10 or 20 milliseconds (typical
> choices, but programmer adjustable), the ready queue is inspected and
> if non-empty, the top thread is enabled instead. So it is a non-issue.

10-20 ms is a long time in this context. So long that it could even adversely
affect user IO.

Jon Harrop

unread,
Nov 20, 2007, 8:16:58 AM11/20/07
to
Lauri Alanko wrote:
> In article <7xmytby...@ruckus.brouhaha.com>,

> Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
>> I thought I made it clear, I'm concerned about the latency of waiting
>> for an answer to the message.
>
> Why is latency such a concern for you? Are you thinking of real-time
> applications?

Real-time applications, or any algorithms whose implementations still
require fine-grained concurrency.

Vesa Karvonen

unread,
Nov 20, 2007, 8:34:12 AM11/20/07
to
Jon:

> Vesa Karvonen wrote:
> > Let me clarify a bit. Most higher-level concurrent languages provide
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

> > you with very cheap threads...

> I can't think of any mainstream languages that provide that.

Which mainstream languages would you classify as "higher-level
concurrent languages"?

-Vesa Karvonen

Ulf Wiger

unread,
Nov 20, 2007, 8:38:24 AM11/20/07
to
Vesa Karvonen wrote:
> OTOH, Erlang has a big company behind it, so it is clear that a lot
> more resources have been put into making Erlang a practical tool.

I guess the resources put into developing Erlang by Ericsson would
be viewed as considerable from an academic viewpoint. By industrial
standards, the Erlang/OTP team has been embarrassingly cheap. (:

I would say that rather than having lots of resources (which really
isn't true), Erlang has had the advantage of a decade's worth of
focused effort towards the goal of building the best middleware
possible for telecoms. Even a rather small team can accomplish
much in 10 years.


> However, I don't see anything in the foundations of CML and Erlang
> that would make CML fundamentally incapable of competing with Erlang
> on a wide range of applications (particularly non-distributed
> applications). CML's concurrency constructs (including first-class
> events) are more than comparable with the expressiveness of Erlang's
> concurrency constructs.

I agree. You should take a look at the fault tolerance aspects
of Erlang as well. Its approach is a logical extension of the
CSP concurrency model, and I see no reason why CML couldn't do
pretty much the same thing.


> So, could you elaborate on why you believe that CML couldn't
> compete with Erlang? Is it because of a fundamental difference
> or mainly just about marketing muscle?

Not that the question was directed at me, but I just want to
object to the "marketing muscle" part. Ericsson has never been
in the business of selling programming languages, and has never
marketed Erlang. It did issue a press release in 1993, when the
subsidiary Erlang Systems was formed. At the time, I was based
in Alaska and actively tried to buy an Erlang license and support
contract. What disturbed me at the time was the obvious lack of
commitment to Erlang as a product. I viewed the size of Ericsson
as an impediment, rather than an advantage - since it obviously
did /not/ intend to put its name behind Erlang, it would be
more likely than a small company to simply drop it at any time.

Erlang Systems was supposed to market Erlang, and did so for a
while, but as Ericsson started using Erlang on a broader scale
in 1996, practically all resources at Erlang Systems were
devoted to internal consulting. At that time I actually worked
in marketing at Erlang Systems, and thought that this was
exactly the kind of move that I'd been afraid of in '93, and
the type of attitude that kept others from taking Erlang
seriously.

I should clarify that I was frustrated for personal reasons,
not because I necessarily disagreed with the move. Ericsson
had never made any external commitment to Erlang, and didn't
break any promises by diverting efforts to internal
development. Erlang Systems really wasn't anything other
than a probe, and several Erlang-based products came out as
a result of the focused effort in '96. Besides, the marketing
effort wasn't going that well. Few companies were especially
interested in concurrency-oriented languages at the time.

Since then, there have been fewer than a handful of press releases
from Ericsson that even mention Erlang in passing. There was
even quite some discussion about whether Ericsson shouldn't
just keep Erlang for itself. That debate was settled when Erlang
was released as Open Source in 1998.

Any marketing of Erlang nowadays is likely to be by Erlang Training
& Consulting in London, which has no affiliation with Ericsson.

BR,
Ulf W

Stephen J. Bevan

unread,
Nov 20, 2007, 10:29:15 AM11/20/07
to
Paul Rubin <http://phr...@NOSPAM.invalid> writes:
> ste...@dino.dnsalias.com (Stephen J. Bevan) writes:
>> > Yes I think this is the situation I'm concerned with.
>> Concerned enough to describe an actual problem that can be coded up
>> and measured?
>
> Originally I'd been thinking of the relation gathering phase of the
> number field sieve, but I don't understand that algorithm at all,
> beyond having read some high level descriptions of how demanding it is
> on a computer's memory system, so I won't attempt to describe it.
>
> Per Florian Weimer's suggestion: how about a marking garbage
> collector, maybe even one with pointer reversals.
> [snip]

> I don't think that is terribly unrealistic depending on the application.

If the GC is operating on per-kernel-process memory then a sequential
GC implementation is sufficient. If the GC is operating on shared
memory then it means you've already made the decision that your
application will use shared memory, and so obviously the GC would be
written accordingly. Thus IMHO a GC for shared memory is not a good
example, since it presupposes the need for shared memory. A better
example would be whatever application you think needs to be written on
a system with shared memory (and GC).

> > Sure, it is a concern to people who own 100-way boxes, but it is going
> > to be quite a while before generic server or desktop code has to worry
> > about how to program them effectively.
>
> I don't believe that. I'm typing this message with emacs, a program
> still consisting mostly of code written over 20 years ago. I think
> we'll have 100-core CPUs in the coming decades, running the code that
> we're writing today.

It is quite likely that in 20 years there will still be an emacs and
various other programs we run today, but most won't utilize the 100-way
box in any useful way (except as 100 separate programs). The single
applications today that have a chance of being 100-way friendly are
those written in a message-passing style, especially those written
in a higher-level language where the language implementer can take on
the burden of updating the language runtime. Everything else will
require some level of application-level rewriting.

Vesa Karvonen

unread,
Nov 20, 2007, 11:37:41 AM11/20/07
to
ross...@ps.uni-sb.de wrote:
> On Nov 20, 1:46 pm, Vesa Karvonen <vesa.karvo...@cs.helsinki.fi>
> wrote:
> >
> > Of course, Andreas has been working on Alice ML, which is also an
> > extension of SML with support for concurrent programming among other
> > things and is very different from CML. So, could you elaborate on why
> > you believe that CML couldn't compete with Erlang? Is it because of
> > a fundamental difference, or mainly just about marketing muscle?

> There is a variety of reasons, resources and (to a minor extent)
> marketing certainly being at the top of the list.

Well, my use of the term "marketing muscle" was more of a rhetorical
device intended to draw attention to the relative availability of
resources (funding); I would estimate at least an order of magnitude
(10x) difference, probably much more.

[...]


> Erlang is designed such that there is a much more obvious way to
> design concurrent programs, in particular, avoiding state.

I agree that Erlang seems to have been designed with an obvious way to
write concurrent programs. CML is an extension of SML, a sequential
language, so there are some constructs that one should stay away from
(or at least use with care) when writing concurrent apps.

> With ML I fear that most mediocre programmers (who make up the
> majority for markets like the one Erlang is used for) would be too
> tempted to fall back on conventional imperative-style programming
> with lots of spurious state.

Perhaps, but it is impossible to know for sure. SML already strongly
encourages one to avoid state.

> Also, I never found CML's primitives particularly straightforward to
> use, so I would expect programmers to have more difficulty wrapping
> their heads around them

I'm not sure what you mean by not being particularly straightforward
to use. Perhaps you could elaborate briefly on that. For example,
what specific difficulties do you have in mind? Although I cannot
claim to have done extensive programming using CML in particular, I
do find programming in terms of events and the other primitives
offered by CML relatively straightforward.
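
To make this concrete, here is a minimal sketch of selective
communication with first-class events, assuming SML/NJ's CML library
(the two channels and the one-second timeout are made up for
illustration, not taken from any real application):

    (* Wait on whichever of two channels is ready first, or give
       up after one second.  Events are composed with choose/wrap
       and nothing communicates until the sync commits. *)
    fun serveOnce (intCh : int CML.chan, strCh : string CML.chan) =
        CML.sync (CML.choose
          [ CML.wrap (CML.recvEvt intCh,
                      fn n => "got int " ^ Int.toString n),
            CML.wrap (CML.recvEvt strCh,
                      fn s => "got string " ^ s),
            CML.wrap (CML.timeOutEvt (Time.fromSeconds 1),
                      fn () => "timed out") ])

The point of events being first class is that the choose expression
above is an ordinary value: it can be built in one place, passed
around, and combined further before anything actually happens.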

> compared to Erlang's rather conventional message passing / actor
> model with its somewhat OOish (and thus "familiar") feel to it.

This I find somewhat puzzling, because it is trivial to write such
(conventional message passing / actor style) code in CML. IOW, if you
find it difficult to "find the events" in your application, you can
always just program using mailboxes and use pattern matching over
messages instead of selective communication over events. You could
even introduce a tiny library to allow you to program almost exactly
like in Erlang.
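
For instance, here is a minimal sketch of that style, again assuming
SML/NJ's CML library (Mailbox and SyncVar are CML structures; the
counter actor itself is just an invented example):

    (* An Erlang-style actor: its state is a function argument and
       incoming messages are pattern matched, one at a time. *)
    datatype msg = Incr | Get of int SyncVar.ivar

    fun counter (mb : msg Mailbox.mbox) n =
        case Mailbox.recv mb of
            Incr    => counter mb (n + 1)
          | Get rep => (SyncVar.iPut (rep, n); counter mb n)

    fun start () =
        let val mb = Mailbox.mailbox ()
            val _  = CML.spawn (fn () => counter mb 0)
        in mb end

A client sends Incr fire-and-forget, Erlang style, and makes a
synchronous request by passing a fresh iVar in a Get message and
blocking on SyncVar.iGet.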

> And just to be clear: I certainly wouldn't think that Alice ML could
> compete. ;-)

In terms of things like performance and overall implementation
maturity or in terms of things like expressiveness and ease of
programming? I can understand that you don't want to give a false
impression of Alice, but I think that it has a lot of interesting
concepts and potential.

-Vesa Karvonen

George Neuner

unread,
Nov 20, 2007, 11:47:49 AM11/20/07
to
On Tue, 20 Nov 2007 13:14:15 +0000, Jon Harrop <use...@jdh30.plus.com>
wrote:

>Vesa Karvonen wrote:
>> Let me clarify a bit. Most higher-level concurrent languages provide
>> you with very cheap threads...
>
>I can't think of any mainstream languages that provide that. What languages
>are you thinking of?

Not mainstream, but more conventional ... Ada and Concurrent Pascal.

George
--
for email reply remove "/" from address

David B. Benson

unread,
Nov 20, 2007, 5:00:57 PM11/20/07
to
On Nov 20, 3:55 am, Jon Harrop <use...@jdh30.plus.com> wrote:
...
>
> There is no such thing: ML is a family of languages including Concurrent ML,
> Alice ML, JoCaml and F# (all of which support concurrency).

To set the record straight, Concurrent ML is a library
implemented in SML using call/cc, not a language with its
own syntax.

Joe Armstrong's book states that Erlang takes 3.5
microseconds to start a thread (on a 2.4 GHz machine).
On a machine of the same design (not the same machine,
but one here), CML threads take about 1.7 microseconds
to spawn.
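
For the curious, a rough way to reproduce such a measurement, assuming
SML/NJ's CML together with the Basis Timer/Time structures (this must
run inside RunCML.doit, and the absolute numbers will of course vary
with machine and load):

    (* Average real time per CML.spawn, in seconds. *)
    fun timeSpawns n =
        let val t = Timer.startRealTimer ()
            fun loop 0 = ()
              | loop i = (CML.spawn (fn () => ()); loop (i - 1))
        in loop n;
           Time.toReal (Timer.checkRealTimer t) / real n
        end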

On an earlier question, a good way to program in CML
is to send a message to another thread, the message
containing an iVar to hold the reply. The other
thread does an iPut with the required value and the
originating thread blocks waiting on an iGet until
the iPut is done. This is not only convenient and
very fast, but eliminates any deadly-embrace (deadlock)
possibilities which might otherwise occur.
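
In code the pattern looks roughly like this (a sketch assuming
SML/NJ's CML library; the request type and the adding server are
invented for the example):

    (* A request carries its argument and a fresh iVar for the reply. *)
    datatype request = Add of int * int * int SyncVar.ivar

    fun server (reqCh : request CML.chan) =
        let fun loop () =
                case CML.recv reqCh of
                    Add (x, y, reply) =>
                      (SyncVar.iPut (reply, x + y); loop ())
        in loop () end

    (* The client allocates the iVar, sends it with the request, and
       blocks on iGet until the server's iPut is done. *)
    fun call reqCh (x, y) =
        let val reply = SyncVar.iVar ()
        in CML.send (reqCh, Add (x, y, reply));
           SyncVar.iGet reply
        end

Because each iVar is fresh to its request and written exactly once,
there is no lock ordering to get wrong, which is presumably where the
freedom from deadly embraces comes from.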

I find designing concurrent apps easy in CML. I do
find coding the designs to be a bit tedious. On the
other side of that, the resulting code is easy to read
and understand, even if I haven't looked at the code
for years.

I do not agree that concurrency is to be avoided for
efficiency reasons. Rather, I encourage concurrent
programming as much as is sensible. That is, design
comprehensible, understandable 'agents' which cooperate
to solve the problem. CML is so fast that if you are
already prepared to accept the performance penalty of
SML/NJ or MLton you'll not notice the additional cost
for sensible designs.

David B. Benson

unread,
Nov 20, 2007, 5:20:31 PM11/20/07
to
On Nov 20, 5:15 am, Jon Harrop <use...@jdh30.plus.com> wrote:
...
> 1-2 cs (i.e. 10-20 ms) is a long time in this context. So long that it
> could even adversely affect user IO.

I don't find it to do so.

Furthermore, my (possibly out-of-date) understanding
is that ordinary versions of Linux only send a timer
interrupt once every 10 milliseconds. So no finer
granularity is possible without resorting to research
versions of Linux, which provide a timer interrupt once
every 2 milliseconds.

Jon Harrop

unread,
Nov 20, 2007, 10:08:52 PM11/20/07
to
Vesa Karvonen wrote:
> Which mainstream languages would you classify as "higher-level
> concurrent languages"?

Java and C#?

Markus E L

unread,
Nov 20, 2007, 4:57:54 PM11/20/07
to

George Neuner wrote:

> On Tue, 20 Nov 2007 13:14:15 +0000, Jon Harrop <use...@jdh30.plus.com>
> wrote:
>
>>Vesa Karvonen wrote:
>>> Let me clarify a bit. Most higher-level concurrent languages provide
>>> you with very cheap threads...
>>
>>I can't think of any mainstream languages that provide that. What languages
>>are you thinking of?
>
> Not mainstream, but more conventional ... Ada and Concurrent Pascal.

Java. Or aren't they cheap enough? :-)

- Markus

Vesa Karvonen

unread,
Nov 21, 2007, 5:31:13 AM11/21/07
to
Jon:

> Vesa Karvonen wrote:
> > Which mainstream languages would you classify as "higher-level
> > concurrent languages"?

> Java and C#?

Does the question mark indicate that you are unsure of how to classify
Java and C#? Perhaps there is a reason for that.

-Vesa Karvonen

Jon Harrop

unread,
Nov 21, 2007, 2:51:56 PM11/21/07
to
Vesa Karvonen wrote:
> Does the question mark indicate that you are unsure of how to classify
> Java and C#? Perhaps there is a reason for that.

They are mainstream high-level languages providing concurrency with only
heavy-weight threads.

So when you said "Most higher-level concurrent languages provide
you with very cheap threads", what were you thinking of?

George Neuner

unread,
Nov 21, 2007, 4:03:06 PM11/21/07
to
On Tue, 20 Nov 2007 22:57:54 +0100, Markus E L
<development-2006-8...@ANDTHATm-e-leypold.de> wrote:

There's a surprising amount of overhead in Java threads - much more so
than most people realize. A lot of it has to do with implementing the
security model and synchronized (monitor style) object locking. Java
threads are closer in weight to OS threads than to register-context
threads such as in Ada or CAML or Concurrent ML.

There is also a lot of variation in implementation: some JVMs use green
threads, others OS threads, and still others M:N threads (green on
OS). There isn't a whole lot of difference, performance-wise, between
the various thread models - differences in JVM performance come mostly
from differences in architecture, compiler and, particularly, lock
optimizations. But the thread model affects whether you can run code
that quickly creates lots of threads. You may be able to create a
million short-lived threads with no problem on one system and the same
code may blow up in your face on another.

[Not that I think creating jillions of threads is a good thing - for
the most part I believe threading, where possible, should be left to
compilers ... utility notwithstanding, the programmer thread model is
prone to overuse and to stupid mistakes.]

David B. Benson

unread,
Nov 21, 2007, 4:14:24 PM11/21/07
to
On Nov 21, 1:03 pm, George Neuner <gneuner2/@/comcast.net> wrote:
> ... the programmer thread model is
> prone to overuse and to stupid mistakes.

I disagree, although it is true that one must learn some
rules of designing concurrent programs before attempting
this mode of development. The 'agent' model solves
most issues of coordination which can easily arise in
developing sizable projects without concurrency.

George Neuner

unread,
Nov 21, 2007, 5:08:07 PM11/21/07
to

Timed scheduler interrupts are not really an issue for RT code because
tasks are prioritized and expected to run to completion unless
interrupted by higher priority tasks. There usually aren't a whole
lot of tasks with equal priority lying around to be timesliced.

Jon mentioned IO - which means devices. In systems that can support
the model, it is perfectly reasonable for interrupt handlers to be
written as threads. Basically there is a very simple low level
handler that just responds to the interrupt and takes some action that
unblocks a high(er) level handler thread - which does everything else
required.

Another reasonable model is 2-level, a synchronous interrupt handler
that maintains a small data buffer and reads/writes the device
directly, and a server (which may or may not be a thread) that streams
data to/from the handler and presents the device API to the world.

In either kind of system it is important to know how long the
interrupt will take to propagate to the device handler. How the
components communicate in the 2-level model is not terribly important,
but it is important to know how quickly the server will be scheduled
when the device handler needs its attention.
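
A very rough sketch of that two-level shape, in CML terms since that
is what this thread is about (readDevice stands in for the synchronous
handler's buffer interface; nothing here is a real driver API):

    (* Low-level side: respond to the device, hand the data off,
       repeat.  The server side is whichever thread does CML.recv
       on the channel; that recv is exactly the point where it
       gets unblocked. *)
    fun startDriver (readDevice : unit -> Word8Vector.vector) =
        let val dataCh : Word8Vector.vector CML.chan = CML.channel ()
            fun lowLevel () =
                (CML.send (dataCh, readDevice ()); lowLevel ())
            val _ = CML.spawn lowLevel
        in dataCh end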

Vesa Karvonen

unread,
Nov 22, 2007, 3:08:03 AM11/22/07
to
George Neuner <gneuner2/@/comcast.net> wrote:
> On Tue, 20 Nov 2007 14:20:31 -0800 (PST), "David B. Benson"
> <dbe...@eecs.wsu.edu> wrote:
> >On Nov 20, 5:15 am, Jon:

> >...
> >> 1-2 cs (i.e. 10-20 ms) is a long time in this context. So long that it could even adversely
> >> affect user IO.
^^^^
[...]
> Jon mentioned IO - which means devices. [...]
^^^^^^^

I doubt that you are talking about the same issues.

-Vesa Karvonen
