Concurrent GC: a good thing?

Jon Harrop

unread,

Jun 11, 2007, 3:39:30 PM6/11/07

to

I have been comparing OCaml and F# recently and .NET's support for
concurrency made F# look preferable.

I had thought that it was not possible to write concurrent programs in OCaml
because its GC is not concurrent. However, I recently realised that you can
actually write concurrent OCaml programs very easily by forking a Unix
process and marshalling the resulting value or exception back to the
parent.

For example, the following "invoke" function spawns a child process
computing f(x) and returns a closure that blocks until the result is ready
and returns it:

let invoke (f : 'a -> 'b) x : unit -> 'b =
let input, output = Unix.pipe() in
match Unix.fork() with
| -1 -> (let v = f x in fun () -> v)
| 0 ->
Unix.close input;
let output = Unix.out_channel_of_descr output in
Marshal.to_channel output (try `Res(f x) with e -> `Exn e) [];
close_out output;
exit 0
| pid ->
Unix.close output;
let input = Unix.in_channel_of_descr input in
fun () ->
let v = Marshal.from_channel input in
ignore (Unix.waitpid [] pid);
close_in input;
match v with
| `Res x -> x
| `Exn e -> raise e

This is quite a different approach to concurrency and, in particular, each
new process has its own GC.

I have worked this into my ray tracer benchmark and, sure enough, it doubles
performance on my dual-core machine.

In contrast, a parallel F# implementation does not double performance,
perhaps because of context switching between 3 threads (two worker and one
GC) on 2 CPUs.

Given OCaml remains both simpler and faster in all cases, is concurrent GC
worth having? Can it scale as well? Can it be as efficient?

--
Dr Jon D Harrop, Flying Frog Consultancy
OCaml for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists/?usenet

Daniel C. Wang

unread,

Jun 12, 2007, 12:25:30 AM6/12/07

to Jon Harrop

Jon Harrop wrote:
> I have been comparing OCaml and F# recently and .NET's support for
> concurrency made F# look preferable.
>
> I had thought that it was not possible to write concurrent programs in OCaml
> because its GC is not concurrent. However, I recently realised that you can
> actually write concurrent OCaml programs very easily by forking a Unix
> process and marshalling the resulting value or exception back to the
> parent.

{stuff deleted}

> Given OCaml remains both simpler and faster in all cases, is concurrent GC
> worth having? Can it scale as well? Can it be as efficient?

So long as the two threads don't share any common mutable state. I think
two separate processes with their own single-threaded GC will always be
more efficient than a concurrent GC running two threads. The general
case of GCing concurrent threads that share mutable state is
fundamentally more difficult than having two threads that share no state
with separate GCs.

Paul Rubin

unread,

Jun 12, 2007, 12:39:42 AM6/12/07

to

Jon Harrop <j...@ffconsultancy.com> writes:
> Given OCaml remains both simpler and faster in all cases, is concurrent GC
> worth having? Can it scale as well? Can it be as efficient?

Of course. Look at the cost of marshalling those large structures
between processes.

Jon Harrop

unread,

Jun 12, 2007, 12:44:48 AM6/12/07

to

Daniel C. Wang wrote:
> So long as the two threads don't share any common mutable state. I think
> two separate processes with their own single-threaded GC will always be
> more efficient than a concurrent GC running two threads. The general
> case of GCing concurrent threads that share mutable state is
> fundamentally more difficult than having two threads that share no state
> with separate GCs.

So concurrent GC makes communication cheaper at the expense of everything
else?

Vityok

unread,

Jun 12, 2007, 2:52:12 AM6/12/07

to

Jon Harrop :

> Daniel C. Wang wrote:
> > So long as the two threads don't share any common mutable state. I think
> > two separate processes with their own single-threaded GC will always be
> > more efficient than a concurrent GC running two threads. The general
> > case of GCing concurrent threads that share mutable state is
> > fundamentally more difficult than having two threads that share no state
> > with separate GCs.
>
> So concurrent GC makes communication cheaper at the expense of everything
> else?

It seems to me, that marshalling can be worked around by implementing
specific serializer/deserializer functions for data structures being
processed.

Thanks to the binary I/O API of the ExtLib[1], I was able to reduce
the size of data produced by marshalling records. Though, I did not
benchmark performance.

[1] http://ocaml-lib.sourceforge.net/doc/IO.html

With best regards,
Victor Anyakin

Paul Rubin

unread,

Jun 12, 2007, 3:05:21 AM6/12/07

to

Vityok <bob...@ua.fm> writes:
> It seems to me, that marshalling can be worked around by implementing
> specific serializer/deserializer functions for data structures being
> processed.

I'm more familiar with Haskell than ML so forgive me for using Haskell
notation. Let's say I have a list of pairs (a,b):

x :: [(Integer, Integer)]

I want to add up all the a's, and I also want to add up all the b's:

sum(map fst x)
sum(map snd x)

Because x is a very long list, I want to use separate threads or
processes for these summations, to do them in parallel.

Can any method involving serialization do anywhere near as well as
just letting both threads access x directly?

Daniel C. Wang

unread,

Jun 12, 2007, 3:54:32 AM6/12/07

to Jon Harrop

Jon Harrop wrote:
> Daniel C. Wang wrote:
>> So long as the two threads don't share any common mutable state. I think
>> two separate processes with their own single-threaded GC will always be
>> more efficient than a concurrent GC running two threads. The general
>> case of GCing concurrent threads that share mutable state is
>> fundamentally more difficult than having two threads that share no state
>> with separate GCs.
>
> So concurrent GC makes communication cheaper at the expense of everything
> else?
>

That's my basic claim. Note that it only makes shared memory
communication cheaper. If you're going to distribute your job across
machines. It doesn't even help! For the large class of embarrassingly
parallel apps that have little communication costs. The overhead of
concurrent GC which often requires the addition either a read or write
barrier along with some mutator synchronization method. It doesn't seem
worth the trouble, and if you really have a big job you're going to
distribute anyway, so exploiting shared memory isn't going to buy you much.

Daniel C. Wang

unread,

Jun 12, 2007, 4:19:47 AM6/12/07

to

If you have a single list producer that sends the components of each
pair to two consumers you're just marshalling an int. So if you
marshaling on demand in a lazy fashion I see no reason to see why
sharing "x" would be more efficient.

Remember in a shared memory system the data goes over the memory bus to
the cache of each processor. The shared memory model is really just an
interface to message passing hardware. In the end it's all about bits
going over wires. There's nothing magical about shared memory.

I can always decompose my app into a "memory server" that reads
addresses and returns values, and a bunch of concurrent threads that
request locations and receive values back. This will give me the same
algorithmic complexity of any shared memory system.

Of course you could decompose your app in a smarter way and get
potentially better algorithmic complexity and scalability. However, if
you use shared memory you're basically limited to the "memory sever"
model I described. Of course the hardware does a good job of making it
fast. But for really parallel apps you hit a scaling limit and end up
rolling your own message passing in the end to get beyond it.

MPI exists for a reason. BTW this message is from the CAML list is
illustrative of my point.

http://caml.inria.fr/pub/ml-archives/caml-list/2003/07/155910c4eeb09e684f02ea4ae342873b.en.html

Tony Finch

unread,

Jun 12, 2007, 6:52:27 AM6/12/07

to

Erlang uses copying to pass messages, even for co-operative threads within
a single OS process, and it isn't crippling. Erlang has a separate GCed
heap for each thread which makes dealing with garbage cheap: small heaps,
often short-lived, are easier to deal with than big sprawling ones. It
also translates well to SMP and distributed settings.

The Erlang system can be run in an alternative mode with a shared heap
and pointer passing for messages, or a hybrid mode where allocated data
is placed on the shared or per-process heap depending on whether static
analysis predicts that it'll be part of a message.

Tony.
--
f.a.n.finch <d...@dotat.at> http://dotat.at/
ROCKALL MALIN: EAST OR NORTHEAST BECOMING VARIABLE 3 OR 4. SLIGHT OCCASIONALLY
MODERATE. SHOWERS, FOG PATCHES. MODERATE OR GOOD, OCCASIONALLY VERY POOR.

Jon Harrop

unread,

Jun 12, 2007, 10:13:16 AM6/12/07

to

That is embarassingly parallel so it needs no serialization => forking will
be faster than concurrent GC.

Indeed, I'm hard pressed to find a practically-important application that
will be better off with a concurrent GC...

Daniel C. Wang

unread,

Jun 12, 2007, 11:05:04 AM6/12/07

to Jon Harrop

Jon Harrop wrote:
{stuff deleted}

> That is embarassingly parallel so it needs no serialization => forking will
> be faster than concurrent GC.
>
> Indeed, I'm hard pressed to find a practically-important application that
> will be better off with a concurrent GC...

I imagine there are apps where a shared heap can get you a non-trivial
constant factor. However, if you care about scalability the concurrent
GC of a shared heap can only get in your way. I suspect you can prove
the statement above in a formal way.

Ulf Wiger

unread,

Jun 13, 2007, 5:55:58 PM6/13/07

to

>>>>> "Jon" == Jon Harrop <j...@ffconsultancy.com> writes:

Jon> Given OCaml remains both simpler and faster in all cases, is
Jon> concurrent GC worth having? Can it scale as well? Can it be as
Jon> efficient?

Keep in mind that unix processes are quite heavy. If your
problem calls for lots of processes and e.g. client-server
patterns, you will suffer so much overhead wielding the
processes that multiple cores will not save you.

So if you have a program that can be partitioned into
few fairly large chunks which can run independently of
each other, then by all means, this will work well.

You *could* say that Erlang uses roughly the same strategy
but with processes that are orders of magnitude cheaper
than unix processes. GC is done per process, and
it has been demonstrated to scale very well with multicore.

BR,
Ulf W
--
Ulf Wiger, Senior Specialist,
/ / / Architecture & Design of Carrier-Class Software
/ / / Team Leader, Software Characteristics
/ / / Ericsson AB, IMS Gateways

Paul Rubin

unread,

Jun 14, 2007, 12:33:08 AM6/14/07

to

"Daniel C. Wang" <danw...@gmail.com> writes:
> Remember in a shared memory system the data goes over the memory bus
> to the cache of each processor. The shared memory model is really just
> an interface to message passing hardware. In the end it's all about
> bits going over wires. There's nothing magical about shared memory.

The difference is it doesn't need a bunch of extra instructions to
serialize and deserialize the int, context switches to communicate
between processes, etc.

> I can always decompose my app into a "memory server" that reads
> addresses and returns values, and a bunch of concurrent threads that
> request locations and receive values back. This will give me the same
> algorithmic complexity of any shared memory system.

Sure, and concurrent GC will have the same algorithmic complexity as
single-threaded GC. But for real problems, the constant factors
matter.

For that matter, there are different measures of algorithmic
complexity. There is time and space complexity which are familiar,
but there is also dollar complexity (the asymptotic cost in dollars of
solving a problem instance of size N, as a function of N). The dollar
complexity significantly depends on the machine architecture
(e.g. whether there is shared memory, and how it works). Of course
this is obvious: an O(N) algorithm on a conventional computer might be
O(N**2) on a 1-tape Turing machine of the same hardware cost and
depreciation schedule, so the conventional computer can solve more
problems per dollar. Even fancier memory architecture turns out to
asymptotically beat conventional computers in dollars per solution on
some problems, as observed in Bernstein's famous paper on NFS circuits:

http://cr.yp.to/factorization.html#nfscircuit

(see remark at top of page 5.) I don't see an obvious argument that a
shared-memory multiprocessor computer can't _asymptotically_ beat a
non-shared multicomputer system of the same hardware cost, as a
function of increasing hardware cost.

Daniel C. Wang

unread,

Jun 14, 2007, 1:34:41 AM6/14/07

to

Paul Rubin wrote:
{stuff deleted}

> I don't see an obvious argument that a
> shared-memory multiprocessor computer can't _asymptotically_ beat a
> non-shared multicomputer system of the same hardware cost, as a
> function of increasing hardware cost.

I don't see any argument either. However, I will observer that a
supercomputer these days is a bunch of commodity CPUs connected via
Ethernet.

http://top500.org/lists/2006/11
http://top500.org/stats/28/connfam/

A lot of my misspent "youth" in grad school was spent listening to
people talk about how wonderful distributed shared memory was. I'd
always ask annoying questions like "what's wrong with MPI". We'll the
dust has settled and DSM and NUMA machines doing HPC apps for the
military, have been replaced by Beowulf clusters running Monte Carlo
simulations for hedge funds. Go figure.

Ulf Wiger

unread,

Jun 14, 2007, 3:57:21 PM6/14/07

to

>>>>> "Paul" == Paul Rubin <http://phr...@NOSPAM.invalid> writes:

Paul> For that matter, there are different measures of algorithmic
Paul> complexity. There is time and space complexity which are
Paul> familiar, but there is also dollar complexity (the asymptotic
Paul> cost in dollars of solving a problem instance of size N, as a
Paul> function of N).

The cost factor is often forgotten. It's so easy to think that
you can just throw hardware at the problem, since hardware is
so cheap.

I've learned to consider manufacturing cost a very important
factor when delivering systems - partly because, in our
business, we always deliver both hardware and software.
Once written, the distribution cost of software is
insignificant, but the distribution cost of hardware eats
right into your profit margin.

To make matters worse, our hardware needs to be NEBS3 compliant,
which means it isn't even cheap anymore.

http://en.wikipedia.org/wiki/NEBS