I have been comparing OCaml and F# recently and .NET's support for concurrency made F# look preferable.
I had thought that it was not possible to write concurrent programs in OCaml because its GC is not concurrent. However, I recently realised that you can actually write concurrent OCaml programs very easily by forking a Unix process and marshalling the resulting value or exception back to the parent.
For example, the following "invoke" function spawns a child process computing f(x) and returns a closure that blocks until the result is ready and returns it:
let invoke (f : 'a -> 'b) x : unit -> 'b = let input, output = Unix.pipe() in match Unix.fork() with | -1 -> (let v = f x in fun () -> v) | 0 -> Unix.close input; let output = Unix.out_channel_of_descr output in Marshal.to_channel output (try `Res(f x) with e -> `Exn e) []; close_out output; exit 0 | pid -> Unix.close output; let input = Unix.in_channel_of_descr input in fun () -> let v = Marshal.from_channel input in ignore (Unix.waitpid [] pid); close_in input; match v with | `Res x -> x | `Exn e -> raise e
This is quite a different approach to concurrency and, in particular, each new process has its own GC.
I have worked this into my ray tracer benchmark and, sure enough, it doubles performance on my dual-core machine.
In contrast, a parallel F# implementation does not double performance, perhaps because of context switching between 3 threads (two worker and one GC) on 2 CPUs.
Given OCaml remains both simpler and faster in all cases, is concurrent GC worth having? Can it scale as well? Can it be as efficient?
Jon Harrop wrote: > I have been comparing OCaml and F# recently and .NET's support for > concurrency made F# look preferable.
> I had thought that it was not possible to write concurrent programs in OCaml > because its GC is not concurrent. However, I recently realised that you can > actually write concurrent OCaml programs very easily by forking a Unix > process and marshalling the resulting value or exception back to the > parent. {stuff deleted} > Given OCaml remains both simpler and faster in all cases, is concurrent GC > worth having? Can it scale as well? Can it be as efficient?
So long as the two threads don't share any common mutable state. I think two separate processes with their own single-threaded GC will always be more efficient than a concurrent GC running two threads. The general case of GCing concurrent threads that share mutable state is fundamentally more difficult than having two threads that share no state with separate GCs.
Jon Harrop <j...@ffconsultancy.com> writes: > Given OCaml remains both simpler and faster in all cases, is concurrent GC > worth having? Can it scale as well? Can it be as efficient?
Of course. Look at the cost of marshalling those large structures between processes.
Daniel C. Wang wrote: > So long as the two threads don't share any common mutable state. I think > two separate processes with their own single-threaded GC will always be > more efficient than a concurrent GC running two threads. The general > case of GCing concurrent threads that share mutable state is > fundamentally more difficult than having two threads that share no state > with separate GCs.
So concurrent GC makes communication cheaper at the expense of everything else?
> Daniel C. Wang wrote: > > So long as the two threads don't share any common mutable state. I think > > two separate processes with their own single-threaded GC will always be > > more efficient than a concurrent GC running two threads. The general > > case of GCing concurrent threads that share mutable state is > > fundamentally more difficult than having two threads that share no state > > with separate GCs.
> So concurrent GC makes communication cheaper at the expense of everything > else?
It seems to me, that marshalling can be worked around by implementing specific serializer/deserializer functions for data structures being processed.
Thanks to the binary I/O API of the ExtLib[1], I was able to reduce the size of data produced by marshalling records. Though, I did not benchmark performance.
Vityok <bob...@ua.fm> writes: > It seems to me, that marshalling can be worked around by implementing > specific serializer/deserializer functions for data structures being > processed.
I'm more familiar with Haskell than ML so forgive me for using Haskell notation. Let's say I have a list of pairs (a,b):
x :: [(Integer, Integer)]
I want to add up all the a's, and I also want to add up all the b's:
sum(map fst x) sum(map snd x)
Because x is a very long list, I want to use separate threads or processes for these summations, to do them in parallel.
Can any method involving serialization do anywhere near as well as just letting both threads access x directly?
Jon Harrop wrote: > Daniel C. Wang wrote: >> So long as the two threads don't share any common mutable state. I think >> two separate processes with their own single-threaded GC will always be >> more efficient than a concurrent GC running two threads. The general >> case of GCing concurrent threads that share mutable state is >> fundamentally more difficult than having two threads that share no state >> with separate GCs.
> So concurrent GC makes communication cheaper at the expense of everything > else?
That's my basic claim. Note that it only makes shared memory communication cheaper. If you're going to distribute your job across machines. It doesn't even help! For the large class of embarrassingly parallel apps that have little communication costs. The overhead of concurrent GC which often requires the addition either a read or write barrier along with some mutator synchronization method. It doesn't seem worth the trouble, and if you really have a big job you're going to distribute anyway, so exploiting shared memory isn't going to buy you much.
If you have a single list producer that sends the components of each pair to two consumers you're just marshalling an int. So if you marshaling on demand in a lazy fashion I see no reason to see why sharing "x" would be more efficient.
Remember in a shared memory system the data goes over the memory bus to the cache of each processor. The shared memory model is really just an interface to message passing hardware. In the end it's all about bits going over wires. There's nothing magical about shared memory.
I can always decompose my app into a "memory server" that reads addresses and returns values, and a bunch of concurrent threads that request locations and receive values back. This will give me the same algorithmic complexity of any shared memory system.
Of course you could decompose your app in a smarter way and get potentially better algorithmic complexity and scalability. However, if you use shared memory you're basically limited to the "memory sever" model I described. Of course the hardware does a good job of making it fast. But for really parallel apps you hit a scaling limit and end up rolling your own message passing in the end to get beyond it.
MPI exists for a reason. BTW this message is from the CAML list is illustrative of my point.
Paul Rubin wrote: > Vityok <bob...@ua.fm> writes: >> It seems to me, that marshalling can be worked around by implementing >> specific serializer/deserializer functions for data structures being >> processed.
> I'm more familiar with Haskell than ML so forgive me for using Haskell > notation. Let's say I have a list of pairs (a,b):
> x :: [(Integer, Integer)]
> I want to add up all the a's, and I also want to add up all the b's:
> sum(map fst x) > sum(map snd x)
> Because x is a very long list, I want to use separate threads or > processes for these summations, to do them in parallel.
> Can any method involving serialization do anywhere near as well as > just letting both threads access x directly?
Paul Rubin <http://phr...@NOSPAM.invalid> wrote: >Jon Harrop <j...@ffconsultancy.com> writes: >> Given OCaml remains both simpler and faster in all cases, is concurrent GC >> worth having? Can it scale as well? Can it be as efficient?
>Of course. Look at the cost of marshalling those large structures >between processes.
Erlang uses copying to pass messages, even for co-operative threads within a single OS process, and it isn't crippling. Erlang has a separate GCed heap for each thread which makes dealing with garbage cheap: small heaps, often short-lived, are easier to deal with than big sprawling ones. It also translates well to SMP and distributed settings.
The Erlang system can be run in an alternative mode with a shared heap and pointer passing for messages, or a hybrid mode where allocated data is placed on the shared or per-process heap depending on whether static analysis predicts that it'll be part of a message.
Tony. -- f.a.n.finch <d...@dotat.at> http://dotat.at/ ROCKALL MALIN: EAST OR NORTHEAST BECOMING VARIABLE 3 OR 4. SLIGHT OCCASIONALLY MODERATE. SHOWERS, FOG PATCHES. MODERATE OR GOOD, OCCASIONALLY VERY POOR.
Paul Rubin wrote: > Vityok <bob...@ua.fm> writes: >> It seems to me, that marshalling can be worked around by implementing >> specific serializer/deserializer functions for data structures being >> processed.
> I'm more familiar with Haskell than ML so forgive me for using Haskell > notation. Let's say I have a list of pairs (a,b):
> x :: [(Integer, Integer)]
> I want to add up all the a's, and I also want to add up all the b's:
> sum(map fst x) > sum(map snd x)
> Because x is a very long list, I want to use separate threads or > processes for these summations, to do them in parallel.
> Can any method involving serialization do anywhere near as well as > just letting both threads access x directly?
That is embarassingly parallel so it needs no serialization => forking will be faster than concurrent GC.
Indeed, I'm hard pressed to find a practically-important application that will be better off with a concurrent GC...
> That is embarassingly parallel so it needs no serialization => forking will > be faster than concurrent GC.
> Indeed, I'm hard pressed to find a practically-important application that > will be better off with a concurrent GC...
I imagine there are apps where a shared heap can get you a non-trivial constant factor. However, if you care about scalability the concurrent GC of a shared heap can only get in your way. I suspect you can prove the statement above in a formal way.