I have been comparing OCaml and F# recently and .NET's support for concurrency made F# look preferable.
I had thought that it was not possible to write concurrent programs in OCaml because its GC is not concurrent. However, I recently realised that you can actually write concurrent OCaml programs very easily by forking a Unix process and marshalling the resulting value or exception back to the parent.
For example, the following "invoke" function spawns a child process computing f(x) and returns a closure that blocks until the result is ready and returns it:
let invoke (f : 'a -> 'b) x : unit -> 'b = let input, output = Unix.pipe() in match Unix.fork() with | -1 -> (let v = f x in fun () -> v) | 0 -> Unix.close input; let output = Unix.out_channel_of_descr output in Marshal.to_channel output (try `Res(f x) with e -> `Exn e) []; close_out output; exit 0 | pid -> Unix.close output; let input = Unix.in_channel_of_descr input in fun () -> let v = Marshal.from_channel input in ignore (Unix.waitpid [] pid); close_in input; match v with | `Res x -> x | `Exn e -> raise e
This is quite a different approach to concurrency and, in particular, each new process has its own GC.
I have worked this into my ray tracer benchmark and, sure enough, it doubles performance on my dual-core machine.
In contrast, a parallel F# implementation does not double performance, perhaps because of context switching between 3 threads (two worker and one GC) on 2 CPUs.
Given OCaml remains both simpler and faster in all cases, is concurrent GC worth having? Can it scale as well? Can it be as efficient?
Jon Harrop wrote: > I have been comparing OCaml and F# recently and .NET's support for > concurrency made F# look preferable.
> I had thought that it was not possible to write concurrent programs in OCaml > because its GC is not concurrent. However, I recently realised that you can > actually write concurrent OCaml programs very easily by forking a Unix > process and marshalling the resulting value or exception back to the > parent. {stuff deleted} > Given OCaml remains both simpler and faster in all cases, is concurrent GC > worth having? Can it scale as well? Can it be as efficient?
So long as the two threads don't share any common mutable state. I think two separate processes with their own single-threaded GC will always be more efficient than a concurrent GC running two threads. The general case of GCing concurrent threads that share mutable state is fundamentally more difficult than having two threads that share no state with separate GCs.
Jon Harrop <j...@ffconsultancy.com> writes: > Given OCaml remains both simpler and faster in all cases, is concurrent GC > worth having? Can it scale as well? Can it be as efficient?
Of course. Look at the cost of marshalling those large structures between processes.
Daniel C. Wang wrote: > So long as the two threads don't share any common mutable state. I think > two separate processes with their own single-threaded GC will always be > more efficient than a concurrent GC running two threads. The general > case of GCing concurrent threads that share mutable state is > fundamentally more difficult than having two threads that share no state > with separate GCs.
So concurrent GC makes communication cheaper at the expense of everything else?
> Daniel C. Wang wrote: > > So long as the two threads don't share any common mutable state. I think > > two separate processes with their own single-threaded GC will always be > > more efficient than a concurrent GC running two threads. The general > > case of GCing concurrent threads that share mutable state is > > fundamentally more difficult than having two threads that share no state > > with separate GCs.
> So concurrent GC makes communication cheaper at the expense of everything > else?
It seems to me, that marshalling can be worked around by implementing specific serializer/deserializer functions for data structures being processed.
Thanks to the binary I/O API of the ExtLib[1], I was able to reduce the size of data produced by marshalling records. Though, I did not benchmark performance.
Vityok <bob...@ua.fm> writes: > It seems to me, that marshalling can be worked around by implementing > specific serializer/deserializer functions for data structures being > processed.
I'm more familiar with Haskell than ML so forgive me for using Haskell notation. Let's say I have a list of pairs (a,b):
x :: [(Integer, Integer)]
I want to add up all the a's, and I also want to add up all the b's:
sum(map fst x) sum(map snd x)
Because x is a very long list, I want to use separate threads or processes for these summations, to do them in parallel.
Can any method involving serialization do anywhere near as well as just letting both threads access x directly?
Jon Harrop wrote: > Daniel C. Wang wrote: >> So long as the two threads don't share any common mutable state. I think >> two separate processes with their own single-threaded GC will always be >> more efficient than a concurrent GC running two threads. The general >> case of GCing concurrent threads that share mutable state is >> fundamentally more difficult than having two threads that share no state >> with separate GCs.
> So concurrent GC makes communication cheaper at the expense of everything > else?
That's my basic claim. Note that it only makes shared memory communication cheaper. If you're going to distribute your job across machines. It doesn't even help! For the large class of embarrassingly parallel apps that have little communication costs. The overhead of concurrent GC which often requires the addition either a read or write barrier along with some mutator synchronization method. It doesn't seem worth the trouble, and if you really have a big job you're going to distribute anyway, so exploiting shared memory isn't going to buy you much.
If you have a single list producer that sends the components of each pair to two consumers you're just marshalling an int. So if you marshaling on demand in a lazy fashion I see no reason to see why sharing "x" would be more efficient.
Remember in a shared memory system the data goes over the memory bus to the cache of each processor. The shared memory model is really just an interface to message passing hardware. In the end it's all about bits going over wires. There's nothing magical about shared memory.
I can always decompose my app into a "memory server" that reads addresses and returns values, and a bunch of concurrent threads that request locations and receive values back. This will give me the same algorithmic complexity of any shared memory system.
Of course you could decompose your app in a smarter way and get potentially better algorithmic complexity and scalability. However, if you use shared memory you're basically limited to the "memory sever" model I described. Of course the hardware does a good job of making it fast. But for really parallel apps you hit a scaling limit and end up rolling your own message passing in the end to get beyond it.
MPI exists for a reason. BTW this message is from the CAML list is illustrative of my point.
Paul Rubin wrote: > Vityok <bob...@ua.fm> writes: >> It seems to me, that marshalling can be worked around by implementing >> specific serializer/deserializer functions for data structures being >> processed.
> I'm more familiar with Haskell than ML so forgive me for using Haskell > notation. Let's say I have a list of pairs (a,b):
> x :: [(Integer, Integer)]
> I want to add up all the a's, and I also want to add up all the b's:
> sum(map fst x) > sum(map snd x)
> Because x is a very long list, I want to use separate threads or > processes for these summations, to do them in parallel.
> Can any method involving serialization do anywhere near as well as > just letting both threads access x directly?
Paul Rubin <http://phr...@NOSPAM.invalid> wrote: >Jon Harrop <j...@ffconsultancy.com> writes: >> Given OCaml remains both simpler and faster in all cases, is concurrent GC >> worth having? Can it scale as well? Can it be as efficient?
>Of course. Look at the cost of marshalling those large structures >between processes.
Erlang uses copying to pass messages, even for co-operative threads within a single OS process, and it isn't crippling. Erlang has a separate GCed heap for each thread which makes dealing with garbage cheap: small heaps, often short-lived, are easier to deal with than big sprawling ones. It also translates well to SMP and distributed settings.
The Erlang system can be run in an alternative mode with a shared heap and pointer passing for messages, or a hybrid mode where allocated data is placed on the shared or per-process heap depending on whether static analysis predicts that it'll be part of a message.
Tony. -- f.a.n.finch <d...@dotat.at> http://dotat.at/ ROCKALL MALIN: EAST OR NORTHEAST BECOMING VARIABLE 3 OR 4. SLIGHT OCCASIONALLY MODERATE. SHOWERS, FOG PATCHES. MODERATE OR GOOD, OCCASIONALLY VERY POOR.
Paul Rubin wrote: > Vityok <bob...@ua.fm> writes: >> It seems to me, that marshalling can be worked around by implementing >> specific serializer/deserializer functions for data structures being >> processed.
> I'm more familiar with Haskell than ML so forgive me for using Haskell > notation. Let's say I have a list of pairs (a,b):
> x :: [(Integer, Integer)]
> I want to add up all the a's, and I also want to add up all the b's:
> sum(map fst x) > sum(map snd x)
> Because x is a very long list, I want to use separate threads or > processes for these summations, to do them in parallel.
> Can any method involving serialization do anywhere near as well as > just letting both threads access x directly?
That is embarassingly parallel so it needs no serialization => forking will be faster than concurrent GC.
Indeed, I'm hard pressed to find a practically-important application that will be better off with a concurrent GC...
> That is embarassingly parallel so it needs no serialization => forking will > be faster than concurrent GC.
> Indeed, I'm hard pressed to find a practically-important application that > will be better off with a concurrent GC...
I imagine there are apps where a shared heap can get you a non-trivial constant factor. However, if you care about scalability the concurrent GC of a shared heap can only get in your way. I suspect you can prove the statement above in a formal way.
>>>>> "Jon" == Jon Harrop <j...@ffconsultancy.com> writes:
Jon> Given OCaml remains both simpler and faster in all cases, is Jon> concurrent GC worth having? Can it scale as well? Can it be as Jon> efficient?
Keep in mind that unix processes are quite heavy. If your problem calls for lots of processes and e.g. client-server patterns, you will suffer so much overhead wielding the processes that multiple cores will not save you.
So if you have a program that can be partitioned into few fairly large chunks which can run independently of each other, then by all means, this will work well.
You *could* say that Erlang uses roughly the same strategy but with processes that are orders of magnitude cheaper than unix processes. GC is done per process, and it has been demonstrated to scale very well with multicore.
BR, Ulf W -- Ulf Wiger, Senior Specialist, / / / Architecture & Design of Carrier-Class Software / / / Team Leader, Software Characteristics / / / Ericsson AB, IMS Gateways
> Remember in a shared memory system the data goes over the memory bus > to the cache of each processor. The shared memory model is really just > an interface to message passing hardware. In the end it's all about > bits going over wires. There's nothing magical about shared memory.
The difference is it doesn't need a bunch of extra instructions to serialize and deserialize the int, context switches to communicate between processes, etc.
> I can always decompose my app into a "memory server" that reads > addresses and returns values, and a bunch of concurrent threads that > request locations and receive values back. This will give me the same > algorithmic complexity of any shared memory system.
Sure, and concurrent GC will have the same algorithmic complexity as single-threaded GC. But for real problems, the constant factors matter.
For that matter, there are different measures of algorithmic complexity. There is time and space complexity which are familiar, but there is also dollar complexity (the asymptotic cost in dollars of solving a problem instance of size N, as a function of N). The dollar complexity significantly depends on the machine architecture (e.g. whether there is shared memory, and how it works). Of course this is obvious: an O(N) algorithm on a conventional computer might be O(N**2) on a 1-tape Turing machine of the same hardware cost and depreciation schedule, so the conventional computer can solve more problems per dollar. Even fancier memory architecture turns out to asymptotically beat conventional computers in dollars per solution on some problems, as observed in Bernstein's famous paper on NFS circuits:
(see remark at top of page 5.) I don't see an obvious argument that a shared-memory multiprocessor computer can't _asymptotically_ beat a non-shared multicomputer system of the same hardware cost, as a function of increasing hardware cost.
> I don't see an obvious argument that a > shared-memory multiprocessor computer can't _asymptotically_ beat a > non-shared multicomputer system of the same hardware cost, as a > function of increasing hardware cost.
I don't see any argument either. However, I will observer that a supercomputer these days is a bunch of commodity CPUs connected via Ethernet.
A lot of my misspent "youth" in grad school was spent listening to people talk about how wonderful distributed shared memory was. I'd always ask annoying questions like "what's wrong with MPI". We'll the dust has settled and DSM and NUMA machines doing HPC apps for the military, have been replaced by Beowulf clusters running Monte Carlo simulations for hedge funds. Go figure.
>>>>> "Paul" == Paul Rubin <http://phr...@NOSPAM.invalid> writes:
Paul> For that matter, there are different measures of algorithmic Paul> complexity. There is time and space complexity which are Paul> familiar, but there is also dollar complexity (the asymptotic Paul> cost in dollars of solving a problem instance of size N, as a Paul> function of N).
The cost factor is often forgotten. It's so easy to think that you can just throw hardware at the problem, since hardware is so cheap.
I've learned to consider manufacturing cost a very important factor when delivering systems - partly because, in our business, we always deliver both hardware and software. Once written, the distribution cost of software is insignificant, but the distribution cost of hardware eats right into your profit margin.
To make matters worse, our hardware needs to be NEBS3 compliant, which means it isn't even cheap anymore.