I had naively thought that Erlang would be sending process messages
around profiting from the fact that data is immutable and NOT copying
it. And really just send a 'pointer'.
Why is that not so, it should be a huge gain from immutable data,
especially with bigger data. You don't have to copy, knowing that your
data is immutable.
Thanks for a link or a brief,
Henning
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions
It makes garbage collection much easier and more predictable as the
heaps of each process then can be completely isolated. There have been
attempts to make a hybrid heap which would allows non copy message
passing, but afaik the attempts have been abandoned.
Lukas
Did you ever do metalwork at school?
They teach you: "see that bump in the metal? If you just bang that flat with a hammer, the
distortion will just pop up somewhere else. You have to smooth things out by tapping gently
all over the place, spreading the distortion out so it's not noticeable."
It's like that with computing costs. If you make one thing cheaper, you often make something
else more expensive. Messages should normally be *small*, so that they are cheap to copy (any
bulk data is normally a binary, and large binaries *are* shared between processes within a
node). Sharing a heap means that garbage collection is now a nasty lots-of-threads-interacting-
with-a-giant-mutable-data-structure problem, affecting *everything* all the time.
A shared heap has been tried for Erlang, and it will doubtless be tried again in the future.
Have there been attempts in moving structures from one process to the
next (instead of copying), using escape analysis to see whether a
structure can be safely moved?
Using escape analysis to determine if the whole message should be
copied or only a reference would require a rewrite of the GC from
copy-and-collect to something like mark and sweep. And I'm doubtful
that many cases could be identified where a none of the terms part of
the data sent would be kept in the process sending the message.
Lukas
Currently the content of the message is copied from one process heap
to the next, in the case of local messages, correct?
If the runtime knew the message (and its components) is not re-used
after sending it, it could *move* the whole message from one heap to
the next skipping the copy step (but needing to cooperate a bit with
the GC I guess, in order to invalidate the local pointer in the
origin process).
> Using escape analysis to determine if the whole message should be
> copied or only a reference would require a rewrite of the GC from
> copy-and-collect to something like mark and sweep.
Not sure why.
> And I'm doubtful
> that many cases could be identified where a none of the terms part of
> the data sent would be kept in the process sending the message.
Yes, so am I. It would mostly be useful for small one-off messages, and
because Erlang has no "unique" semantics (à la rust) it wouldn't work
when messages are being sent from a function (which is pretty much
always the case when using OTP).
>> And I'm doubtful
>> that many cases could be identified where a none of the terms part of
>> the data sent would be kept in the process sending the message.
>
> Yes, so am I. It would mostly be useful for small one-off messages, and
> because Erlang has no "unique" semantics (à la rust) it wouldn't work
> when messages are being sent from a function (which is pretty much
> always the case when using OTP).
This topic has been discussed several times on this m.l. and in other places.
If you google "erlang shared heap" and the same + paper or +
dissertation, you'll find many useful references, analyzing the pros
and cons of the various solutions.
For example, you may start from here
http://scholar.google.com/scholar?q=erlang+shared+heap
HTH
P.
----- Original Message -----
> From: Masklinn <mask...@masklinn.net>
> To: Lukas Larsson <lu...@erlang-solutions.com>
> Cc: Erlang Questions <erlang-q...@erlang.org>
> Sent: Thursday, February 23, 2012 9:46 AM
> Subject: Re: [erlang-questions] Why are messages between processes copied?
>
> On 2012-02-23, at 09:22 , Lukas Larsson wrote:
>> Not sure what you mean by moving structures as opposed to copying?
>
> Currently the content of the message is copied from one process heap
> to the next, in the case of local messages, correct?
>
> If the runtime knew the message (and its components) is not re-used
> after sending it, it could *move* the whole message from one heap to
> the next skipping the copy step (but needing to cooperate a bit with
> the GC I guess, in order to invalidate the local pointer in the
> origin process).
I think Richard Carlsson et al tried something similar, which I recall as follows (sorry if it's garbled): allocate data that may be sent in a global heap, while definitely private data goes in the local heap of the process. Classify the data using static analysis. Message sends then become pointer management rather than copying. I'm not sure of the outcome, percent gained etc.
Another approach to message passing I liked was Hippo, a prototype by Erik Stenman and Sven-Olof Nyström. Complex and finicky but novel thinking and potentially even lower cost. Would need a lot of development to go anywhere though.
None of these attempts (as well as others with shared or hybrid heaps, available in older OTP distros) seem to have really taken root. At a guess, the current runtime system does well enough on "normal programs" that we're sufficiently well-off as we are. In a previous life and for a previous runtime system, we measured that message passing _might_ become a practical bottleneck if the sequential runtime system (using Hipe) was 2x faster or so. I suspect this factor may have increased in the meantime due to various OTP hacks.
Best regards,
Thomas
That's a bit different, what I talk about isn't semantically shared
heaps (although it could be implemented as such).
> None of these attempts (as well as others with shared or hybrid heaps, available in older OTP distros) seem to have really taken root. At a guess, the current runtime system does well enough on "normal programs" that we're sufficiently well-off as we are. In a previous life and for a previous runtime system, we measured that message passing _might_ become a practical bottleneck if the sequential runtime system (using Hipe) was 2x faster or so. I suspect this factor may have increased in the meantime due to various OTP hacks.
Yeah that's about what I'd expect as well, these approaches could work
if the static analysis was easy and the behavior was easy to understand
(which is what they're trying with Rust), otherwise it's just an (often
minor) optimization.
As soon as you go distributed this is not the case anymore though. One
of the hunches I have is that a copying system is simpler and is easier
to adapt to newer hardware, but I am not very good at divination. So I
would really be vary of embarking on a large project of rewrites to
obtain a performance gain - especially right now. There also seem to be
some work on hardware-based STM. That could probably benefit Erlang if
you made it work a bit like an ETS table - with a little work.
--
Jesper Louis Andersen
Erlang Solutions Ltd., Copenhagen, DK
Yes, it was part of the Hybrid Heap scheme. With the hybrid heap, you
allocate data on the local heaps first, but if the data gets sent as a
message, you move it to a global "message heap". Data that is already on
the message heap doesn't need to be copied again, so redirecting a
received message (or parts thereof) to another process just copies the
pointer. Testing where the data resides is done at runtime as part of
the send operation, and consists of a cheap pointer comparison, so
static analysis is not strictly needed, but only data that gets sent
more than once will benefit from the "raw" hybrid heap.
The main improvements could be had from what we named "message
analysis", figuring out what data structures were likely to end up as a
message and allocating them directly on the message heap. This analysis
had some interesting properties, since neither under- nor
overapproximation would cause errors (because the runtime checks for
copying will still be done).
Message analysis (http://dl.acm.org/citation.cfm?id=1146813) is
essentially the inverse of what Masklinn mentioned, escape analysis.
Because of Erlang's dynamic nature with modules reloaded on the fly,
escape analysis has to be so conservative that it misses most
opportunities for local allocation and forces most data to be allocated
on the global heap. This makes it more like a variant of the shared heap
architecture, which has too many problems with garbage collection to be
really viable.
Our measurements with the hybrid heap showed a lot of promise. However,
this work was done before the support for multithreading was added to
Beam, and ironically, while we saw it as the future heap architecture
for a multithreaded Beam, it was the major hackage done in the VM to
support multithreading that broke our implementation. By then, both
Jesper Wilhelmsson and I had moved on to full time jobs elsewhere, so
the work was orphaned. I still think it's probably a very good idea, if
Ericsson can find the people and time to get it running again. But it
will take quite a bit of knowledge, in particular about the GC (which
was Jesper's area).
/Richard
If the message already exists on the local heap of the sender, there is
no difference between "moving" it to the heap of the receiver, and
copying it (as the send operation does) and letting it become garbage on
the sender's heap if the sender has no more use for it. So maybe you're
thinking of allocating the message on the receiver's heap already when
the message is being created? That's pretty much the idea with the
hybrid heap using message analysis, except that we allocated the data on
a shared message heap (because the same data might be sent again, and
should then not need any further copying).
/Richard
A major revelation for me when I took a VLSI design course a few
decades ago: digital signals crossing chips don't move at the speed of
light, they take *quadratic* time as a function of distance. And we
haven't even started talking about power requirements, especially when
signals have to propagate off the chip.
One might take great comfort in a shared heap, thinking "ah, there's
so little copying now." But if stuff has to get copied around the
memory hierarchy anyway (albeit transparently via cache coherence and
cache replenishment), you're not really winning much. You might
*think* you're enjoying a wealth of computer power. But spare a
thought for the hapless geek who inherits your code and is asked to
repartition your system so that processes that used to share an
address space no longer do. What might have been performance
bottleneck so wide as to be only subliminally confining will narrow
down to ... the hypodermic needle through the camel's eye. And camels
can kick hard. (They also spit, I'm told.) The caravan will move on
without you, while you desperately repack everything the right way,
and beg rides from others. So much for your wealth of computing power
then, eh?
-michael turner
Imagine process A sends a message to PidB (ie says PidB ! Msg)
We're not supposed to know where PidB is (physically) - now suppose
Pib is either on the same
machine OR a different machine - the failure semantics are *entirely
different* - we don't want any danging references. We don't want a
pointer on the machine hosting PidB to point to
a machine that has crashed.
Getting this to work consistently with failures is very difficult -
sharing pointers makes it even more difficult.
Now it so happens that on modern massive multicores and in particular
large network on chip architectures, message passing times between
cores varies a lot - if the cores are adjacent
(on the chip) it can be very fast - but if they are separated this can
take far longer (I have seen factors of 60 here) - but once copied
data access becomes a local operation and is pretty fast.
Once copied to a different core GC becomes a parallel operation - this is
good :-)
In the future we can expect this trend to continue - message passing
to nearby cores will be
quick and to far away cores expensive - so it absolutely makes sense
to perform the expensive
operation once (copying) and the cheap operation (accessing local
data) fast (which it would not
be if it were via a pointer to a remote core) .
Erlang accidentally has the right properties to exploit multi-core
architectures - not by design
but by accident. The underlying reason was to make the semantics of
local and remote failures the same.
/Joe
FWIW there's previous work on this, one paper I came across recently:
Multicore Scheduling for Lightweight Communicating Processes
Carl G. Ritson, Adam T. Sampson, and Frederick R. M. Barnes
http://www.cs.kent.ac.uk/pubs/2009/2928/content.pdf
From the abstract:
Runtime heuristics dynamically group processes into cache-affine work
units based on communication patterns. Work units are then distributed
via wait-free work-stealing. Initial performance analysis shows that,
using the algorithms presented in this paper, process-oriented software
can execute with an efficiency approaching that of optimised sequential
and coarse-grain threaded designs.
BR,
-- Jachym
Yes and no :-)
There is a basic assumption in erlang - that time taken for Pid ! Msg depends
upon the size of the Messang an *not* upon Pid - assuming Pid is on
the same node. This may or may not be true on a multi core - it
depends very much on the chipset/motherboard/...
You can improve performance on a multicore by pinning processes to a
particular core - but the core will be very non-portable - ideally the
Erlang
run-time will figure out where to place processes for you - but this is
an area of performance tuning that is not well understood.
Moving processes around so they run on the same core is something
we should investigate.
As with all optimisations the best advice is "don't"
/Joe
I decided to try to test out the speed difference based on whether the
message being sent was a binary or not, so tried the following
-module(send_speed).
-compile(export_all).
time_it(NumProcesses, ListLength) ->
Term = random_list(ListLength),
Bin = term_to_binary(Term),
{Mterm, ok} =
timer:tc (send_speed, send_to_processes, [Term, spawn_N(NumProcesses)]),
{Mbin, ok} =
timer:tc (send_speed, send_to_processes, [Bin, spawn_N(NumProcesses)]),
{iolist_size (Term), iolist_size (Bin), Mterm, Mbin}.
random_list(N) ->
random:seed(now()),
[ random:uniform(255) || _ <- lists:seq(1,N) ].
spawn_N (N) ->
[ spawn(fun() -> receive _ -> ok end end) || _ <- lists:seq(1,N) ].
send_to_processes(Thing, Processes) ->
[ P ! Thing || P <- Processes ],
ok.
Then ran this in the shell as follows
Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0]
[hipe] [kernel-poll:false]
Eshell V5.8.5 (abort with ^G)
1> c(send_speed).
{ok,send_speed}
2> [ send_speed:time_it (10000, L) || L <- lists:seq (10,200,10) ].
[{10,14,7656,9396},
{20,24,8892,7580},
{30,34,10291,8261},
{40,44,10766,8411},
{50,54,11213,8080},
{60,64,11697,7802},
{70,74,12578,8802},
{80,84,13281,10130},
{90,94,13686,9701},
{100,104,13589,10325},
{110,114,15563,9724},
{120,124,21094,9751},
{130,134,18604,10466},
{140,144,19157,10178},
{150,154,19716,9812},
{160,164,18653,9582},
{170,174,19541,10453},
{180,184,20433,8898},
{190,194,21181,10227},
{200,204,20349,9916}]
3>
The four columns are the size of a list, the size in bytes of the binary
and results of sending to 10000 processes. So seems like they correllate
pretty well up until the magic 64 byte value, at which point the sending
of binaries seems to remain cheaper.
Now, I don't do anything with the message on the other side and once you
call binary_to_term you'll incur a copy out of the shared space, but if
you were able to do something with the binary without turning it back
to the term this might not be so bad.
Hopefully I didn't mess up my test too horribly,
-Anthony
--
------------------------------------------------------------------------
Anthony Molinaro <anth...@alumni.caltech.edu>
I am trying to understand how moving is different from copying.
>
>> Using escape analysis to determine if the whole message should be
>> copied or only a reference would require a rewrite of the GC from
>> copy-and-collect to something like mark and sweep.
>
> Not sure why.
Because if any part of a message is not copied,
that part is still in the original process's heap,
and even though it might have no references in the original
process, is still not garbage. Indeed, the sending process
might die.
Before trying to get fancy, the big question is
"DO WE HAVE A PROBLEM?"
If you have benchmarks showing that message copying is taking up
a substantial proportion of your system's time, then we have a
problem. If all of the data in a message are actually *examined*
by the receiver, it is difficult to see how copying could be a
*huge* problem.
Another issue of course is the process synchronisation required
during a message send. Using some kind of lock-free "add a node
to a queue" approach might well be more scalable than something
using per-process locks. If, that is, there are processes that
more than a few other processes want to send messages to.
Benchmarks! Numbers! Real Engineering!