[erlang-questions] Fishing for best practices: distributed twin processes!

Roberto Ostinelli

Sep 14, 2012, 6:27:17 PM
to Erlang
Dear list,

I have a 2-layer architecture where:
  • the first layer is handling the socket connections to the outside world clients
  • the second layer is performing computations related to the connected clients (i.e. the core application)
It is built in this way so that we can make the socket handlers very stupid. Therefore, crashes on the first layer are less likely to occur, and in case of crashes of the second layer (the core application) we don't have all the clients trying to reconnect at the same time: they'll still be connected (even though receiving a 'service unavailable' message of some kind).

For every single process handling a socket connection on the first layer I therefore need to have a 'twin' process on the second layer, strictly connected to its socket manager counterpart.

I'm fishing for best practices here:
  • how can I ensure that if the second layer goes down (VM crash, server hit by an angry sysadmin), the first layer socket processes know that their twin process is down (and the other way around)?
  • what is the best way to handle the creation of the twin processes?
Any input welcome. ^^_

r.

Michael Truog

Sep 14, 2012, 6:47:00 PM
to Roberto Ostinelli, Erlang
Assuming you have the 2 layers in separate Erlang VMs, you can connect the VMs with distributed Erlang and have the twin processes monitor each other.  If you want both processes to simply die when either one dies, you could use a link instead of 2 monitors.  That seems like the simplest solution and avoids unnecessary complexity.  You might see strange behaviour with distributed Erlang if you stop using the default net tick time (e.g., with a process link between nodes), but you probably know it is best not to play with that.
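
A minimal sketch of the mutual-monitor idea, not anyone's actual implementation. The module name twin_sketch and the node name passed as CoreNode are assumptions made for illustration, and the module has to be loaded on both nodes for the remote spawn to work.

```erlang
-module(twin_sketch).
-export([start_twin/1, await_down/2, twin_init/1]).

%% Runs in a first-layer socket handler. CoreNode is the name of the
%% second-layer node (hypothetical, e.g. 'core@host').
start_twin(CoreNode) ->
    Twin = spawn(CoreNode, ?MODULE, twin_init, [self()]),
    Ref = erlang:monitor(process, Twin),
    {Twin, Ref}.

%% Blocks until the twin is reported down.
await_down(Twin, Ref) ->
    receive
        {'DOWN', Ref, process, Twin, Reason} ->
            {twin_down, Reason}
    end.

%% Runs on the second layer and monitors the handler in return.
twin_init(Handler) ->
    Ref = erlang:monitor(process, Handler),
    twin_loop(Handler, Ref).

twin_loop(Handler, Ref) ->
    receive
        {'DOWN', Ref, process, Handler, _Reason} ->
            ok;                       % handler (or its node) is gone: stop
        _Other ->
            twin_loop(Handler, Ref)   % placeholder for the real work
    end.
```

A real handler would fold the 'DOWN' clause into its main receive loop instead of blocking in await_down/2.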
_______________________________________________ erlang-questions mailing list erlang-q...@erlang.org http://erlang.org/mailman/listinfo/erlang-questions

Roberto Ostinelli

Sep 14, 2012, 9:46:23 PM
to Michael Truog, Erlang
hello Michael,

You're assuming right (separate VMs). I'm familiar with links and monitors, thank you. However, I doubt that any message is sent from a dying process if the VM on which it runs actually blows up. That was my point.

r.

Michael Truog

Sep 14, 2012, 9:55:04 PM
to Roberto Ostinelli, Erlang
Yes, when the remote process dies you will get a message from a monitor or link to a process on a remote node; this requires that the nodes be connected.  You can get premature deaths if the net tick time is too long (I'm not sure why; it seemed to be some internal assumption that is not controlled by the net tick time. I saw this in an older release, but I assume it is still the same), so it is best to stick with the default net tick time unless you want to test that mechanism a lot.

Michael Truog

Sep 14, 2012, 9:57:48 PM
to Roberto Ostinelli, Erlang
Just to be clear, the remote VM dying helps initiate the death of the links or monitors, since the node is shown as down locally.  You can catch that condition separately with node monitoring.
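
One way to tell "the twin died" from "the whole node is gone" is sketched below, under the assumption that the handler both monitors its twin and calls erlang:monitor_node/2. The module and function names are hypothetical. The sketch relies on two documented behaviours: monitor_node delivers {nodedown, Node}, and a process monitor reports the reason noconnection when the connection to the remote node is lost.

```erlang
-module(down_reason_sketch).
-export([watch_node/1, classify/1]).

%% Ask the runtime for a {nodedown, Node} message if Node goes away.
watch_node(Node) ->
    erlang:monitor_node(Node, true).

%% Map an incoming message to the condition it signals.
classify({nodedown, _Node}) -> node_lost;
classify({'DOWN', _Ref, process, _Pid, noconnection}) -> node_lost;
classify({'DOWN', _Ref, process, _Pid, _Reason}) -> twin_died;
classify(_Other) -> unrelated.
```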

Torben Hoffmann

Sep 15, 2012, 2:21:49 AM
to Roberto Ostinelli, Erlang
Hi Roberto,

Try contacting Laura Castro from Uni of Coruña who presented at the Erlang Workshop 2012 yesterday regarding handling of netsplits and node crashes.
Their team had to go through some investigations before they got the resilience they had promised the customer!

The big question for you will actually be what to do when a layer thinks that the other node is down: will you assume a net split and buffer communication (if that is feasible for your application)? Or will you assume node down and do a major clean-up?

Cheers,
___
 /orben

Max Bourinov

Sep 15, 2012, 2:46:08 AM
to Torben Hoffmann, Erlang, Roberto Ostinelli
I'm afraid that in some cases a dying VM won't send anything. An additional technique that might help in this situation is a watchdog process, but it brings another level of abstraction/complexity.

In our project we also need to monitor distributed systems and we will implement watchdogs.


Robert Virding

Sep 16, 2012, 5:03:27 PM
to Max Bourinov, Erlang, Roberto Ostinelli
The dying VM will of course not send anything, it's dead. Your node, however, knows of all the links and monitors its processes have with other nodes. It will detect that it has lost contact with the other node and then send the exit signals and monitor information to its local processes. This may take some time before it happens.

Robert
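
The link variant of the same mechanism can be sketched as follows; link_sketch is a hypothetical module name. With trap_exit set, the exit signal that the local node generates on behalf of the lost twin arrives as an ordinary message. For a lost node connection the reason is noconnection.

```erlang
-module(link_sketch).
-export([link_and_wait/1]).

%% Links the caller to Twin and blocks until the link breaks.
link_and_wait(Twin) ->
    %% Without trap_exit, an abnormal exit signal would kill this process
    %% instead of being delivered as a message.
    process_flag(trap_exit, true),
    true = link(Twin),
    receive
        {'EXIT', Twin, Reason} ->
            {twin_exit, Reason}
    end.
```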


Roberto Ostinelli

Sep 25, 2012, 12:27:30 PM
to Robert Virding, Erlang
Thank you all for the answers.


> The dying VM will of course not send anything, it's dead. Your node, however, knows of all the links and monitors its processes have with other nodes. It will detect that it has lost contact with the other node and then send the exit signals and monitor information to its local processes. This may take some time before it happens.

Robert, could you please clarify what 'some time' may mean? I'm wondering if it's better:

  • to have all processes linked between nodes, and receive the failover signal (may be thousands of them)
  • to have a single 'node monitoring' process which will then deal with a node failure
Any input welcome :)

r.
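
The second option, a single node-monitoring process, might look roughly like the sketch below. The module name, the Subscriber argument and the layer_available / layer_unavailable messages are all made up for illustration; only net_kernel:monitor_nodes/1 and the {nodeup, Node} / {nodedown, Node} messages come from OTP.

```erlang
-module(node_monitor_sketch).
-export([start/1, init/1]).

%% Subscriber is any pid that wants to hear about node failures.
start(Subscriber) ->
    spawn(?MODULE, init, [Subscriber]).

init(Subscriber) ->
    %% Subscribe to {nodeup, Node} and {nodedown, Node} messages.
    %% The return value is ignored because it can differ when the local
    %% node is not running in distributed mode.
    _ = net_kernel:monitor_nodes(true),
    loop(Subscriber).

loop(Subscriber) ->
    receive
        {nodedown, Node} ->
            Subscriber ! {layer_unavailable, Node},
            loop(Subscriber);
        {nodeup, Node} ->
            Subscriber ! {layer_available, Node},
            loop(Subscriber);
        stop ->
            ok
    end.
```

With this shape, thousands of socket handlers would react to one notification from the monitor process rather than each receiving its own exit signal. Which of the two is preferable is the open question of the thread.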

Mike Oxford

Sep 25, 2012, 2:05:39 PM
to Roberto Ostinelli, Erlang
"Some time" is the net tick time (the kernel parameter net_ticktime), which defaults to 60 seconds, with a tick sent every quarter of that, i.e. every 15 seconds.  You can detect failures sooner by lowering it (more frequent ticks), at the cost of a potentially much greater increase in network traffic.
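
To put numbers on this: the kernel documentation says a node is considered down if it has not responded within a window of TickTime - TickTime/4 to TickTime + TickTime/4 seconds. The sketch below computes that window; ticktime_sketch is a hypothetical module name, while net_kernel:get_net_ticktime/0 is the OTP call. The value can be set at startup with, for example, erl -kernel net_ticktime 20.

```erlang
-module(ticktime_sketch).
-export([detection_window/1, current_window/0]).

%% Given net_ticktime in seconds, return {MinT, MaxT} in seconds.
detection_window(TickTime) when is_integer(TickTime), TickTime > 0 ->
    Quarter = TickTime / 4,
    {TickTime - Quarter, TickTime + Quarter}.

%% Window for the running node, or whatever net_kernel reports when no
%% plain value is available (ignored | {ongoing_change_to, NewTickTime}).
current_window() ->
    case net_kernel:get_net_ticktime() of
        TickTime when is_integer(TickTime) -> detection_window(TickTime);
        Other -> Other
    end.
```

With the default of 60 seconds, detection_window(60) gives {45.0, 75.0}, so a dead node is noticed somewhere between 45 and 75 seconds after it stops responding.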

The issue surrounds "half closed sockets", which is a general TCP problem and less of an Erlang-specific issue.  (I.e., it is a problem all networked applications have to deal with, including the upcoming WebSockets stuff.)

-mox
