[erlang-questions] Automatically reconnecting nodes when they come back online

Scott Thoman

Apr 26, 2013, 1:00:35 PM
to erlang-q...@erlang.org
To all who know more about this than I do:

First, I'm just beginning to learn about Erlang/OTP so I figured I'd
use it to implement something useful.

Part of what I'd like to build will involve a "conductor" controller
node that directs some other "player" nodes to all do something at
approximately the same time - ultimately to actually test the
operation of another piece of distributed software. As part of those
operations, I expect the player nodes may sometimes crash (actually
cause a Windows BSOD in some cases) and then eventually come back to
life.

What I'm wondering about is what some folks have found to be good ways
of getting nodes to rejoin the cluster when they come back to life.
The way I'm thinking about it now is that the player nodes will be
passive in the sense that they won't actively connect to any other
nodes - they'll only get connected when the conductor node invites
them in. I'm also not looking for fault tolerance on the conductor
node at this point; if that one fails badly I'll just get some coffee
and rerun the scenario again.

My first two thoughts were:
1. When the conductor node connects up the player nodes it would also
spawn a process whose sole job is to periodically ping the other nodes
to ensure they're connected. Then when one goes down, those pings
will just fail during that time but when the node comes back a ping
will reconnect it to the other nodes. All this time, I'd be
monitoring the node up/down messages.
2. I'd start by monitoring all the nodes as the conductor connects
them and when receiving a node down message, spawn a process whose job
it is to periodically ping only that node until it comes back.
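
A minimal sketch of the second idea, for illustration only. The module name `conductor_watch`, the function names and the 5 second retry interval are assumptions, not something given in the thread. It relies on `net_kernel:monitor_nodes/1` for the up/down messages and on `net_adm:ping/1` to re-establish a connection.

```erlang
-module(conductor_watch).
-export([start/1]).

%% Assumed retry interval in milliseconds (illustrative value).
-define(RETRY_MS, 5000).

%% Players is a list of node names, e.g. ['player1@host1'].
%% Spawns a watcher that subscribes to nodeup/nodedown messages
%% and starts one pinger per player to make the first connection.
start(Players) ->
    spawn(fun() ->
        ok = net_kernel:monitor_nodes(true),
        [spawn(fun() -> reping(N) end) || N <- Players],
        loop(Players)
    end).

loop(Players) ->
    receive
        {nodedown, Node} ->
            %% Only chase nodes this conductor was asked to manage.
            case lists:member(Node, Players) of
                true  -> spawn(fun() -> reping(Node) end);
                false -> ok
            end,
            loop(Players);
        {nodeup, _Node} ->
            %% A player is (re)connected; nothing else to do here.
            loop(Players)
    end.

%% Ping a single node until it answers. A successful ping sets up
%% the connection again, after which the watcher gets {nodeup, Node}.
reping(Node) ->
    case net_adm:ping(Node) of
        pong -> ok;
        pang -> timer:sleep(?RETRY_MS), reping(Node)
    end.
```

On the conductor this would be started with something like `conductor_watch:start(['player1@host1', 'player2@host2']).`, where the node names are placeholders. The players stay passive; only the conductor ever initiates a connection.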

Are there some good practices out there for systems that want to
behave like this?

Thanks in advance,

/stt

Joseph Wayne Norton

Apr 26, 2013, 1:13:11 PM
to Scott Thoman, erlang-q...@erlang.org
I don't have a direct answer to your question.

However, are you aware of the slave module?

Some of the recipe(s) in this module might be of use to you.

Scott Thoman

Apr 26, 2013, 1:35:47 PM
to Joseph Wayne Norton, erlang-q...@erlang.org
I'm not aware of it yet but I'll take a look...

Thanks,

Scott Thoman

Apr 26, 2013, 3:55:34 PM
to Joseph Wayne Norton, erlang-q...@erlang.org
It looks like the slave module won't quite work in my case since I'll
likely be in a heterogeneous environment where the controller is Linux
but, unfortunately, the machines-under-test will be Windows.

I will keep that in mind, though, if I need that functionality now
that I know it exists. :)

Ignas Vyšniauskas

Apr 29, 2013, 2:41:50 AM
to Scott Thoman, erlang-q...@erlang.org
On 04/26/2013 07:00 PM, Scott Thoman wrote:
> My first two thoughts were: 1. When the conductor node connects up
> the player nodes it would also spawn a process whose sole job is to
> periodically ping the other nodes to ensure they're connected. Then
> when one goes down, those pings will just fail during that time but
> when the node comes back a ping will reconnect it to the other
> nodes. All this time, I'd be monitoring the node up/down messages. 2.
> I'd start by monitoring all the nodes as the conductor connects them
> and when receiving a node down message, spawn a process whose job it
> is to periodically ping only that node only until it comes back.

* you don't need a pinging mechanism, just use the existing
`net_kernel:monitor_nodes(true)` and handle the events.
* if you can afford a fixed node name for at least the "conductor" node,
then you can do something along the lines you described yourself --
should be trivial.
* otherwise you can try to hack things using `net_adm:world()` or
something like that for "dynamic" node discovery

Also, take a look at the `{sync_nodes_optional, NodeList}` parameter of
`kernel`.
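
As a rough illustration of that parameter, a `sys.config` for the conductor might look like the fragment below. The node names and the 10 second timeout are placeholders. As far as I know, `sync_nodes_optional` is only honored when `sync_nodes_timeout` is also set, and it only affects what the node waits for at boot, not later reconnects.

```erlang
%% sys.config (node names and timeout are placeholder values)
[{kernel,
  [%% Nodes to wait for at startup; boot continues without them
   %% once the timeout below expires.
   {sync_nodes_optional, ['player1@host1', 'player2@host2']},
   %% Milliseconds to wait for the nodes listed above.
   {sync_nodes_timeout, 10000}]}].
```

The node would then be started with something like `erl -sname conductor -config sys`.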

--
Ignas

Dmitry Kolesnikov

Apr 29, 2013, 3:24:10 AM
to Scott Thoman, erlang-q...@erlang.org
Hello,

I am using the following approach for a similar issue.

- net_kernel:monitor_nodes allows your process to receive nodeup/nodedown events.

- the player nodes require a list of 'seed' nodes in their config file. Each player should connect to those seed nodes at boot time. If none of the seeds can be reached, the player node has to die with an alarm.
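
A hedged sketch of the seed-node step, not the actual code in use. The module name `player_boot` and the application environment key `{player, seeds}` are hypothetical, and the "alarm" is reduced here to a logged error followed by halting the node.

```erlang
-module(player_boot).
-export([connect_seeds/0]).

%% Reads the seed list from the (hypothetical) application
%% environment key {player, seeds} and pings every seed once.
connect_seeds() ->
    Seeds = application:get_env(player, seeds, []),
    case [N || N <- Seeds, net_adm:ping(N) =:= pong] of
        [] ->
            %% No seed answered: log the failure and stop this node
            %% with a non-zero exit status.
            error_logger:error_msg("no seed node reachable: ~p~n", [Seeds]),
            erlang:halt(1);
        Connected ->
            {ok, Connected}
    end.
```

The matching config entry would be something like `{player, [{seeds, ['conductor@host0']}]}` in `sys.config`, with the node name again a placeholder.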

- Dmitry