[erlang-questions] Future of epmd


Dmitry Demeshchuk

Nov 7, 2012, 2:03:05 AM
to erlang-questions
Hello, list.

As you may know, epmd may sometimes misbehave: it loses nodes and doesn't add them back, for example (unless you do some magic, like this: http://sidentdv.livejournal.com/769.html ).

A while ago, Peter Lemenkov had the wonderful idea that epmd could actually be written in Erlang instead. The EPMD protocol is very simple, and it's much easier to implement all the failover scenarios in Erlang than in C. So far, here's a prototype of his: https://github.com/lemenkov/erlpmd

When hacking it, I've noticed several things:

1. When we send ALIVE2_REQ and reply with ALIVE2_RESP, we establish a TCP connection, the closing of which is a signal of node disconnection. This approach does have a point, since we can use keep-alive and periodically check that the node is still here on the TCP level. But next, something weird follows:

2. When we send other control messages from a node connected to epmd, we establish a new TCP connection each time. We could use the main connection instead. Was it a design decision, or is it just a legacy thing?

3. The client (node) part of epmd seems to be all implemented in C and sealed inside ERTS. However, it seems like this code could be implemented inside the net_kernel module instead (or something similar).
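
For readers who have not looked at the wire format, here is a minimal sketch of the registration exchange from point 1. It assumes the field layout given in the ERTS distribution protocol documentation (a two-byte length prefix on requests, tag 120 for ALIVE2_REQ, tag 121 for ALIVE2_RESP); the function names and the default version numbers are illustrative choices, not taken from epmd or erlpmd.

```python
import struct

ALIVE2_REQ = 120   # 'x'
ALIVE2_RESP = 121  # 'y'


def encode_alive2_req(node_name, port, node_type=77, protocol=0,
                      highest_version=5, lowest_version=5, extra=b""):
    """Build a length-prefixed ALIVE2_REQ frame (layout assumed from the docs)."""
    name = node_name.encode("latin-1")
    body = struct.pack(">BHBBHHH", ALIVE2_REQ, port, node_type, protocol,
                       highest_version, lowest_version, len(name))
    body += name + struct.pack(">H", len(extra)) + extra
    # Requests to epmd carry a two-byte big-endian length before the body.
    return struct.pack(">H", len(body)) + body


def decode_alive2_resp(data):
    """Parse ALIVE2_RESP; returns (result, creation). A result of 0 means success."""
    tag, result, creation = struct.unpack(">BBH", data[:4])
    if tag != ALIVE2_RESP:
        raise ValueError("not an ALIVE2_RESP: tag %d" % tag)
    return result, creation
```

After a successful response the node simply keeps the socket open; nothing more is sent on it, and closing it is what unregisters the node.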


Why bother switching to Erlang when everything is already written and working? First of all, sometimes it doesn't really work in big clusters (see my first link). And, secondly, using Erlang we can easily extend the protocol. For example, we could add an auto-discovery feature, which has been discussed on the list a lot; add the ability for a node to reconnect if its TCP session has been terminated for some reason; or add lookups of nodes by prefix (like, "give me all nodes that match mynode@*"). The list can probably be extended further.
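
To make the prefix-lookup idea concrete, here is a small sketch of what such a query could do on the server side. Everything in it is hypothetical: the current protocol has no such request, and `registry` is just an assumed name-to-port mapping.

```python
from fnmatch import fnmatchcase


def match_nodes(registry, pattern):
    """Return the registered names matching a glob pattern such as 'mynode@*'.

    `registry` is an assumed dict of node name -> port; this is a
    hypothetical protocol extension, not part of epmd today.
    """
    return sorted(name for name in registry if fnmatchcase(name, pattern))
```

For example, `match_nodes({"mynode@a": 4000, "other@a": 4001}, "mynode@*")` would return only the first name.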


Do you think such a thing (with full backwards compatibility, of course) could go upstream? Also, a question for the epmd maintainers: is it going to change at all, or is the protocol considered complete enough for its purposes?

--
Best regards,
Dmitry Demeshchuk

David Mercer

Nov 7, 2012, 12:59:45 PM
to Dmitry Demeshchuk, erlang-questions

This seems like an outstanding idea. If I understand correctly, you could have the first node that starts up on a host also start the erlpmd application. If the node running erlpmd goes down, one of the other nodes on the same host starts the erlpmd application. Do I have this right?

Cheers,

DBM

Per Hedeland

Nov 7, 2012, 2:15:54 PM
to demes...@gmail.com, erlang-q...@erlang.org
Dmitry Demeshchuk <demes...@gmail.com> wrote:
>
>1. When we send ALIVE2_REQ and reply with ALIVE2_RESP, we establish a TCP
>connection, the closing of which is a signal of node disconnection. This
>approach does have a point, since we can use keep-alive and periodically
>check that the node is still here on the TCP level.

No, the point is rather the opposite - since this is always a local
loopback connection, epmd is guaranteed (by the OS/kernel) to
"immediately" find out that the erlang node died (or disconnected), by
means of socket close (EOF) - no matter how the death came about. TCP
keep-alives, that by necessity incur a delay (and the default is
typically huge) before detection of a problem, are not only inferior but
pointless in this scenario.

--Per Hedeland
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Dmitry Demeshchuk

Nov 7, 2012, 2:26:18 PM
to Per Hedeland, erlang-questions
Hmm. I thought that something like "kill -9" wouldn't inform us that the socket has been closed until we tried to do something with it. But I checked, and yes, you are right: keep-alive is even bad in that case, since without it we immediately get a {tcp_closed, Socket} message.
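
The behaviour discussed here can be reproduced outside Erlang. The sketch below is only an illustration, with a made-up function name: it accepts a loopback connection from a child process, kills the child without letting it clean up, and then reads from the accepted socket. No keep-alive is configured anywhere.

```python
import socket
import subprocess
import sys


def detect_peer_death(timeout=10.0):
    """Kill a connected child process and return what recv() then reports."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))       # loopback only, ephemeral port
    listener.listen(1)
    listener.settimeout(timeout)
    port = listener.getsockname()[1]

    # The child connects and then idles, standing in for a registered node.
    child_code = (
        "import socket, time\n"
        "s = socket.create_connection(('127.0.0.1', %d))\n"
        "time.sleep(60)\n" % port
    )
    child = subprocess.Popen([sys.executable, "-c", child_code])
    try:
        conn, _ = listener.accept()
        conn.settimeout(timeout)
        child.kill()                      # SIGKILL on POSIX, like "kill -9"
        child.wait()
        data = conn.recv(1)               # b"" means EOF: the peer is gone
        conn.close()
        return data
    finally:
        if child.poll() is None:
            child.kill()
        listener.close()
```

The kernel closes the dead process's socket on its behalf, so the read returns end-of-file right away, which is the same event Erlang reports as {tcp_closed, Socket}.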



Kukosa, Tomas

Nov 8, 2012, 3:58:59 AM
to erlang-questions

Hi,

BTW, what about using SCTP for the distribution protocol?

Wouldn't it bring some advantages in the high-availability area? E.g. because of multihoming or the possibility of setting timeout and retransmission parameters?

Best regards,

  Tomas


Kenneth Lundin

Nov 8, 2012, 11:45:41 AM
to Dmitry Demeshchuk, erlang-questions

We have discussed having epmd implemented in Erlang several times and think it would be a good idea for several reasons, especially if there is only one Erlang node per host. But it could work even if there are more nodes. Running a separate E-node just for the epmd service could also be an alternative. We have unfortunately not been able to give this high enough priority yet, so this initiative is interesting (I have not looked at the code and other details).

The benefits of having epmd implemented in Erlang would be:

- Easier to maintain
- Easier to extend
- Easy to prototype other solutions, for example heterogeneous distribution, secure epmd communication via TLS, etc.

The client part is already written in Erlang, see the erl_epmd module.

/Kenneth , Erlang/OTP Ericsson

Kenneth Lundin

Nov 8, 2012, 11:55:32 AM
to Kukosa, Tomas, erlang-questions


On 8 Nov 2012 09:59, "Kukosa, Tomas" <tomas....@siemens-enterprise.com> wrote:
>
> Hi,
>
> BTW, what about using SCTP for the distribution protocol?
>
> Wouldn't it bring some advantages in the high-availability area? E.g. because of multihoming or the possibility of setting timeout and retransmission parameters?

Yes, it would be interesting to have distribution over SCTP, and it is possible to implement it with the same plugin approach that is used for distribution over SSL.

This might have implications for epmd as well, and maybe heterogeneous distribution would be of interest too.

By heterogeneous distribution I mean that a node can talk SCTP with some other node and TCP with yet another. It would require some negotiation and/or registration in an extended epmd, where a node can say which protocols it supports and prefers.
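
One simple way such a negotiation could work is sketched below. It is a hypothetical illustration only: the protocol names, the function name, and the idea that an extended epmd would hand back a set of supported protocols are all assumptions, not an existing feature.

```python
def choose_transport(local_preferences, remote_supported):
    """Pick the first protocol we prefer that the remote node also supports.

    `local_preferences` is an ordered list, most preferred first;
    `remote_supported` is what a hypothetical extended epmd would report
    for the remote node. Returns None if there is nothing in common.
    """
    for protocol in local_preferences:
        if protocol in remote_supported:
            return protocol
    return None
```

With preferences `["sctp", "tcp"]`, a peer registered with `{"tcp", "sctp"}` would be contacted over SCTP, while a peer that only registered `{"tcp"}` would still be reachable over TCP.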

/Kenneth, Erlang/OTP Ericsson

Patrik Nyblom

Nov 8, 2012, 12:04:40 PM
to erlang-q...@erlang.org
Hi!

On 11/07/2012 08:03 AM, Dmitry Demeshchuk wrote:
> Hello, list.
>
> As you may know, epmd may sometimes misbehave: it loses nodes and doesn't add them back, for example (unless you do some magic, like this: http://sidentdv.livejournal.com/769.html ).

First of all, we have no bug reports where epmd loses nodes, except if you deliberately kill epmd or deliberately disconnect. I unfortunately cannot read the article you are referring to (the language is not one I understand), so I cannot explain what's going on there.

> A while ago, Peter Lemenkov had the wonderful idea that epmd could actually be written in Erlang instead. The EPMD protocol is very simple, and it's much easier to implement all the failover scenarios in Erlang than in C. So far, here's a prototype of his: https://github.com/lemenkov/erlpmd

Failover is usually not needed: it's one single process on a machine, and it should only stop if the machine stops. What scenario are we talking about here?

As epmd works today, a distributed Erlang node connects to a *local* epmd (it's after all just a portmapper, similar to many other portmappers), and tells it what name and port number it has. When the beam process ends (in some way or another) the socket gets closed and epmd is more or less instantly informed. Epmd survives starts and stops of Erlang nodes on the machine and is the single database mapping ports for Erlang distribution on the host.
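
The bookkeeping described above is small enough to sketch. This is an illustration under assumed names (`PortMapper`, `conn_id`), not the data structure the real epmd uses: each registration is tied to the connection it arrived on, and the close of that connection is the only unregistration signal.

```python
class PortMapper:
    """Toy name-to-port registry keyed by the registering connection."""

    def __init__(self):
        self._by_name = {}   # node name -> (port, conn_id)
        self._by_conn = {}   # conn_id -> node name

    def register(self, conn_id, name, port):
        """Record a node; refuse if the name is already taken."""
        if name in self._by_name:
            return False
        self._by_name[name] = (port, conn_id)
        self._by_conn[conn_id] = name
        return True

    def connection_closed(self, conn_id):
        """Called when the registering socket reports EOF."""
        name = self._by_conn.pop(conn_id, None)
        if name is not None:
            del self._by_name[name]

    def lookup(self, name):
        """Return the distribution port for a name, or None if unknown."""
        entry = self._by_name.get(name)
        return entry[0] if entry else None
```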

If we were to implement epmd in Erlang with that scheme, the first Erlang node either has to survive for all of the host's lifespan or has to transfer the ownership of the open sockets (ALIVE-sockets) to "the next" node to take over the task of epmd. Note that these nodes may not be in the same cluster; epmd is bound to a machine, not an Erlang cluster. Erlang VMs participating in different Erlang clusters may exist on the same machine. This would be feasible if we had an *extra* Erlang node for port mapping, which of course could be a working solution.

To implement this in Erlang, using the already present distributed Erlang machines, would probably require a different mechanism for registering and unregistering nodes. Looking out for closed sockets will not do, as we will need to monitor nodes that have no connection to us (or they have to re-establish such a connection at least, which is not needed today). Also, a reliable takeover by nodes participating in different clusters could be implemented; it is in no way impossible, of course. You would also need to reopen the known port when taking over, so there will be a race, or rather a short time with no epmd listening. All clients have to handle that.
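
On the client side, handling that short window could look roughly like the sketch below: retry the connection a few times with a growing delay before giving up. The function name, attempt count and delays are arbitrary illustrative values, not anything erl_epmd does today.

```python
import socket
import time


def connect_with_retry(host, port, attempts=5, delay=0.1):
    """Try to connect, backing off exponentially between failed attempts."""
    last_error = None
    for attempt in range(attempts):
        try:
            return socket.create_connection((host, port), timeout=1.0)
        except OSError as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Nobody is listening (yet); wait a little longer each time.
                time.sleep(delay * (2 ** attempt))
    raise ConnectionError(
        "no listener on %s:%d after %d attempts" % (host, port, attempts)
    ) from last_error
```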

Implementing a simpler epmd for a machine with only one Erlang node is far easier and could be useful for small embedded systems. In that case we will not need to change the protocol. Usage will be limited, of course.

You could also rewrite epmd in Erlang and have an extra (non-distributed) Erlang machine resident in the system (after all, it would be more or less the same thing as having a C program resident). That would not require complicated takeover scenarios, but would increase the memory footprint slightly. An implementation in Erlang could cover both the single VM system and a solution with an extra Erlang machine, which would be nice.

> When hacking it, I've noticed several things:
>
> 1. When we send ALIVE2_REQ and reply with ALIVE2_RESP, we establish a TCP connection, the closing of which is a signal of node disconnection. This approach does have a point, since we can use keep-alive and periodically check that the node is still here on the TCP level. But next, something weird follows:

Note that these are local connections. Keep-alive has nothing to do with it. The loopback detects a close and informs immediately. Keep-alive detects network problems (badly) and is only useful when talking across a real network.


> 2. When we send other control messages from a node connected to epmd, we establish a new TCP connection each time. We could use the main connection instead. Was it a design decision, or is it just a legacy thing?

When you communicate with epmd after alive is sent, you establish a connection to the epmd *on the host you want to connect to*, which is only the same epmd as you used for registration if the Erlang node you want to talk to is on the same host as you yourself are. You are looking for a port on the particular machine that your remote Erlang machine resides on. Only in the local case could you reuse your connection, which would only add a special case with very little gain.
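
The point that a lookup goes to the epmd on the target host can be illustrated with the port query itself. The sketch assumes the layout from the ERTS distribution protocol documentation (PORT_PLEASE2_REQ is tag 122, PORT2_RESP is tag 119, and epmd listens on port 4369 by default); the helper names are made up for this example.

```python
import struct

EPMD_PORT = 4369          # default epmd port
PORT_PLEASE2_REQ = 122    # 'z'
PORT2_RESP = 119          # 'w'


def epmd_address_for(node):
    """Split 'name@host' into the short name and the epmd address to ask."""
    name, host = node.split("@", 1)
    return name, (host, EPMD_PORT)


def encode_port_please2_req(name):
    """Build the length-prefixed port query for a short node name."""
    body = bytes([PORT_PLEASE2_REQ]) + name.encode("latin-1")
    return struct.pack(">H", len(body)) + body


def decode_port2_resp(data):
    """Return the node's distribution port, or None if epmd reports an error."""
    tag, result = struct.unpack(">BB", data[:2])
    if tag != PORT2_RESP:
        raise ValueError("not a PORT2_RESP: tag %d" % tag)
    if result != 0:
        return None
    (port,) = struct.unpack(">H", data[2:4])
    return port
```

For `foo@host.example` the query is sent to `host.example:4369`, which is the registering node's own epmd only when both nodes share a host.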


> 3. The client (node) part of epmd seems to be all implemented in C and sealed inside ERTS. However, it seems like this code could be implemented inside the net_kernel module instead (or something similar).

erl_epmd is the module, and it's called by net_kernel. No epmd communication except the inet_driver itself is written in C on that side. The epmd daemon is of course written in C, but it's not part of the VM.


> Why bother switching to Erlang when everything is already written and working? First of all, sometimes it doesn't really work in big clusters (see my first link). And, secondly, using Erlang we can easily extend the protocol. For example, we could add an auto-discovery feature, which has been discussed on the list a lot; add the ability for a node to reconnect if its TCP session has been terminated for some reason; or add lookups of nodes by prefix (like, "give me all nodes that match mynode@*"). The list can probably be extended further.

I think a lot of this should be solved in the client, which is already written in Erlang. Rewriting the server might just add complexity, at least if you want to solve it in the already running distributed nodes, with takeover and whatnot.


> Do you think such a thing (with full backwards compatibility, of course) could go upstream? Also, a question for the epmd maintainers: is it going to change at all, or is the protocol considered complete enough for its purposes?

We have thought about a distributed epmd over the years, but have never considered it worth the effort, due to the takeover complexity etc. Portmapping is really basic functionality; you wouldn't want to mess that up. A separate Erlang machine would maybe be a solution, but as epmd is such a simple program, we have not really thought it worth the extra memory footprint.

So it would not be the easiest thing to convince us to take upstream, but given a well-thought-through solution, we could get rid of some maintenance. Erlang is after all far nicer to maintain than C... One could also make it possible to choose between different epmd solutions; that way we would cover the cases where people would not want an extra Erlang machine for portmapping. More elaborate things could then be experimented with in the Erlang-written epmd.

If you can isolate a bug or explain a malfunction in the current epmd, it would be a great contribution!



> --
> Best regards,
> Dmitry Demeshchuk

Cheers,
/Patrik

David Mercer

Nov 8, 2012, 2:57:54 PM
to Kenneth Lundin, Dmitry Demeshchuk, erlang-questions
On 7 nov 2012, Dmitry Demeshchuk wrote:

> 3. The client (node) part of epmd seems to be all implemented in C and
> sealed inside ERTS. However, it seems like this code could be implemented
> inside the net_kernel module instead (or something similar).

On Thursday, November 08, 2012, Kenneth Lundin wrote:

> The client part is already written in Erlang, see the erl_epmd module.

Just wanted to clarify which it is...

Cheers,

DBM