[erlang-questions] Slow node replication

John VanderPol

Sep 20, 2012, 11:01:37 AM
to erlang-q...@erlang.org
In an application that I manage, we are having trouble discovering all of the other Erlang nodes in a cluster. When our application starts, it immediately calls net_adm:ping/1 against a known node in order to discover the rest of the cluster. Although the initial ping succeeds, the other nodes are not "discovered" by the new node until as much as 10 minutes have gone by. By "discovered" I mean that until then, nodes() returns only the node that was pinged. For example, if nodes A and B are already running and we start node C, which pings node B, it takes a long time before node C discovers node A.
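To make the symptom concrete, here is a minimal diagnostic sketch, not code from our application. The module name discovery_check and the function check/1 are hypothetical; net_adm:ping/1, nodes/0 and rpc:call/4 are the standard calls. It pings the seed node and then reports which nodes the seed is connected to that the local node does not yet see.

```erlang
-module(discovery_check).
-export([check/1]).

%% Ping a seed node, then compare the local view of the cluster
%% with the seed's view. Module and function names are hypothetical.
check(Seed) ->
    case net_adm:ping(Seed) of
        pong ->
            Local = nodes(),
            %% Ask the seed which nodes it is connected to.
            case rpc:call(Seed, erlang, nodes, []) of
                {badrpc, Reason} ->
                    {error, {badrpc, Reason}};
                SeedView when is_list(SeedView) ->
                    %% Nodes the seed sees that we do not (excluding ourselves).
                    Missing = (SeedView -- [node()]) -- Local,
                    {ok, [{connected, Local}, {missing, Missing}]}
            end;
        pang ->
            {error, {unreachable, Seed}}
    end.
```

In the scenario above, calling discovery_check:check(B) on node C right after startup would be expected to list node A under missing until discovery completes.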

Some debugging notes:
When the application is first started on all of our nodes, this is not a problem and the nodes discover each other quickly. It only happens after the application has been running for a while.
All other node communication seems to be performing reasonably fast.
We are monitoring our applications with https://github.com/lethain/nagios_erlang, which is an Erlang plugin for Nagios. It simply starts up an Erlang node, pings all of our nodes to ensure they are up and running, and then shuts down. These test nodes end up in the known nodes list but are mostly never in the connected nodes list.
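The difference between the known and connected lists mentioned above can be inspected from a shell on any cluster node. This is a small sketch using erlang:nodes/1; the variable names are only illustrative.

```erlang
%% Run in an Erlang shell on a cluster node.
Known = nodes(known),          % connected nodes plus nodes merely referred to
Connected = nodes(connected),  % nodes with a live connection right now
%% Nodes that are known but not currently connected, excluding ourselves.
KnownOnly = Known -- [node() | Connected].
```

On our nodes, the short-lived Nagios test nodes would be expected to show up in KnownOnly.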

Some information about the environment:
Erlang release: R14B03
Number of nodes: ~40
OS: CentOS