Cluster node reconnects

Behrad Zari

Nov 4, 2014, 8:57:35 AM
to akka...@googlegroups.com
In my three-node cluster (Akka 2.3.6, Scala 2.10.4) with the config below:

cluster {
    seed-nodes = [
      "akka.tcp://a...@127.0.0.1:2552" // using one of the three as seed node
    ]
    auto-down-unreachable-after = 120s
  }

I `Ctrl+C` one of my nodes to simulate a crash/termination, and I see:

Remoting - Tried to associate with unreachable remote address [akka.tcp://a...@127.0.0.1:2553]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /127.0.0.1:2553

But when I restart the process, its attempt to join is ignored and the nodes cannot interoperate, and I continue to see the following messages:

Cluster Node [akka.tcp://a...@127.0.0.1:2552] - Existing member [UniqueAddress(akka.tcp://a...@127.0.0.1:2553,392261992)] is trying to join, ignoring
13:36:18.964UTC INFO [adp-akka.actor.default-dispatcher-2] Cluster(akka://adp) - Cluster Node [akka.tcp://a...@127.0.0.1:2552] - Marking node(s) as REACHABLE [Member(address = akka.tcp://a...@127.0.0.1:2553, status = Up)]
Cluster Node [akka.tcp://a...@127.0.0.1:2552] - Existing member [UniqueAddress(akka.tcp://a...@127.0.0.1:2553,392261992)] is trying to join, ignoring
Cluster Node [akka.tcp://a...@127.0.0.1:2552] - Existing member [UniqueAddress(akka.tcp://a...@127.0.0.1:2553,392261992)] is trying to join, ignoring
Cluster Node [akka.tcp://a...@127.0.0.1:2552] - Existing member [UniqueAddress(akka.tcp://a...@127.0.0.1:2553,392261992)] is trying to join, ignoring
...



I'd expect the cluster to reconnect after one of my nodes restarts :(
When I decrease "auto-down-unreachable-after", my crashed node is marked down by my seed node, so it is quarantined and won't be able to rejoin after startup until both nodes restart.
What is the correct pattern for per-node restarts in a clustered deployment!?

Björn Antonsson

Nov 4, 2014, 4:31:21 PM
to akka...@googlegroups.com
Hi,

So the way the cluster works currently is that the unreachable node has to be removed (by doing a down on it) before a system with the same address/port is allowed to join the cluster. If you have the auto-down set to a low value and wait with restarting the "crashed" node until you see the master setting it to DOWN, does it work then?
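
As a minimal sketch of the low auto-down value suggested above: the 10s figure is only an illustrative assumption, and the fragment assumes the settings live under the usual top-level akka section of application.conf.

```hocon
akka {
  cluster {
    # Automatically mark an unreachable member as DOWN after this timeout,
    # so that a restarted system on the same host:port can join again.
    # The value is an assumption for illustration; tune it for your deployment.
    auto-down-unreachable-after = 10s
  }
}
```

With a setting like this, the restart would have to wait until the old incarnation has been downed and removed, as described above.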

The thing that seems weird in your log is that 127.0.0.1:2552 suddenly marks the node as reachable again instead of just downing it. If the old node had been downed and removed correctly, then the new one with the same address/port should be allowed to connect. There might be an issue with the failure detector and a mismatch between addresses and unique addresses (address:port:uid).

Would it be possible for you to package up a minimal project that we can use to reproduce this?

B/
--
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akka-user+...@googlegroups.com.
To post to this group, send email to akka...@googlegroups.com.
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.

-- 
Björn Antonsson
Typesafe – Reactive Apps on the JVM
twitter: @bantonsson

richard

Nov 4, 2014, 6:22:53 PM
to akka...@googlegroups.com
I am seeing something similar with this github code, based on akka-datareplication, using Akka 2.3.6
(That might be a little too complex for a ticket)

Note that auto-down-unreachable-after is commented out

Started two instances, one on 2551 (the seed) and another on 1234.
Enter text into each instance, which is correctly replicated to each. 
Kill and restart the 1234 instance.

The new 1234 instance receives the current state (from 2551) and continues to
replicate in both directions!

The log on 2551 does indicate a problem:
[INFO] [11/04/2014 17:20:07.309] [ClusterSystem-akka.actor.default-dispatcher-20] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@localhost:2551] - Existing member [UniqueAddress(akka.tcp://ClusterSystem@localhost:1234,1772853420)] is trying to join, ignoring
[INFO] [11/04/2014 17:20:17.319] [ClusterSystem-akka.actor.default-dispatcher-17] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@localhost:2551] - Existing member [UniqueAddress(akka.tcp://ClusterSystem@localhost:1234,1772853420)] is trying to join, ignoring
[INFO] [11/04/2014 17:20:28.310] [ClusterSystem-akka.actor.default-dispatcher-3] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@localhost:2551] - Existing member [UniqueAddress(akka.tcp://ClusterSystem@localhost:1234,1772853420)] is trying to join, ignoring



richard

Nov 4, 2014, 6:28:13 PM
to akka...@googlegroups.com
There are curious log entries:

[WARN] [11/04/2014 17:11:52.074] [ClusterSystem-akka.actor.default-dispatcher-24] [akka.tcp://ClusterSystem@localhost:2551/system/cluster/core/daemon] Cluster Node [akka.tcp://ClusterSystem@localhost:2551] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://ClusterSystem@localhost:1234, status = Up)]

[WARN] [11/04/2014 17:11:53.099] [ClusterSystem-akka.remote.default-remote-dispatcher-5] [akka.tcp://ClusterSystem@localhost:2551/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%40localhost%3A1234-0/endpointWriter] AssociationError [akka.tcp://ClusterSystem@localhost:2551] -> [akka.tcp://ClusterSystem@localhost:1234]: Error [Invalid address: akka.tcp://ClusterSystem@localhost:1234] [
akka.remote.InvalidAssociation: Invalid address: akka.tcp://ClusterSystem@localhost:1234
Caused by: akka.remote.transport.Transport$InvalidAssociationException: Connection refused: localhost/127.0.0.1:1234
]
[WARN] [11/04/2014 17:11:53.102] [ClusterSystem-akka.remote.default-remote-dispatcher-23] [Remoting] Tried to associate with unreachable remote address [akka.tcp://ClusterSystem@localhost:1234]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: localhost/127.0.0.1:1234
[INFO] [11/04/2014 17:11:53.461] [ClusterSystem-akka.actor.default-dispatcher-4] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@localhost:2551] - Existing member [UniqueAddress(akka.tcp://ClusterSystem@localhost:1234,1772853420)] is trying to join, ignoring
...



Behrad Zari

Nov 5, 2014, 12:37:25 AM
to akka...@googlegroups.com


On Wednesday, November 5, 2014 1:01:21 AM UTC+3:30, Björn Antonsson wrote:
Hi,

So the way the cluster works currently is that the unreachable node has to be removed (by doing a down on it) before a system with the same address/port is allowed to join the cluster. If you have the auto-down set to a low value and wait with restarting the "crashed" node until you see the master setting it to DOWN, does it work then?



No, it doesn't. Even if the master marks the crashed node as down and removes it, when the crashed node is restarted it complains that it is quarantined by the remote system and that this system should be restarted!!!!! :(



The thing that seems weird in your log is that 127.0.0.1:2552 suddenly marks the node as reachable again instead of just downing it. If the old node had been downed and removed correctly, then the new one with the same address/port should be allowed to connect. There might be an issue with the failure detector and a mismatch between addresses and unique addresses (address:port:uid).

Would it be possible for you to package up a minimal project that we can use to reproduce this?

I am using Akka's Bootable, with one ClusterListener implementation in each process. (I'm not sure I'm right, but I seem to remember that my tests of this same case were working earlier in my project; however, I can't trace back my changes to see what caused this!) I'll make an empty project.



Björn Antonsson

Nov 5, 2014, 3:29:49 AM
to akka...@googlegroups.com
Hi Richard,

On 5 November 2014 at 00:22:55, richard (harold.ric...@gmail.com) wrote:

I am seeing something similar with this github code, based on akka-datareplication, using Akka 2.3.6
(That might be a little too complex for a ticket)

Note that auto-down-unreachable-after is commented out


If the old node is never downed and removed from the cluster, then the new node can never join.

B/



Behrad

Nov 5, 2014, 3:53:03 AM
to akka...@googlegroups.com
2014-11-05 11:59 GMT+03:30 Björn Antonsson <bjorn.a...@typesafe.com>:
If the old node is never downed and removed from the cluster, then the new node can never join.

Does this mean we should always set auto-down to a small value so that we can recover from (and reconnect after) cluster node crashes? What is the "unreachable" -> "reachable" state change for, then!? I'd expect a node that went to the unreachable state to become reachable again if it comes back up within the failure detection threshold.

It also isn't happening for me, in both cases.





--
--Behrad

Björn Antonsson

Nov 5, 2014, 4:00:45 AM
to akka...@googlegroups.com
Hi Behrad,

Whether you want the nodes to be downed automatically is a separate issue from reachability. The reachable/unreachable states are for a node instance that experiences connection failures (network outages etc.) but not restarts, while downing is necessary when a new node with the same address/port as the old one is joining (in effect a restarted actor system).
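
To make the distinction concrete, here is a small, simplified model of these rules in plain Scala. It is only a sketch and not Akka's implementation: the names MembershipSketch, ClusterView, join, markUnreachable, markReachable and down are hypothetical, and the real cluster removes a downed member through its leader rather than in a single step.

```scala
// Simplified model of the membership rules described in this thread.
// NOT Akka's implementation; all names here are hypothetical.
object MembershipSketch {
  // A node incarnation: host:port plus a uid that changes on every restart.
  final case class UniqueAddress(hostPort: String, uid: Int)

  // Cluster view: members keyed by host:port, plus those flagged unreachable.
  final case class ClusterView(
      members: Map[String, UniqueAddress] = Map.empty,
      unreachable: Set[String] = Set.empty)

  // A join is ignored while any incarnation with the same host:port is a member.
  def join(view: ClusterView, node: UniqueAddress): ClusterView =
    if (view.members.contains(node.hostPort)) view
    else view.copy(members = view.members + (node.hostPort -> node))

  // Contact lost: the member stays in the cluster but is flagged unreachable.
  def markUnreachable(view: ClusterView, hostPort: String): ClusterView =
    if (view.members.contains(hostPort))
      view.copy(unreachable = view.unreachable + hostPort)
    else view

  // Contact restored with the SAME incarnation, e.g. after a network outage.
  def markReachable(view: ClusterView, node: UniqueAddress): ClusterView =
    if (view.members.get(node.hostPort) == Some(node))
      view.copy(unreachable = view.unreachable - node.hostPort)
    else view

  // Downing removes the old incarnation, freeing host:port for a restarted system.
  def down(view: ClusterView, hostPort: String): ClusterView =
    ClusterView(view.members - hostPort, view.unreachable - hostPort)
}
```

In this model a restarted system (same host:port, new uid) is ignored by join until down has been applied to the old incarnation, whereas markReachable only ever helps the original incarnation.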

B/

Patrik Nordwall

Nov 5, 2014, 4:20:31 AM
to akka...@googlegroups.com
I think I understand what is going on and what we can consider to improve.

The heartbeat messages don't include the system uid, and therefore the restarted system starts responding to heartbeat messages that are targeted at the old incarnation. Then the cluster marks it as reachable again before the auto-down takes effect, i.e. it is never removed from the cluster. The new system tries to join, but that is not possible because the cluster already contains the same host:port.

I think this is best solved by including the system uid in the heartbeat messages, but that increases the payload size of these messages.

An issue ticket would be good.

Regards,
Patrik

Patrik Nordwall
Typesafe Reactive apps on the JVM
Twitter: @patriknw

Björn Antonsson

Nov 5, 2014, 4:46:31 AM
to akka...@googlegroups.com
Thanks for confirming my suspicion, Patrik. A ticket has been created: https://github.com/akka/akka/issues/16224

B/

Patrik Nordwall

Nov 10, 2014, 9:18:21 AM
to akka...@googlegroups.com
Hi again,

My hypothesis of why the node was marked as REACHABLE was wrong. The cluster heartbeat replies include the UID, and replies from the wrong incarnation are ignored.
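
The idea of ignoring replies from the wrong incarnation can be illustrated with a tiny plain-Scala sketch. This is an assumption-laden model, not the actual Akka code: HeartbeatSketch, HeartbeatReply and accepts are hypothetical names chosen for illustration.

```scala
// Illustrative model only; names are hypothetical, not Akka's internals.
object HeartbeatSketch {
  // host:port plus the uid of one specific actor system incarnation.
  final case class UniqueAddress(hostPort: String, uid: Int)

  // A heartbeat reply that carries the sender's full unique address.
  final case class HeartbeatReply(from: UniqueAddress)

  // Accept a reply only if it comes from exactly the incarnation being monitored;
  // a restarted system on the same host:port has a different uid and is ignored.
  def accepts(monitored: Set[UniqueAddress], reply: HeartbeatReply): Boolean =
    monitored.contains(reply.from)
}
```

Because the comparison includes the uid, a restarted system answering on the same host:port would not make the old member look reachable again in this model.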

I have created a test that simulates the scenario as I have understood it. It behaves as expected, i.e. the restarted node can join after a while when the old incarnation has been removed from the cluster.

Behrad, do you have a sample that we can use to reproduce the issue?

Regards,
Patrik

Behrad Zari

Nov 10, 2014, 11:43:14 AM
to akka...@googlegroups.com
The funny and bad thing is that when I tested my code today, it was working!!! (As I said in my previous post, it was also working at the start, but lately I couldn't get it working.)
I'm confused, since I haven't changed anything related to this :( So am I missing some change of mine, or could it depend on
1) bad termination of previous sbt runs during development!? (So how could the remoting port be opened if it had not been released?)
2) anything related to network/configuration that leads to that misbehavior...!?
Hmm?

My concern is two-fold:

1) I'm really eager to reproduce that, and will push a test case if I find one.

2) There are still unclear points for me in the Akka clustering philosophy:
I saw node B rejoin my seed node A after B restarted today, but B did not become aware of it when seed node A restarted!!! Why is that happening?
Here is the conf of both nodes:

remote {
    log-remote-lifecycle-events = off
    netty.tcp {
      hostname = "127.0.0.1"
      port = 2552
    }

    transport-failure-detector {
      heartbeat-interval = 30s
      acceptable-heartbeat-pause = 35s
    }

  }

  cluster {
    seed-nodes = [
      "akka.tcp://a...@127.0.0.1:2552"
    ]
    auto-down-unreachable-after = 10s
  }

P.S. Can we continue this topic on the GitHub issue page? That feels more comfortable for me :)

Patrik Nordwall

Nov 10, 2014, 12:58:01 PM
to akka...@googlegroups.com
Yes, let's continue there.
Describe step-by-step what you do, and supply log files.
Thanks.

Please remove the settings for the transport-failure-detector. They should not influence this, but I prefer that we debug this with default settings as much as possible.
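
With the transport-failure-detector block dropped as requested, the remote section of the posted conf would reduce to roughly the following. The enclosing akka block is an assumption, since the posted fragment omits it; everything else is copied from the conf in this thread.

```hocon
akka {
  remote {
    log-remote-lifecycle-events = off
    netty.tcp {
      hostname = "127.0.0.1"
      port = 2552
    }
    # No transport-failure-detector block here, so the library defaults apply.
  }
}
```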

/Patrik

richard

Nov 11, 2014, 10:42:02 AM
to akka...@googlegroups.com
Agreed.

However, my point was that the two nodes were still exchanging data, even though one was not permitted to join the cluster.

Please see the earlier comment, which described that interaction.