Hi all.
There are two hosts on EC2: H1 and H2. There are three modules in the sbt project: master, client and worker. Each module implements an akka-cluster node, which subscribes to cluster events and logs them. Each node also logs the cluster state every minute (for debugging). The cluster nodes use the following ports: master: 2551, worker: 3000, client: 5000.
The project is available on GitHub.
More details about the infrastructure are in my previous question.
A module can be redeployed on either H1 or H2, chosen at random.
The akka-cluster behaves strangely when one of the nodes (for example worker) is redeployed. The following steps illustrate the deployment history.
The initial state: worker is deployed on H1, and master and client are deployed on H2.
----[state-of-deploying-0]---
H1 = [worker]
H2 = [master, client]
cluster status: // cluster works correctly
Member(address = akka.tcp://ClusterSystem@H1:3000, status = Up)
Member(address = akka.tcp://ClusterSystem@H2:2551, status = Up)
Member(address = akka.tcp://ClusterSystem@H2:5000, status = Up)
----------------
After that, the worker module was redeployed on host H2:
----[state-of-deploying-1]---
H1 = [-]
H2 = [master, client, worker (Redeployed)]
cluster status: // WRONG cluster state!
Member(address = akka.tcp://ClusterSystem@H1:3000, status = Up) // ???
Member(address = akka.tcp://ClusterSystem@H2:2551, status = Up)
Member(address = akka.tcp://ClusterSystem@H2:3000, status = WeaklyUp)
Member(address = akka.tcp://ClusterSystem@H2:5000, status = Up)
----------------
The above situation happens occasionally. In this case the cluster holds a wrong membership state and never repairs it:
Member(address = akka.tcp://ClusterSystem@H1:3000, status = Up) // ???
Host H1 doesn't run any instance of worker, and running telnet H1 3000 returns "connection refused". But why does the akka-cluster keep storing this wrong info?
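To make the mismatch concrete, here is a minimal sketch in plain Scala (no Akka dependency) that compares the logged membership with what is actually deployed on each host. The names `StaleMemberCheck`, `Member`, `portOf` and `staleMembers` are hypothetical and are not part of the project; the data is copied from state-of-deploying-0 and state-of-deploying-1 above.

```scala
object StaleMemberCheck {
  // Hypothetical model of one line of the logged cluster state.
  final case class Member(host: String, port: Int, status: String)

  // Ports used by each module, as listed at the top of the question.
  val portOf: Map[String, Int] =
    Map("master" -> 2551, "worker" -> 3000, "client" -> 5000)

  // Returns the members whose (host, port) pair is not backed by any
  // module that is actually deployed on that host.
  def staleMembers(members: Seq[Member],
                   deployed: Map[String, Set[String]]): Seq[Member] = {
    val live: Set[(String, Int)] = deployed.toSeq.flatMap {
      case (host, modules) => modules.map(module => (host, portOf(module)))
    }.toSet
    members.filterNot(m => live.contains((m.host, m.port)))
  }

  // state-of-deploying-0: worker on H1, master and client on H2.
  val state0Members: Seq[Member] = Seq(
    Member("H1", 3000, "Up"),
    Member("H2", 2551, "Up"),
    Member("H2", 5000, "Up")
  )
  val state0Deployed: Map[String, Set[String]] =
    Map("H1" -> Set("worker"), "H2" -> Set("master", "client"))

  // state-of-deploying-1: worker redeployed on H2, nothing left on H1.
  val state1Members: Seq[Member] = Seq(
    Member("H1", 3000, "Up"),
    Member("H2", 2551, "Up"),
    Member("H2", 3000, "WeaklyUp"),
    Member("H2", 5000, "Up")
  )
  val state1Deployed: Map[String, Set[String]] =
    Map("H1" -> Set.empty[String], "H2" -> Set("master", "client", "worker"))

  def main(args: Array[String]): Unit = {
    // Prints the single entry that no longer matches the deployment.
    staleMembers(state1Members, state1Deployed).foreach(println)
  }
}
```

For state-of-deploying-1 the only entry reported is the H1:3000 member, which is exactly the one I would expect the cluster to drop.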
---
This question is cross-posted from https://stackoverflow.com/questions/48924863