Hello.
We are having trouble with MDC (multi-datacenter) replication.
CentOS 7, your official RPM package, two clusters. The first one (with data):
[System Confiuration]
-----------------------------------+----------
Item | Value
-----------------------------------+----------
Basic/Consistency level
-----------------------------------+----------
system version | 1.4.3
cluster Id | devtest_1
DC Id | dev_1
Total replicas | 2
number of successes of R | 1
number of successes of W | 2
number of successes of D | 2
number of rack-awareness replicas | 0
ring size | 2^128
-----------------------------------+----------
Multi DC replication settings
-----------------------------------+----------
[mdcr] max number of joinable DCs | 2
[mdcr] total replicas per a DC | 1
[mdcr] number of successes of R | 1
[mdcr] number of successes of W | 1
[mdcr] number of successes of D | 1
-----------------------------------+----------
Manager RING hash
-----------------------------------+----------
current ring-hash | c53a2d22
previous ring-hash | c53a2d22
-----------------------------------+----------
[State of Node(s)]
-------+------------------------------------------------------+--------------+---------+----------------+----------------+----------------------------
type | node | state | rack id | current ring | prev ring | updated at
-------+------------------------------------------------------+--------------+---------+----------------+----------------+----------------------------
S | storag...@files-s0.dev.local | running | | c53a2d22 | c53a2d22 | 2019-08-09 11:19:17 +0300
S | storag...@files-s1.dev.local | running | | c53a2d22 | c53a2d22 | 2019-08-09 11:19:19 +0300
S | storag...@files-s2.dev.local | running | | c53a2d22 | c53a2d22 | 2019-08-09 11:19:25 +0300
G | gatewa...@files-g0.dev.local | running | | c53a2d22 | c53a2d22 | 2019-08-09 11:19:28 +0300
-------+------------------------------------------------------+--------------+---------+----------------+----------------+----------------------------
The second one (empty):
[root@files-master ~]# leofs-adm status
[System Confiuration]
-----------------------------------+----------
Item | Value
-----------------------------------+----------
Basic/Consistency level
-----------------------------------+----------
system version | 1.4.3
cluster Id | files_1
DC Id | vk_1
Total replicas | 3
number of successes of R | 1
number of successes of W | 2
number of successes of D | 2
number of rack-awareness replicas | 0
ring size | 2^128
-----------------------------------+----------
Multi DC replication settings
-----------------------------------+----------
[mdcr] max number of joinable DCs | 2
[mdcr] total replicas per a DC | 1
[mdcr] number of successes of R | 1
[mdcr] number of successes of W | 1
[mdcr] number of successes of D | 1
-----------------------------------+----------
Manager RING hash
-----------------------------------+----------
current ring-hash | c7b850ba
previous ring-hash | c7b850ba
-----------------------------------+----------
[State of Node(s)]
-------+----------------------------------------+--------------+---------+----------------+----------------+----------------------------
type | node | state | rack id | current ring | prev ring | updated at
-------+----------------------------------------+--------------+---------+----------------+----------------+----------------------------
S | fi...@s01.vk1.local | running | | c7b850ba | c7b850ba | 2019-08-09 11:35:35 +0300
S | fi...@s02.vk1.local | running | | c7b850ba | c7b850ba | 2019-08-09 11:35:35 +0300
S | fi...@s03.vk1.local | running | | c7b850ba | c7b850ba | 2019-08-09 11:35:35 +0300
S | fi...@s04.vk1.local | running | | c7b850ba | c7b850ba | 2019-08-09 11:35:35 +0300
-------+----------------------------------------+--------------+---------+----------------+----------------+----------------------------
We join the empty second cluster:
leofs-adm join-cluster mana...@files-master.vk1.local:13075 mana...@files-slave.vk1.local:13076
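For reference, the joined DC should afterwards be listed by the documented cluster-status subcommand; we show only the check we mean, not output from our system:
[root@files-master ~]# leofs-adm cluster-status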
After the join the first cluster becomes sluggish and "whereis" runs very slowly.
On the second cluster, both the master and the slave managers log errors. On the master:
[E] mana...@files-master.vk1.local 2019-08-11 20:31:08.416750 +0300 1565544668 null:null 0 Supervisor leo_rpc_client_manager_1_at_13075_sup had child leo_rpc_client_manager_1_at_13075 started with leo_pod_manager:start_link(leo_rpc_client_manager_1_at_13075, 16, 16, leo_rpc_client_conn, [manager_1,"files-m1.dev.local",13075,0], #Fun<leo_rpc_client_sup.0.73440599>) at undefined exit with reason {connection_error,{connection_error,econnrefused}} in context start_error
[W] mana...@files-master.vk1.local 2019-08-11 20:31:08.416872 +0300 1565544668 null:null 0 {module,"leo_rpc_client_sup"},{function,"start_child/3"},{line,106},{file,{{shutdown,{failed_to_start_child,leo_rpc_client_manager_1_at_13075,{connection_error,{connection_error,econnrefused}}}},{child,undefined,leo_rpc_client_manager_1_at_13075,{leo_pod_sup,start_link,[leo_rpc_client_manager_1_at_13075,16,16,leo_rpc_client_conn,[manager_1,"files-m1.dev.local",13075,0],#Fun<leo_rpc_client_sup.0.73440599>]},permanent,10000,supervisor,[leo_pod_sup]}}}
[E] mana...@files-master.vk1.local 2019-08-11 20:31:08.419887 +0300 1565544668 null:null 0 CRASH REPORT Process <0.2537.0> with 1 neighbours exited with reason: {connection_error,{connection_error,econnrefused}} in gen_server:init_it/6 line 344
[E] mana...@files-master.vk1.local 2019-08-11 20:31:08.420365 +0300 1565544668 null:null 0 Supervisor leo_rpc_client_manager_1_at_13075_sup had child leo_rpc_client_manager_1_at_13075 started with leo_pod_manager:start_link(leo_rpc_client_manager_1_at_13075, 16, 16, leo_rpc_client_conn, [manager_1,"files-m1.dev.local",13075,0], #Fun<leo_rpc_client_sup.0.73440599>) at undefined exit with reason {connection_error,{connection_error,econnrefused}} in context start_error
[W] mana...@files-master.vk1.local 2019-08-11 20:31:08.420540 +0300 1565544668 null:null 0 {module,"leo_rpc_client_sup"},{function,"start_child/3"},{line,106},{file,{{shutdown,{failed_to_start_child,leo_rpc_client_manager_1_at_13075,{connection_error,{connection_error,econnrefused}}}},{child,undefined,leo_rpc_client_manager_1_at_13075,{leo_pod_sup,start_link,[leo_rpc_client_manager_1_at_13075,16,16,leo_rpc_client_conn,[manager_1,"files-m1.dev.local",13075,0],#Fun<leo_rpc_client_sup.0.73440599>]},permanent,10000,supervisor,[leo_pod_sup]}}}
[E] mana...@files-master.vk1.local 2019-08-11 20:31:09.838267 +0300 1565544669 gen_server:call 0 gen_server leo_rpc_client_manager terminated with reason: no such process or port in call to gen_server:call(<0.2536.0>, raw_status) in gen_server:call/2 line 204
[E] mana...@files-master.vk1.local 2019-08-11 20:31:09.839009 +0300 1565544669 gen_server:call 0 CRASH REPORT Process leo_rpc_client_manager with 0 neighbours exited with reason: no such process or port in call to gen_server:call(<0.2536.0>, raw_status) in gen_server:terminate/7 line 812
[E] mana...@files-master.vk1.local 2019-08-11 20:31:09.839225 +0300 1565544669 gen_server:call 0 Supervisor leo_rpc_client_sup had child leo_rpc_client_manager started with leo_rpc_client_manager:start_link(5000) at <0.802.0> exit with reason no such process or port in call to gen_server:call(<0.2536.0>, raw_status) in context child_terminated
[E] mana...@files-master.vk1.local 2019-08-11 20:31:14.840035 +0300 1565544674 gen_server:call 0 gen_server leo_rpc_client_manager terminated with reason: no such process or port in call to gen_server:call(<0.2536.0>, raw_status) in gen_server:call/2 line 204
[E] mana...@files-master.vk1.local 2019-08-11 20:31:14.840946 +0300 1565544674 gen_server:call 0 CRASH REPORT Process leo_rpc_client_manager with 0 neighbours exited with reason: no such process or port in call to gen_server:call(<0.2536.0>, raw_status) in gen_server:terminate/7 line 812
and on the slave:
[E] mana...@files-slave.vk1.local 2019-08-11 20:31:19.956220 +0300 1565544679 null:null 0 CRASH REPORT Process <0.2961.0> with 1 neighbours exited with reason: {connection_error,{connection_error,econnrefused}} in gen_server:init_it/6 line 344
[E] mana...@files-slave.vk1.local 2019-08-11 20:31:19.958321 +0300 1565544679 null:null 0 Supervisor leo_rpc_client_manager_0_at_13076_sup had child leo_rpc_client_manager_0_at_13076 started with leo_pod_manager:start_link(leo_rpc_client_manager_0_at_13076, 16, 16, leo_rpc_client_conn, [manager_0,"files-m0.dev.local",13076,0], #Fun<leo_rpc_client_sup.0.73440599>) at undefined exit with reason {connection_error,{connection_error,econnrefused}} in context start_error
[W] mana...@files-slave.vk1.local 2019-08-11 20:31:19.958621 +0300 1565544679 null:null 0 {module,"leo_rpc_client_sup"},{function,"start_child/3"},{line,106},{file,{{shutdown,{failed_to_start_child,leo_rpc_client_manager_0_at_13076,{connection_error,{connection_error,econnrefused}}}},{child,undefined,leo_rpc_client_manager_0_at_13076,{leo_pod_sup,start_link,[leo_rpc_client_manager_0_at_13076,16,16,leo_rpc_client_conn,[manager_0,"files-m0.dev.local",13076,0],#Fun<leo_rpc_client_sup.0.73440599>]},permanent,10000,supervisor,[leo_pod_sup]}}}
[E] mana...@files-slave.vk1.local 2019-08-11 20:31:20.163643 +0300 1565544680 gen_server:call 0 gen_server leo_rpc_client_manager terminated with reason: no such process or port in call to gen_server:call(<0.2960.0>, raw_status) in gen_server:call/2 line 204
[E] mana...@files-slave.vk1.local 2019-08-11 20:31:20.164152 +0300 1565544680 gen_server:call 0 CRASH REPORT Process leo_rpc_client_manager with 0 neighbours exited with reason: no such process or port in call to gen_server:call(<0.2960.0>, raw_status) in gen_server:terminate/7 line 812
[E] mana...@files-slave.vk1.local 2019-08-11 20:31:20.164515 +0300 1565544680 gen_server:call 0 Supervisor leo_rpc_client_sup had child leo_rpc_client_manager started with leo_rpc_client_manager:start_link(5000) at <0.802.0> exit with reason no such process or port in call to gen_server:call(<0.2960.0>, raw_status) in context child_terminated
[E] mana...@files-slave.vk1.local 2019-08-11 20:31:25.165433 +0300 1565544685 gen_server:call 0 gen_server leo_rpc_client_manager terminated with reason: no such process or port in call to gen_server:call(<0.2960.0>, raw_status) in gen_server:call/2 line 204
[E] mana...@files-slave.vk1.local 2019-08-11 20:31:25.167107 +0300 1565544685 gen_server:call 0 CRASH REPORT Process leo_rpc_client_manager with 0 neighbours exited with reason: no such process or port in call to gen_server:call(<0.2960.0>, raw_status) in gen_server:terminate/7 line 812
[E] mana...@files-slave.vk1.local 2019-08-11 20:31:25.167537 +0300 1565544685 gen_server:call 0 Supervisor leo_rpc_client_sup had child leo_rpc_client_manager started with leo_rpc_client_manager:start_link(5000) at <0.2963.0> exit with reason no such process or port in call to gen_server:call(<0.2960.0>, raw_status) in context child_terminated
[E] mana...@files-slave.vk1.local 2019-08-11 20:31:30.167433 +0300 1565544690 gen_server:call 0 gen_server leo_rpc_client_manager terminated with reason: no such process or port in call to gen_server:call(<0.2960.0>, raw_status) in gen_server:call/2 line 204
[E] mana...@files-slave.vk1.local 2019-08-11 20:31:30.169010 +0300 1565544690 gen_server:call 0 CRASH REPORT Process leo_rpc_client_manager with 0 neighbours exited with reason: no such process or port in call to gen_server:call(<0.2960.0>, raw_status) in gen_server:terminate/7 line 812
[E] mana...@files-slave.vk1.local 2019-08-11 20:31:30.169421 +0300 1565544690 gen_server:call 0 Supervisor leo_rpc_client_sup had child leo_rpc_client_manager started with leo_rpc_client_manager:start_link(5000) at <0.3005.0> exit with reason no such process or port in call to gen_server:call(<0.2960.0>, raw_status) in context child_terminated
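Note that econnrefused in the supervisor reports means the TCP connect to the remote manager RPC port is being refused outright, i.e. nothing is listening there or a firewall rejects it. The kind of basic reachability check we mean (plain nc/ss, nothing LeoFS-specific; hosts and ports copied from the log lines above):
# from the second cluster's managers, towards the first cluster:
nc -zv files-m1.dev.local 13075
nc -zv files-m0.dev.local 13076
# on the first cluster's managers, what the Erlang VM actually listens on:
ss -tlnp | grep beam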
Replication does not start. If we change rpc.server.listen_port on the slaves to 13075, the errors go away, but replication still does not start (all configs, logs, and statuses for that case can be found in the attachment). After some time "whereis" returns nothing.
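For context, a sketch of the manager settings involved (key names are the ones we recall from the stock leo_manager.conf of 1.4.x, with rpc.server.listen_port being the value we changed in the experiment above; not a verbatim copy of our files, those are in the attachment):
## leo_manager.conf fragment (sketch)
## RPC server port of this manager (13075 on masters, 13076 on slaves by default)
rpc.server.listen_port = 13076
## maximum number of joinable DCs
mdc_replication.max_targets = 2
## number of replicas per DC
mdc_replication.num_of_replicas_a_dc = 1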
We would appreciate any advice.
--
Alexey Kurnosov