MultiDC replication issues

Alex K

Aug 12, 2019, 5:05:17 AM
to LeoProject.LeoFS
Hello.

We are having trouble with MDC replication.
CentOS 7, your official RPM package, two clusters. The first (holding data):
 [System Confiuration]
-----------------------------------+----------
 Item                              | Value    
-----------------------------------+----------
 Basic/Consistency level
-----------------------------------+----------
                    system version | 1.4.3
                        cluster Id | devtest_1
                             DC Id | dev_1
                    Total replicas | 2
          number of successes of R | 1
          number of successes of W | 2
          number of successes of D | 2
 number of rack-awareness replicas | 0
                         ring size | 2^128
-----------------------------------+----------
 Multi DC replication settings
-----------------------------------+----------
 [mdcr] max number of joinable DCs | 2
 [mdcr] total replicas per a DC    | 1
 [mdcr] number of successes of R   | 1
 [mdcr] number of successes of W   | 1
 [mdcr] number of successes of D   | 1
-----------------------------------+----------
 Manager RING hash
-----------------------------------+----------
                 current ring-hash | c53a2d22
                previous ring-hash | c53a2d22
-----------------------------------+----------
 [State of Node(s)]
-------+------------------------------------------------------+--------------+---------+----------------+----------------+----------------------------
 type  |                         node                         |    state     | rack id |  current ring  |   prev ring    |          updated at         
-------+------------------------------------------------------+--------------+---------+----------------+----------------+----------------------------
  S    | storag...@files-s0.dev.local      | running      |         | c53a2d22       | c53a2d22       | 2019-08-09 11:19:17 +0300
  S    | storag...@files-s1.dev.local      | running      |         | c53a2d22       | c53a2d22       | 2019-08-09 11:19:19 +0300
  S    | storag...@files-s2.dev.local      | running      |         | c53a2d22       | c53a2d22       | 2019-08-09 11:19:25 +0300
  G    | gatewa...@files-g0.dev.local      | running      |         | c53a2d22       | c53a2d22       | 2019-08-09 11:19:28 +0300
-------+------------------------------------------------------+--------------+---------+----------------+----------------+----------------------------

The second cluster (empty):

[root@files-master ~]# leofs-adm status
 [System Confiuration]
-----------------------------------+----------
 Item                              | Value    
-----------------------------------+----------
 Basic/Consistency level
-----------------------------------+----------
                    system version | 1.4.3
                        cluster Id | files_1
                             DC Id | vk_1
                    Total replicas | 3
          number of successes of R | 1
          number of successes of W | 2
          number of successes of D | 2
 number of rack-awareness replicas | 0
                         ring size | 2^128
-----------------------------------+----------
 Multi DC replication settings
-----------------------------------+----------
 [mdcr] max number of joinable DCs | 2
 [mdcr] total replicas per a DC    | 1
 [mdcr] number of successes of R   | 1
 [mdcr] number of successes of W   | 1
 [mdcr] number of successes of D   | 1
-----------------------------------+----------
 Manager RING hash
-----------------------------------+----------
                 current ring-hash | c7b850ba
                previous ring-hash | c7b850ba
-----------------------------------+----------
 [State of Node(s)]
-------+----------------------------------------+--------------+---------+----------------+----------------+----------------------------
 type  |                  node                  |    state     | rack id |  current ring  |   prev ring    |          updated at         
-------+----------------------------------------+--------------+---------+----------------+----------------+----------------------------
  S    | fi...@s01.vk1.local      | running      |         | c7b850ba       | c7b850ba       | 2019-08-09 11:35:35 +0300
  S    | fi...@s02.vk1.local      | running      |         | c7b850ba       | c7b850ba       | 2019-08-09 11:35:35 +0300
  S    | fi...@s03.vk1.local      | running      |         | c7b850ba       | c7b850ba       | 2019-08-09 11:35:35 +0300
  S    | fi...@s04.vk1.local      | running      |         | c7b850ba       | c7b850ba       | 2019-08-09 11:35:35 +0300
-------+----------------------------------------+--------------+---------+----------------+----------------+----------------------------


Joining to the empty second cluster:
leofs-adm join-cluster mana...@files-master.vk1.local:13075 mana...@files-slave.vk1.local:13076
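
(For reference, to check the link afterwards, assuming this leofs-adm build provides the cluster-status command, it should list the remote cluster once the join has taken effect:)

leofs-adm cluster-status
# expected: a row for the remote cluster id with its state; in our setup that is devtest_1 on one side and files_1 on the other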

After the join the first cluster becomes sluggish and "whereis" runs very slowly.
On the second cluster there are errors on the master and the slave like:

[E]     mana...@files-master.vk1.local  2019-08-11 20:31:08.416750 +0300        1565544668      null:null       0       Supervisor leo_rpc_client_manager_1_at_13075_sup had child leo_rpc_client_manager_1_at_13075 started with leo_pod_manager:start_link(leo_rpc_client_manager_1_at_13075, 16, 16, leo_rpc_client_conn, [manager_1,"files-m1.dev.local",13075,0], #Fun<leo_rpc_client_sup.0.73440599>) at undefined exit with reason {connection_error,{connection_error,econnrefused}} in context start_error
[W]     mana...@files-master.vk1.local  2019-08-11 20:31:08.416872 +0300        1565544668      null:null       0       {module,"leo_rpc_client_sup"},{function,"start_child/3"},{line,106},{file,{{shutdown,{failed_to_start_child,leo_rpc_client_manager_1_at_13075,{connection_error,{connection_error,econnrefused}}}},{child,undefined,leo_rpc_client_manager_1_at_13075,{leo_pod_sup,start_link,[leo_rpc_client_manager_1_at_13075,16,16,leo_rpc_client_conn,[manager_1,"files-m1.dev.local",13075,0],#Fun<leo_rpc_client_sup.0.73440599>]},permanent,10000,supervisor,[leo_pod_sup]}}}
[E]     mana...@files-master.vk1.local  2019-08-11 20:31:08.419887 +0300        1565544668      null:null       0       CRASH REPORT Process <0.2537.0> with 1 neighbours exited with reason: {connection_error,{connection_error,econnrefused}} in gen_server:init_it/6 line 344
[E]     mana...@files-master.vk1.local  2019-08-11 20:31:08.420365 +0300        1565544668      null:null       0       Supervisor leo_rpc_client_manager_1_at_13075_sup had child leo_rpc_client_manager_1_at_13075 started with leo_pod_manager:start_link(leo_rpc_client_manager_1_at_13075, 16, 16, leo_rpc_client_conn, [manager_1,"files-m1.dev.local",13075,0], #Fun<leo_rpc_client_sup.0.73440599>) at undefined exit with reason {connection_error,{connection_error,econnrefused}} in context start_error
[W]     mana...@files-master.vk1.local  2019-08-11 20:31:08.420540 +0300        1565544668      null:null       0       {module,"leo_rpc_client_sup"},{function,"start_child/3"},{line,106},{file,{{shutdown,{failed_to_start_child,leo_rpc_client_manager_1_at_13075,{connection_error,{connection_error,econnrefused}}}},{child,undefined,leo_rpc_client_manager_1_at_13075,{leo_pod_sup,start_link,[leo_rpc_client_manager_1_at_13075,16,16,leo_rpc_client_conn,[manager_1,"files-m1.dev.local",13075,0],#Fun<leo_rpc_client_sup.0.73440599>]},permanent,10000,supervisor,[leo_pod_sup]}}}
[E]     mana...@files-master.vk1.local  2019-08-11 20:31:09.838267 +0300        1565544669      gen_server:call 0       gen_server leo_rpc_client_manager terminated with reason: no such process or port in call to gen_server:call(<0.2536.0>, raw_status) in gen_server:call/2 line 204
[E]     mana...@files-master.vk1.local  2019-08-11 20:31:09.839009 +0300        1565544669      gen_server:call 0       CRASH REPORT Process leo_rpc_client_manager with 0 neighbours exited with reason: no such process or port in call to gen_server:call(<0.2536.0>, raw_status) in gen_server:terminate/7 line 812
[E]     mana...@files-master.vk1.local  2019-08-11 20:31:09.839225 +0300        1565544669      gen_server:call 0       Supervisor leo_rpc_client_sup had child leo_rpc_client_manager started with leo_rpc_client_manager:start_link(5000) at <0.802.0> exit with reason no such process or port in call to gen_server:call(<0.2536.0>, raw_status) in context child_terminated
[E]     mana...@files-master.vk1.local  2019-08-11 20:31:14.840035 +0300        1565544674      gen_server:call 0       gen_server leo_rpc_client_manager terminated with reason: no such process or port in call to gen_server:call(<0.2536.0>, raw_status) in gen_server:call/2 line 204
[E]     mana...@files-master.vk1.local  2019-08-11 20:31:14.840946 +0300        1565544674      gen_server:call 0       CRASH REPORT Process leo_rpc_client_manager with 0 neighbours exited with reason: no such process or port in call to gen_server:call(<0.2536.0>, raw_status) in gen_server:terminate/7 line 812


and:

[E]     mana...@files-slave.vk1.local   2019-08-11 20:31:19.956220 +0300        1565544679      null:null       0       CRASH REPORT Process <0.2961.0> with 1 neighbours exited with reason: {connection_error,{connection_error,econnrefused}} in gen_server:init_it/6 line 344
[E]     mana...@files-slave.vk1.local   2019-08-11 20:31:19.958321 +0300        1565544679      null:null       0       Supervisor leo_rpc_client_manager_0_at_13076_sup had child leo_rpc_client_manager_0_at_13076 started with leo_pod_manager:start_link(leo_rpc_client_manager_0_at_13076, 16, 16, leo_rpc_client_conn, [manager_0,"files-m0.dev.local",13076,0], #Fun<leo_rpc_client_sup.0.73440599>) at undefined exit with reason {connection_error,{connection_error,econnrefused}} in context start_error
[W]     mana...@files-slave.vk1.local   2019-08-11 20:31:19.958621 +0300        1565544679      null:null       0       {module,"leo_rpc_client_sup"},{function,"start_child/3"},{line,106},{file,{{shutdown,{failed_to_start_child,leo_rpc_client_manager_0_at_13076,{connection_error,{connection_error,econnrefused}}}},{child,undefined,leo_rpc_client_manager_0_at_13076,{leo_pod_sup,start_link,[leo_rpc_client_manager_0_at_13076,16,16,leo_rpc_client_conn,[manager_0,"files-m0.dev.local",13076,0],#Fun<leo_rpc_client_sup.0.73440599>]},permanent,10000,supervisor,[leo_pod_sup]}}}
[E]     mana...@files-slave.vk1.local   2019-08-11 20:31:20.163643 +0300        1565544680      gen_server:call 0       gen_server leo_rpc_client_manager terminated with reason: no such process or port in call to gen_server:call(<0.2960.0>, raw_status) in gen_server:call/2 line 204
[E]     mana...@files-slave.vk1.local   2019-08-11 20:31:20.164152 +0300        1565544680      gen_server:call 0       CRASH REPORT Process leo_rpc_client_manager with 0 neighbours exited with reason: no such process or port in call to gen_server:call(<0.2960.0>, raw_status) in gen_server:terminate/7 line 812
[E]     mana...@files-slave.vk1.local   2019-08-11 20:31:20.164515 +0300        1565544680      gen_server:call 0       Supervisor leo_rpc_client_sup had child leo_rpc_client_manager started with leo_rpc_client_manager:start_link(5000) at <0.802.0> exit with reason no such process or port in call to gen_server:call(<0.2960.0>, raw_status) in context child_terminated
[E]     mana...@files-slave.vk1.local   2019-08-11 20:31:25.165433 +0300        1565544685      gen_server:call 0       gen_server leo_rpc_client_manager terminated with reason: no such process or port in call to gen_server:call(<0.2960.0>, raw_status) in gen_server:call/2 line 204
[E]     mana...@files-slave.vk1.local   2019-08-11 20:31:25.167107 +0300        1565544685      gen_server:call 0       CRASH REPORT Process leo_rpc_client_manager with 0 neighbours exited with reason: no such process or port in call to gen_server:call(<0.2960.0>, raw_status) in gen_server:terminate/7 line 812
[E]     mana...@files-slave.vk1.local   2019-08-11 20:31:25.167537 +0300        1565544685      gen_server:call 0       Supervisor leo_rpc_client_sup had child leo_rpc_client_manager started with leo_rpc_client_manager:start_link(5000) at <0.2963.0> exit with reason no such process or port in call to gen_server:call(<0.2960.0>, raw_status) in context child_terminated
[E]     mana...@files-slave.vk1.local   2019-08-11 20:31:30.167433 +0300        1565544690      gen_server:call 0       gen_server leo_rpc_client_manager terminated with reason: no such process or port in call to gen_server:call(<0.2960.0>, raw_status) in gen_server:call/2 line 204
[E]     mana...@files-slave.vk1.local   2019-08-11 20:31:30.169010 +0300        1565544690      gen_server:call 0       CRASH REPORT Process leo_rpc_client_manager with 0 neighbours exited with reason: no such process or port in call to gen_server:call(<0.2960.0>, raw_status) in gen_server:terminate/7 line 812
[E]     mana...@files-slave.vk1.local   2019-08-11 20:31:30.169421 +0300        1565544690      gen_server:call 0       Supervisor leo_rpc_client_sup had child leo_rpc_client_manager started with leo_rpc_client_manager:start_link(5000) at <0.3005.0> exit with reason no such process or port in call to gen_server:call(<0.2960.0>, raw_status) in context child_terminated
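
(The econnrefused above points at the first cluster's managers; for reference, a basic reachability check from the second cluster's manager hosts, with hostnames and ports taken from the messages above, would be something like:)

nc -vz files-m1.dev.local 13075
nc -vz files-m0.dev.local 13076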

Replication does not start.
If we change rpc.server.listen_port on the slaves to 13075, the errors are gone, but replication still does not start (you can find all configs, logs and statuses for this case in the attachment). After some time "whereis" returns nothing.
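
(For clarity, the only change in that test is the manager RPC port in leo_manager.conf on the second cluster's slave; a minimal sketch, with the original value of 13076 inferred from the join-cluster arguments above:)

# leo_manager.conf on files-slave.vk1.local (second cluster, slave manager)
# changed from the original 13076 for this test
rpc.server.listen_port = 13075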

We would appreciate any advice.

--
Alexey Kurnosov

Attachment: mdc.tgz