node timeout error when joining cluster

Sudha Ram

Jun 22, 2017, 6:02:56 AM
to codership
Hi, 

I have a 3-node cluster set up, and everything works fine. But when I increase the network delay with "tc qdisc add dev eth0 root netem delay 5000ms" (5 seconds) to simulate a remote node (node 3), that node won't join the cluster.

The error log is:

2017-06-22 14:35:03 14084 [Note] WSREP: Read nil XID from storage engines, skipping position init
2017-06-22 14:35:03 14084 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
2017-06-22 14:35:03 14084 [Note] WSREP: wsrep_load(): Galera 3.20(r7e383f7) by Codership Oy <in...@codership.com> loaded successfully.
2017-06-22 14:35:03 14084 [Note] WSREP: CRC-32C: using hardware acceleration.
2017-06-22 14:35:03 14084 [Note] WSREP: Found saved state: fcc29741-49df-11e7-b78c-3ecc331426c8:-1, safe_to_bootsrap: 0

2017-06-22 14:35:03 14084 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.2.253; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.peer_timeout = PT10S; gmcast.segment = 1; gmcast.time_wait = PT15S; gmcast.version = 
2017-06-22 14:35:03 14084 [Note] WSREP: GCache history reset: old(fcc29741-49df-11e7-b78c-3ecc331426c8:0) -> new(fcc29741-49df-11e7-b78c-3ecc331426c8:641)
2017-06-22 14:35:03 14084 [Note] WSREP: Assign initial position for certification: 641, protocol version: -1
2017-06-22 14:35:03 14084 [Note] WSREP: wsrep_sst_grab()
2017-06-22 14:35:03 14084 [Note] WSREP: Start replication
2017-06-22 14:35:03 14084 [Note] WSREP: Setting initial position to fcc29741-49df-11e7-b78c-3ecc331426c8:641
2017-06-22 14:35:03 14084 [Note] WSREP: protonet asio version 0
2017-06-22 14:35:03 14084 [Note] WSREP: Using CRC-32C for message checksums.
2017-06-22 14:35:03 14084 [Note] WSREP: initializing ssl context
2017-06-22 14:35:03 14084 [Note] WSREP: backend: asio
2017-06-22 14:35:03 14084 [Note] WSREP: gcomm thread scheduling priority set to other:0 
2017-06-22 14:35:03 14084 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
2017-06-22 14:35:03 14084 [Note] WSREP: restore pc from disk failed
2017-06-22 14:35:03 14084 [Note] WSREP: GMCast version 0
2017-06-22 14:35:03 14084 [Note] WSREP: (e1070e3c, 'ssl://0.0.0.0:4567') listening at ssl://0.0.0.0:4567
2017-06-22 14:35:03 14084 [Note] WSREP: (e1070e3c, 'ssl://0.0.0.0:4567') multicast: , ttl: 1
2017-06-22 14:35:03 14084 [Note] WSREP: EVS version 0
2017-06-22 14:35:03 14084 [Note] WSREP: gcomm: connecting to group 'test_cluster', peer '192.168.2.251:,192.168.2.252:,192.168.2.253:'
2017-06-22 14:35:03 14084 [Note] WSREP: SSL handshake successful, remote endpoint ssl://192.168.2.253:59486 local endpoint ssl://192.168.2.253:4567 cipher: AES128-SHA compression: 
2017-06-22 14:35:03 14084 [Note] WSREP: SSL handshake successful, remote endpoint ssl://192.168.2.253:4567 local endpoint ssl://192.168.2.253:59486 cipher: AES128-SHA compression: 
2017-06-22 14:35:03 14084 [Note] WSREP: (e1070e3c, 'ssl://0.0.0.0:4567') connection established to e1070e3c ssl://192.168.2.253:4567
2017-06-22 14:35:03 14084 [Warning] WSREP: (e1070e3c, 'ssl://0.0.0.0:4567') address 'ssl://192.168.2.253:4567' points to own listening address, blacklisting


2017-06-22 14:35:06 14084 [Warning] WSREP: no nodes coming from prim view, prim not possible
2017-06-22 14:35:06 14084 [Note] WSREP: view(view_id(NON_PRIM,e1070e3c,1) memb {
e1070e3c,1
} joined {
} left {
} partitioned {
})
2017-06-22 14:35:07 14084 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50834S), skipping check
2017-06-22 14:35:13 14084 [Note] WSREP: (e1070e3c, 'ssl://0.0.0.0:4567') connection to peer 00000000 with addr ssl://192.168.2.252:4567 timed out, no messages seen in PT10S
2017-06-22 14:35:13 14084 [Note] WSREP: (e1070e3c, 'ssl://0.0.0.0:4567') connection to peer e1070e3c with addr ssl://192.168.2.253:4567 timed out, no messages seen in PT10S
2017-06-22 14:35:28 14084 [Note] WSREP: (e1070e3c, 'ssl://0.0.0.0:4567') connection to peer 00000000 with addr ssl://192.168.2.252:4567 timed out, no messages seen in PT10S
2017-06-22 14:35:36 14084 [Note] WSREP: view((empty))
2017-06-22 14:35:36 14084 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():158
2017-06-22 14:35:36 14084 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-06-22 14:35:36 14084 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1404: Failed to open channel 'test_cluster' at 'gcomm://192.168.2.251,192.168.2.252,192.168.2.253': -110 (Connection timed out)
2017-06-22 14:35:36 14084 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-06-22 14:35:36 14084 [ERROR] WSREP: wsrep::connect(gcomm://192.168.2.251,192.168.2.252,192.168.2.253) failed: 7
2017-06-22 14:35:36 14084 [ERROR] Aborting

2017-06-22 14:35:36 14084 [Note] WSREP: Service disconnected.
2017-06-22 14:35:37 14084 [Note] WSREP: Some threads may fail to exit.
2017-06-22 14:35:37 14084 [Note] Binlog end
2017-06-22 14:35:37 14084 [Note] /usr/sbin/mysqld: Shutdown complete

My Galera settings:
[mysqld]
#binlog_format=ROW
#default-storage-engine=innodb
#innodb_autoinc_lock_mode=2
bind-address=0.0.0.0

# Galera Provider Configuration
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so

# Galera Cluster Configuration
wsrep_cluster_name="test_cluster"
wsrep_cluster_address="gcomm://192.168.2.251,192.168.2.252,192.168.2.253"

# Galera Synchronization Configuration
wsrep_sst_method=rsync

# Galera Node Configuration
wsrep_node_address="192.168.2.253"
wsrep_node_name="node-ins4"

#ssl settings
wsrep_provider_options="socket.ssl_key=/etc/mysql/certs/server-key.pem;socket.ssl_cert=/etc/mysql/certs/server-cert.pem;socket.ssl_ca=/etc/mysql/certs/ca-cert.pem;socket.checksum=2;socket.ssl_cipher=AES128-SHA;pc.bootstrap=YES;gmcast.segment=2;gmcast.peer_timeout=PT10S;gmcast.time_wait=PT15S"
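
For reference, here is a rough sketch of how the timeouts above could be compared against the injected delay. It only does arithmetic on values copied from the "Passing config to GCS" log line and from wsrep_provider_options. The names parse_duration, at_or_below and TIMEOUTS are made up for this example and are not part of Galera, and the 5-second figure is the netem delay from the tc command. Whether any of these settings is what actually blocks the join is not something the log establishes.

```python
import re

# Values copied from the "Passing config to GCS" log line and from
# wsrep_provider_options; the selection of keys is an assumption.
TIMEOUTS = {
    "evs.suspect_timeout": "PT5S",
    "evs.inactive_timeout": "PT15S",
    "evs.inactive_check_period": "PT0.5S",
    "evs.join_retrans_period": "PT1S",
    "evs.delay_margin": "PT1S",
    "gmcast.peer_timeout": "PT10S",
    "gmcast.time_wait": "PT15S",
}

# Matches time-only ISO 8601 durations such as PT0.5S, PT1M or PT24H.
_DURATION = re.compile(
    r"^PT(?:(\d+(?:\.\d+)?)H)?(?:(\d+(?:\.\d+)?)M)?(?:(\d+(?:\.\d+)?)S)?$"
)


def parse_duration(text):
    """Convert a duration like 'PT15S' to seconds (float)."""
    match = _DURATION.match(text)
    if match is None or not any(match.groups()):
        raise ValueError("unsupported duration: %r" % text)
    hours, minutes, seconds = (float(g) if g else 0.0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def at_or_below(timeouts, delay_seconds):
    """Return the sorted names of settings whose value is <= the delay."""
    return sorted(
        name
        for name, value in timeouts.items()
        if parse_duration(value) <= delay_seconds
    )


if __name__ == "__main__":
    # 5.0 is the delay injected with "netem delay 5000ms".
    for name in at_or_below(TIMEOUTS, 5.0):
        print(name, TIMEOUTS[name])
```

Run as a script, this lists the settings whose value does not exceed the 5-second delay, which at least shows which periods are shorter than one delayed packet.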



Is it possible to allow a node with a 3-5 second delay in the cluster?