Not able to restart the cluster when all the nodes leave the cluster gracefully


Chandra Kapate

Jul 21, 2017, 12:50:41 PM
to codership, cs_k...@yahoo.com
Hi,
I have a question about restarting the cluster when all the nodes are shut down gracefully.
I might have missed some config setting, or I might have made a mistake. I'm hoping someone can help me out.
I really appreciate any help/comments.
Thanks,
Chandra

Sorry for the long message; I wanted to provide as much info as possible....

Issue: I am NOT able to restart the cluster when both the nodes are shut down gracefully with MariaDB.
       Also, I see that the gvwstate.dat file gets deleted once I stop the service (on both nodes).
  (I am using 2 nodes and trying to set up a master-slave kind of configuration.)
    The provider options set: pc.bootstrap=YES;pc.recovery=TRUE;pc.wait_prim=FALSE;gcache.recover=YES;pc.ignore_sb=TRUE

Steps followed (in this order)
 Node1 : galera_new_cluster
 Node2 : systemctl start mariadb
 The cluster is up and running, and I could see wsrep_cluster_size = 2, etc. (other info is in the attached file)
 Node 2: systemctl stop mariadb
 Node 1: systemctl stop mariadb
 I see that the gvwstate.dat file is not present on either system (see the quick check sketched below)
 Node 1: systemctl start mariadb
         This fails with the following error:
Job for mariadb.service failed because a timeout was exceeded. See "systemctl status mariadb.service" and "journalctl -xe" for details.
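
(For reference, the leftover state files can be checked like this after the shutdown; this assumes the default datadir /var/lib/mysql, which matches the paths in the log below:)

  # on each node, after "systemctl stop mariadb"
  ls -l /var/lib/mysql/grastate.dat /var/lib/mysql/gvwstate.dat
  cat /var/lib/mysql/grastate.dat
  # grastate.dat is still there (uuid, seqno, safe_to_bootstrap),
  # but gvwstate.dat is gone after the graceful stop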

A few things from the mysql.log file:
    2017-07-21 13:14:43 139707470456960 [Note] WSREP: Setting wsrep_ready to 0
2017-07-21 13:14:43 139707470456960 [Note] WSREP: Read nil XID from storage engines, skipping position init
2017-07-21 13:14:43 139707470456960 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
2017-07-21 13:14:44 139707470456960 [Note] WSREP: wsrep_load(): Galera 25.3.20(r3703) by Codership Oy <in...@codership.com> loaded successfully.
2017-07-21 13:14:44 139707470456960 [Note] WSREP: CRC-32C: using hardware acceleration.
2017-07-21 13:14:44 139707470456960 [Note] WSREP: Found saved state: 7edab534-6e15-11e7-9600-9a33d350e94b:0, safe_to_bootsrap: 1
2017-07-21 13:14:44 139707470456960 [Note] WSREP: Recovering GCache ring buffer: version: 1, UUID: 7edab534-6e15-11e7-9600-9a33d350e94b, offset: -1
2017-07-21 13:14:44 139707470456960 [Note] WSREP: GCache::RingBuffer initial scan (134217768 bytes)... 0.0% (0 bytes) complete.
2017-07-21 13:14:44 139707470456960 [Note] WSREP: GCache::RingBuffer initial scan (134217768 bytes)... 100.0% (134217768 bytes) complete.
2017-07-21 13:14:44 139707470456960 [Note] WSREP: Recovering GCache ring buffer: didn't recover any events.
2017-07-21 13:14:44 139707470456960 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 10.1.10.72; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = YES; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.bootstrap = YES; pc.che
2017-07-21 13:14:44 139707470456960 [Note] WSREP: Assign initial position for certification: 0, protocol version: -1
2017-07-21 13:14:44 139707470456960 [Note] WSREP: wsrep_sst_grab()
2017-07-21 13:14:44 139707470456960 [Note] WSREP: Start replication
2017-07-21 13:14:44 139707470456960 [Note] WSREP: Setting initial position to 7edab534-6e15-11e7-9600-9a33d350e94b:0
2017-07-21 13:14:44 139707470456960 [Note] WSREP: protonet asio version 0
2017-07-21 13:14:44 139707470456960 [Note] WSREP: Using CRC-32C for message checksums.
2017-07-21 13:14:44 139707470456960 [Note] WSREP: initializing ssl context
2017-07-21 13:14:44 139707470456960 [Note] WSREP: backend: asio
2017-07-21 13:14:44 139707470456960 [Note] WSREP: gcomm thread scheduling priority set to other:0
2017-07-21 13:14:44 139707470456960 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
2017-07-21 13:14:44 139707470456960 [Note] WSREP: restore pc from disk failed
2017-07-21 13:14:44 139707470456960 [Note] WSREP: GMCast version 0
2017-07-21 13:14:44 139707470456960 [Note] WSREP: (905415ad, 'ssl://0.0.0.0:4567') listening at ssl://0.0.0.0:4567
2017-07-21 13:14:44 139707470456960 [Note] WSREP: (905415ad, 'ssl://0.0.0.0:4567') multicast: , ttl: 1
2017-07-21 13:14:44 139707470456960 [Note] WSREP: EVS version 0
2017-07-21 13:14:44 139707470456960 [Note] WSREP: gcomm: connecting to group 'galera', peer '10.1.10.72:,10.1.10.73:'
2017-07-21 13:14:44 139707470456960 [Note] WSREP: SSL handshake successful, remote endpoint ssl://10.1.10.72:41760 local endpoint ssl://10.1.10.72:4567 cipher: AES128-SHA compression:
2017-07-21 13:14:44 139707470456960 [Note] WSREP: SSL handshake successful, remote endpoint ssl://10.1.10.72:4567 local endpoint ssl://10.1.10.72:41760 cipher: AES128-SHA compression:
2017-07-21 13:14:44 139707470456960 [Note] WSREP: (905415ad, 'ssl://0.0.0.0:4567') connection established to 905415ad ssl://10.1.10.72:4567
2017-07-21 13:14:44 139707470456960 [Warning] WSREP: (905415ad, 'ssl://0.0.0.0:4567') address 'ssl://10.1.10.72:4567' points to own listening address, blacklisting
2017-07-21 13:14:47 139707470456960 [Warning] WSREP: no nodes coming from prim view, prim not possible
2017-07-21 13:14:47 139707470456960 [Note] WSREP: view(view_id(NON_PRIM,905415ad,1) memb {
        905415ad,0
} joined {
} left {
} partitioned {
})
2017-07-21 13:14:47 139707470456960 [Note] WSREP: gcomm: connected
2017-07-21 13:14:47 139707470456960 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2017-07-21 13:14:47 139707470456960 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2017-07-21 13:14:47 139707470456960 [Note] WSREP: Opened channel 'galera'
2017-07-21 13:14:47 139707470456960 [Note] WSREP: Waiting for SST to complete.
2017-07-21 13:14:47 139707132389120 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2017-07-21 13:14:47 139707132389120 [Note] WSREP: Flow-control interval: [16, 16]
2017-07-21 13:14:47 139707132389120 [Note] WSREP: Received NON-PRIMARY.
2017-07-21 13:14:47 139707371542272 [Note] WSREP: New cluster view: global state: 7edab534-6e15-11e7-9600-9a33d350e94b:0, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version -1
2017-07-21 13:14:47 139707371542272 [Note] WSREP: Setting wsrep_ready to 0
2017-07-21 13:14:48 139707140781824 [Note] WSREP: (905415ad, 'ssl://0.0.0.0:4567') connection to peer 905415ad with addr ssl://10.1.10.72:4567 timed out, no messages seen in PT3S
2017-07-21 13:14:48 139707140781824 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50597S), skipping check



 
 Node1: some important variables/status from the mysql client on node1:
wsrep_provider_version       | 25.3.20(r3703)
Variable_name: wsrep_provider_options
        Value: base_dir = /var/lib/mysql/; base_host = 10.1.10.72; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.causal_keepalive_period = PT1S; evs.debug_log_mask = 0x1; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.info_log_mask = 0; evs.install_timeout = PT7.5S; evs.join_retrans_period = PT1S; evs.keepalive_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.use_aggregate = true; evs.user_send_window = 2; evs.version = 0; evs.view_forget_timeout = P1D; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = YES; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.listen_addr = ssl://0.0.0.0:4567; gmcast.mcast_addr = ; gmcast.mcast_ttl = 1; gmcast.peer_timeout = PT3S; gmcast.segment = 0; gmcast.time_wait = PT5S; gmcast.version = 0; ist.recv_addr = 10.1.10.72; pc.announce_timeout = PT3S; pc.bootstrap = YES; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = true; pc.linger = PT20S; pc.npvo = false; pc.recovery = true; pc.version = 0; pc.wait_prim = FALSE; pc.wait_prim_timeout = PT30S; pc.weight = 1; protonet.backend = asio; protonet.version = 0; repl.causal_read_timeout = PT30S; repl.commit_order = 3; repl.key_format = FLAT8; repl.max_ws_size = 2147483647; repl.proto_max = 7; socket.checksum = 2; socket.recv_buf_size = 212992; socket.ssl = YES; socket.ssl_ca = /opt/Certs/ca-cert.pem; socket.ssl_cert = /opt/Certs/server-cert.pem; socket.ssl_cipher = AES128-SHA; socket.ssl_compression = YES; socket.ssl_key = /opt/Certs/server-key.pem;
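
(These were collected with the mysql client, roughly like this; the exact invocation may differ:)

  mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_provider_version'"
  mysql -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options'\G"
  mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'"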
       
Env: 2 Linux CentOS systems (Linux vm33_mariadb_72 4.4.49-1.el7.elrepo.x86_64 #1 SMP Wed Feb 15 12:43:41 EST 2017 x86_64 x86_64 x86_64 GNU/Linux)
      MariaDB packages:
   MariaDB-compat-10.2.7-1.el7.centos.x86_64
MariaDB-client-10.2.7-1.el7.centos.x86_64
MariaDB-server-10.2.7-1.el7.centos.x86_64
MariaDB-common-10.2.7-1.el7.centos.x86_64
galera-25.3.20-1.rhel7.el7.centos.x86_64
socat-1.7.2.2-5.el7.x86_64
Configs: the important ones (the rest are in the attached file):
[galera]
#wsrep settings
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_address=gcomm://10.1.10.72,10.1.10.73
wsrep_cluster_name=galera
wsrep_node_address=10.1.10.72
wsrep_node_name=vm33_mariadb_72
#wsrep_sst_method=xtrabackup-v2
wsrep_sst_method=rsync
wsrep_sst_auth=timsgalera:abc123
binlog_format=row
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
innodb_flush_log_at_trx_commit=0
bind-address=10.1.10.72
query_cache_size=0
innodb_doublewrite=1
wsrep_provider_options="pc.bootstrap=YES;pc.recovery=TRUE;pc.wait_prim=FALSE;gcache.recover=YES;pc.ignore_sb=TRUE;socket.ssl_key=/opt/Certs/server-key.pem;socket.ssl_cert=/opt/Certs/server-cert.pem;socket.ssl_ca=/opt/Certs/ca-cert.pem;socket.ssl_cipher=AES128-SHA"
wsrep_log_conflicts=ON
wsrep_debug=ON
wsrep_notify_cmd=/opt/scripts/nodeStatusChange.sh

StatuAndVarInfo.txt

Jörg Brühe

Jul 21, 2017, 4:27:50 PM
to codersh...@googlegroups.com
Hi Chandra!
"Works as designed":
A Galera node will refuse to operate if it cannot join any partner node (= cluster).
The only way to make an isolated node operate is to tell it explicitly that it is forming a new cluster, like you did at the very start.

"galera_new_cluster" does not tell the node to initialize some data on disk or similar; rather, it tells the node that there is no partner (yet) and that it will be the nucleus of a new cluster.

I'm sure the manual has all that in more detail.
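
In practice, a full-cluster restart after a graceful shutdown looks roughly like
this (just a sketch, using the systemd units from your mail):

  # on the node that was stopped last (it has the most recent state):
  galera_new_cluster          # wrapper that starts MariaDB with --wsrep-new-cluster
  # on the other node:
  systemctl start mariadb     # joins the bootstrapped node and syncs via IST or SST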


HTH,
Jörg

--
Joerg Bruehe, Senior MySQL Support Engineer, joerg....@fromdual.com
FromDual GmbH, Rebenweg 6, CH - 8610 Uster; phone +41 44 500 58 26
Geschäftsführer: Oliver Sennhauser
Handelsregister-Eintrag: CH-020.4.044.539-3

Anand.S

Jul 22, 2017, 5:24:30 AM
to Chandra Kapate, cs_k...@yahoo.com, codership
When the entire cluster is down, to bring it back online you should bootstrap the first node with the --wsrep-new-cluster flag (that is, galera_new_cluster), to let the cluster know which node has the most recent updates.
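
A rough sketch of that (assuming the default datadir /var/lib/mysql):

  # on each node, look at the state saved by the graceful shutdown:
  cat /var/lib/mysql/grastate.dat
  # bootstrap the node with the highest seqno (normally it also has safe_to_bootstrap: 1):
  galera_new_cluster
  # then start the remaining node(s) normally:
  systemctl start mariadb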

Thanks
Anand


Chandra Kapate

Jul 23, 2017, 2:09:14 PM
to codership, cs_k...@yahoo.com
Thanks, Jörg and Anand. I appreciate your replies.

I thought it could be done because of the following information from the docs:

    pc.recovery
        When set to TRUE, the node stores the Primary Component state to disk, in the gvwstate.dat file. The Primary Component can then recover automatically when all nodes that were part of the last saved state reestablish communications with each other.
        wsrep_provider_options="pc.recovery=TRUE"
    This allows for:
        • Automatic recovery from full cluster crashes, such as in the case of a data center power outage.
        • Graceful full cluster restarts without the need for explicitly bootstrapping a new Primary Component.
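
(For context, the gvwstate.dat file that pc.recovery reads is a small plain-text file; when it exists, it looks roughly like the sketch below, with illustrative UUIDs. In my case the file is removed on a graceful stop, which matches the "restore pc from disk failed" line in my log.)

    my_uuid: 905415ad-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    #vwbeg
    view_id: 3 905415ad-xxxx-xxxx-xxxx-xxxxxxxxxxxx 2
    bootstrap: 0
    member: 905415ad-xxxx-xxxx-xxxx-xxxxxxxxxxxx 0
    member: <uuid of the second node> 0
    #vwend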

I will use the --wsrep-new-cluster option to restart the first node.

My concern is to avoid SST when the second node comes up in such a case. The data could be large, and I do not want the whole data set to be transferred to the second node. If it can be avoided in this case, that would be nice (see the sketch below).
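
(Sketching the idea: if the donor's gcache still holds the write-sets the rejoining node is missing, the joiner can sync via IST instead of a full SST. The 2G value below is only an example; the current config uses the 128M default.)

  wsrep_provider_options="...;gcache.size=2G;gcache.recover=yes;..."
  # ("..." stands for the existing options; 2G is an illustrative size)
  # after restarting the second node, its mysql.log should indicate IST
  # rather than a full state transfer, provided the missing write-sets
  # still fit in the donor's gcache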

Thanks again.
Chandra