cluster won't restart after I stopped all 3 nodes


Ben Hsu

Apr 15, 2015, 4:45:42 PM
to codersh...@googlegroups.com
Hello

I am testing out Galera replication. I installed Galera on 3 nodes (plus one for ClusterControl) using the Severalnines configuration tool.

As part of my tests I ran "service mysqld stop" on all the nodes. Once I did this, the cluster wouldn't come back up. Any idea what I am doing wrong? Do I have to start the nodes in a particular order?

Error log is here:

150415 20:00:11 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
150415 20:00:11 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.7YUiI8' --pid-file='/var/lib/mysql/benhsu-50000-ha-3-recover.pid'
150415 20:00:14 mysqld_safe WSREP: Recovered position 6a6f811a-e2df-11e4-b6f8-1ef40dca8af9:100966
150415 20:00:14 [Note] WSREP: wsrep_start_position var submitted: '6a6f811a-e2df-11e4-b6f8-1ef40dca8af9:100966'
150415 20:00:14 [Note] WSREP: Read nil XID from storage engines, skipping position init
150415 20:00:14 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
150415 20:00:14 [Note] WSREP: wsrep_load(): Galera 25.2.8(r165) by Codership Oy <in...@codership.com> loaded successfully.
150415 20:00:14 [Note] WSREP: Found saved state: 6a6f811a-e2df-11e4-b6f8-1ef40dca8af9:-1
150415 20:00:14 [Note] WSREP: Reusing existing '/var/lib/mysql//galera.cache'.
150415 20:00:14 [Note] WSREP: Passing config to GCS: base_host = 10.209.1.72; base_port = 4567; cert.log_conflicts = no; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
150415 20:00:14 [Note] WSREP: Assign initial position for certification: 100966, protocol version: -1
150415 20:00:14 [Note] WSREP: wsrep_sst_grab()
150415 20:00:14 [Note] WSREP: Start replication
150415 20:00:14 [Note] WSREP: Setting initial position to 6a6f811a-e2df-11e4-b6f8-1ef40dca8af9:100966
150415 20:00:14 [Note] WSREP: protonet asio version 0
150415 20:00:14 [Note] WSREP: backend: asio
150415 20:00:14 [Note] WSREP: GMCast version 0
150415 20:00:14 [Note] WSREP: (07d02d9b-e3aa-11e4-a577-e664301a2a10, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
150415 20:00:14 [Note] WSREP: (07d02d9b-e3aa-11e4-a577-e664301a2a10, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
150415 20:00:14 [Note] WSREP: EVS version 0
150415 20:00:14 [Note] WSREP: PC version 0
150415 20:00:14 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer '10.208.196.213:,10.208.160.32:,10.209.1.72:'
150415 20:00:14 [Warning] WSREP: (07d02d9b-e3aa-11e4-a577-e664301a2a10, 'tcp://0.0.0.0:4567') address 'tcp://10.209.1.72:4567' points to own listening address, blacklisting
150415 20:00:17 [Warning] WSREP: no nodes coming from prim view, prim not possible
150415 20:00:17 [Note] WSREP: view(view_id(NON_PRIM,07d02d9b-e3aa-11e4-a577-e664301a2a10,1) memb {
07d02d9b-e3aa-11e4-a577-e664301a2a10,
} joined {
} left {
} partitioned {
})
150415 20:00:17 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50376S), skipping check
150415 20:00:47 [Note] WSREP: view((empty))
150415 20:00:47 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():141
150415 20:00:47 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():196: Failed to open backend connection: -110 (Connection timed out)
150415 20:00:47 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1290: Failed to open channel 'my_wsrep_cluster' at 'gcomm://10.208.196.213,10.208.160.32,10.209.1.72': -110 (Connection timed out)
150415 20:00:47 [ERROR] WSREP: gcs connect failed: Connection timed out
150415 20:00:47 [ERROR] WSREP: wsrep::connect() failed: 7
150415 20:00:47 [ERROR] Aborting

150415 20:00:47 [Note] WSREP: Service disconnected.
150415 20:00:48 [Note] WSREP: Some threads may fail to exit.
150415 20:00:48 [Note] /usr/sbin/mysqld: Shutdown complete

150415 20:00:48 mysqld_safe mysqld from pid file /var/lib/mysql/benhsu-50000-ha-3.pid ended

Philip Stoev

Apr 15, 2015, 4:50:14 PM
to Ben Hsu, codersh...@googlegroups.com
Hello,

Can you check if port 4567 is open and accessible between the machines of
the cluster?

Philip Stoev
--
You received this message because you are subscribed to the Google Groups
"codership" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to codership-tea...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ben Hsu

Apr 15, 2015, 4:55:07 PM
to Philip Stoev, codersh...@googlegroups.com
Hi Philip

Thank you for responding to my question. It looks like port 4567 is open:

machine 1: nc -l 4567

machine 2: nc -z $machine1_ip 4567
returns 0


Philip Stoev

Apr 15, 2015, 5:32:56 PM
to Ben Hsu, codersh...@googlegroups.com
Thanks.

What seems to have happened in your case: when the entire cluster goes down, the nodes need to be restarted so that the node that was shut down last is started first, as if it were the first node of a completely new cluster (with the --wsrep-new-cluster command-line option). The other nodes can then be started, and if their wsrep_cluster_address configuration option in my.cnf contains a reference to the first node, they will join the new cluster.
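As a sketch, and assuming an init script like the one in your log (service and script names vary by distribution), the restart sequence would look like:

```shell
# On the node that was shut down last: bootstrap a new cluster.
# --wsrep-new-cluster tells this node not to look for existing members
# and to form a new primary component on its own.
service mysqld start --wsrep-new-cluster

# On each remaining node, a normal start; they will find the first
# node via wsrep_cluster_address in my.cnf and join the new cluster.
service mysqld start
```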

To avoid this situation in the future, you can set
wsrep_provider_options='pc.recovery=ON' on all nodes. This will instruct
the nodes to keep information about the cluster membership in persistent
storage, which allows the nodes to be restarted together in any order: if
all the nodes from the original cluster come up, they will automatically
form a new cluster.
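In my.cnf that would be something like (any other wsrep settings on the line would need to be kept alongside it):

```ini
[mysqld]
# Keep primary-component membership state on disk so that, after a
# full-cluster outage, the nodes can automatically re-form the cluster
# when they all come back up.
wsrep_provider_options='pc.recovery=ON'
```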

Philip Stoev




Ben Hsu

Apr 16, 2015, 1:07:27 PM
to Philip Stoev, codersh...@googlegroups.com
Hi Philip

Is there any way to tell which node was shut down last? I tried starting MySQL individually on each of the three nodes, and I get the same error on all 3 of them.

What is supposed to be listening on port 4567? Is it mysqld or some other process? I noticed that nothing is listening on that port right now (I ran my connectivity test by running netcat as a listener).

150416 16:59:57 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer '10.208.196.213:,10.208.160.32:,10.209.1.72:'
150416 16:59:57 [Warning] WSREP: (030efa5e-e45a-11e4-b3d3-2b4d8ad52b1d, 'tcp://0.0.0.0:4567') address 'tcp://10.209.1.72:4567' points to own listening address, blacklisting
150416 17:00:00 [Warning] WSREP: no nodes coming from prim view, prim not possible
150416 17:00:00 [Note] WSREP: view(view_id(NON_PRIM,030efa5e-e45a-11e4-b3d3-2b4d8ad52b1d,1) memb {
030efa5e-e45a-11e4-b3d3-2b4d8ad52b1d,
} joined {
} left {
} partitioned {
})
150416 17:00:01 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50374S), skipping check
150416 17:00:30 [Note] WSREP: view((empty))
150416 17:00:30 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():141
150416 17:00:30 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():196: Failed to open backend connection: -110 (Connection timed out)
150416 17:00:30 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1290: Failed to open channel 'my_wsrep_cluster' at 'gcomm://10.208.196.213,10.208.160.32,10.209.1.72': -110 (Connection timed out)
150416 17:00:30 [ERROR] WSREP: gcs connect failed: Connection timed out
150416 17:00:30 [ERROR] WSREP: wsrep::connect() failed: 7
150416 17:00:30 [ERROR] Aborting
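For what it's worth, the "listening at tcp://0.0.0.0:4567" lines in the log come from mysqld itself: the Galera provider opens the replication port inside the mysqld process, so nothing is bound there while mysqld is down. One way to check on a running node (assuming ss from iproute2, or the older netstat, is available):

```shell
# Show the process bound to the Galera replication port;
# mysqld should appear in the output while the node is up.
ss -ltnp | grep 4567
```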



Sergey Mishin

Apr 16, 2015, 3:05:59 PM
to codersh...@googlegroups.com, philip...@galeracluster.com
Hello,
It sounds like you stopped the whole cluster. If you stopped all the nodes at once, you just need to start one node with bootstrap, then start the others normally.
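Depending on the packaging, the init script may expose a bootstrap action directly; for example (the exact service name and action here are assumptions that vary by distribution):

```shell
# First node only: bootstrap a new primary component.
# (Some init scripts expose a "bootstrap" action; others take
#  --wsrep-new-cluster as an argument to "start".)
service mysql bootstrap

# Remaining nodes: a plain start; they join via wsrep_cluster_address.
service mysql start
```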
