Galera cluster did not start anymore

Kernel Panic

Sep 3, 2016, 2:53:05 PM
to codership
Hi there guys

I'm new to Galera. I was able to configure a three-node cluster, and it was working fine until I shut down the instances. This is my config:


my.cnf

#
# This group is read both both by the client and the server
# use it for options that affect everything
#
[client-server]

#
# include all files from the config directory
#
!includedir /etc/my.cnf.d

[mysqld_safe]
#debug=d,info,error,query:o,/var/log/mysqld.trace
log_error=/var/log/mysql_error.log

[mysqld]
#log_error=/var/log/mysql_error.log

general_log_file        = /var/log/mysql.log
general_log             = 1


/etc/my.cnf.d/server.cnf
#
# These groups are read by MariaDB server.
# Use it for options that only the server (but not clients) should see
#
# See the examples of server my.cnf files in /usr/share/mysql/
#

# this is read by the standalone daemon and embedded servers
[server]

# this is only for the mysqld standalone daemon
[mysqld]

#
# * Galera-related settings
#
#[galera]
# Mandatory settings
#wsrep_on=ON
#wsrep_provider=
#wsrep_cluster_address=
#binlog_format=row
#default_storage_engine=InnoDB
#innodb_autoinc_lock_mode=2
#
# Allow server to accept connections on all interfaces.
#
#bind-address=0.0.0.0
#
# Optional setting
#wsrep_slave_threads=1
#innodb_flush_log_at_trx_commit=0

# this is only for embedded server

### Start Lumiserv Configuration
[galera]
# Mandatory settings
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
#wsrep_cluster_address='gcomm://'
wsrep_cluster_address='gcomm://172.31.24.77,172.31.37.80,172.31.4.41'
wsrep_cluster_name='lumiservgalera'
wsrep_node_address='172.31.4.41'
wsrep_node_name='galera3'
wsrep_sst_method=rsync
wsrep_debug=ON
wsrep_log_conflicts=ON
wsrep_dbug_option=ON
binlog_format=row
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
bind-address=0.0.0.0

### End Lumiserv Configuration


[embedded]


# This group is only read by MariaDB servers, not by MySQL.
# If you use the same .cnf file for MySQL and MariaDB,
# you can put MariaDB-only options here
[mariadb]

# This group is only read by MariaDB-10.1 servers.
# If you use the same .cnf file for MariaDB of different versions,
# use this group for options that older servers don't understand
[mariadb-10.1]


When I started the instances again, the service did not start. I got this error:

Sep  3 18:43:50 ip-172-31-4-41 systemd: Starting MariaDB database server...
Sep  3 18:43:52 ip-172-31-4-41 sh: WSREP: Recovered position 70588c9b-5c03-11e6-bca8-1f3f45fdb070:127
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619638335616 [Note] /usr/sbin/mysqld (mysqld 10.1.16-MariaDB) starting as process 3622 ...
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619638335616 [Note] WSREP: Setting wsrep_ready to 0
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619638335616 [Note] WSREP: Read nil XID from storage engines, skipping position init
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619638335616 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619638335616 [Note] WSREP: wsrep_load(): Galera 25.3.15(r3578) by Codership Oy <in...@codership.com> loaded successfully.
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619638335616 [Note] WSREP: CRC-32C: using hardware acceleration.
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619638335616 [Note] WSREP: Found saved state: 70588c9b-5c03-11e6-bca8-1f3f45fdb070:-1
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619638335616 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 172.31.4.41; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false;
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619389335296 [Note] WSREP: Service thread queue flushed.
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619638335616 [Note] WSREP: Assign initial position for certification: 127, protocol version: -1
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619638335616 [Note] WSREP: wsrep_sst_grab()
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619638335616 [Note] WSREP: Start replication
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619638335616 [Note] WSREP: Setting initial position to 70588c9b-5c03-11e6-bca8-1f3f45fdb070:127
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619638335616 [Note] WSREP: protonet asio version 0
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619638335616 [Note] WSREP: Using CRC-32C for message checksums.
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619638335616 [Note] WSREP: backend: asio
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619638335616 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619638335616 [Note] WSREP: restore pc from disk failed
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619638335616 [Note] WSREP: GMCast version 0
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619638335616 [Note] WSREP: (5c8ef3f0, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619638335616 [Note] WSREP: (5c8ef3f0, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619638335616 [Note] WSREP: EVS version 0
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619638335616 [Note] WSREP: gcomm: connecting to group 'lumiservgalera', peer '172.31.24.77:,172.31.37.80:,172.31.4.41:'
Sep  3 18:43:52 ip-172-31-4-41 mysqld: 2016-09-03 18:43:52 140619638335616 [Warning] WSREP: (5c8ef3f0, 'tcp://0.0.0.0:4567') address 'tcp://172.31.4.41:4567' points to own listening address, blacklisting
Sep  3 18:43:55 ip-172-31-4-41 mysqld: 2016-09-03 18:43:55 140619638335616 [Warning] WSREP: no nodes coming from prim view, prim not possible
Sep  3 18:43:55 ip-172-31-4-41 mysqld: 2016-09-03 18:43:55 140619638335616 [Note] WSREP: view(view_id(NON_PRIM,5c8ef3f0,1) memb {
Sep  3 18:43:55 ip-172-31-4-41 mysqld: 5c8ef3f0,0
Sep  3 18:43:55 ip-172-31-4-41 mysqld: } joined {
Sep  3 18:43:55 ip-172-31-4-41 mysqld: } left {
Sep  3 18:43:55 ip-172-31-4-41 mysqld: } partitioned {
Sep  3 18:43:55 ip-172-31-4-41 mysqld: })
Sep  3 18:43:56 ip-172-31-4-41 mysqld: 2016-09-03 18:43:56 140619638335616 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50319S), skipping check
Sep  3 18:44:25 ip-172-31-4-41 mysqld: 2016-09-03 18:44:25 140619638335616 [Note] WSREP: view((empty))
Sep  3 18:44:25 ip-172-31-4-41 mysqld: 2016-09-03 18:44:25 140619638335616 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
Sep  3 18:44:25 ip-172-31-4-41 mysqld: at gcomm/src/pc.cpp:connect():162
Sep  3 18:44:25 ip-172-31-4-41 mysqld: 2016-09-03 18:44:25 140619638335616 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
Sep  3 18:44:25 ip-172-31-4-41 mysqld: 2016-09-03 18:44:25 140619638335616 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1379: Failed to open channel 'lumiservgalera' at 'gcomm://172.31.24.77,172.31.37.80,172.31.4.41': -110 (Connection timed out)
Sep  3 18:44:25 ip-172-31-4-41 mysqld: 2016-09-03 18:44:25 140619638335616 [ERROR] WSREP: gcs connect failed: Connection timed out
Sep  3 18:44:25 ip-172-31-4-41 mysqld: 2016-09-03 18:44:25 140619638335616 [ERROR] WSREP: wsrep::connect(gcomm://172.31.24.77,172.31.37.80,172.31.4.41) failed: 7
Sep  3 18:44:25 ip-172-31-4-41 mysqld: 2016-09-03 18:44:25 140619638335616 [ERROR] Aborting
Sep  3 18:44:26 ip-172-31-4-41 systemd: mariadb.service: main process exited, code=exited, status=1/FAILURE
Sep  3 18:44:26 ip-172-31-4-41 systemd: Failed to start MariaDB database server.
Sep  3 18:44:26 ip-172-31-4-41 systemd: Unit mariadb.service entered failed state.
Sep  3 18:44:26 ip-172-31-4-41 systemd: mariadb.service failed.

systemctl status mysql -l
● mariadb.service - MariaDB database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/mariadb.service.d
           └─migrated-from-my.cnf-settings.conf
   Active: failed (Result: exit-code) since Sat 2016-09-03 18:44:26 UTC; 1min 57s ago
  Process: 3015 ExecStartPost=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
  Process: 3622 ExecStart=/usr/sbin/mysqld $MYSQLD_OPTS $_WSREP_NEW_CLUSTER $_WSREP_START_POSITION (code=exited, status=1/FAILURE)
  Process: 3528 ExecStartPre=/bin/sh -c VAR=`/usr/bin/galera_recovery`; [ $? -eq 0 ] &&   systemctl set-environment _WSREP_START_POSITION=$VAR || exit 1 (code=exited, status=0/SUCCESS)
  Process: 3526 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
 Main PID: 3622 (code=exited, status=1/FAILURE)
   Status: "MariaDB server is down"

Sep 03 18:44:25 ip-172-31-4-41.eu-west-1.compute.internal mysqld[3622]: at gcomm/src/pc.cpp:connect():162
Sep 03 18:44:25 ip-172-31-4-41.eu-west-1.compute.internal mysqld[3622]: 2016-09-03 18:44:25 140619638335616 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
Sep 03 18:44:25 ip-172-31-4-41.eu-west-1.compute.internal mysqld[3622]: 2016-09-03 18:44:25 140619638335616 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1379: Failed to open channel 'lumiservgalera' at 'gcomm://172.31.24.77,172.31.37.80,172.31.4.41': -110 (Connection timed out)
Sep 03 18:44:25 ip-172-31-4-41.eu-west-1.compute.internal mysqld[3622]: 2016-09-03 18:44:25 140619638335616 [ERROR] WSREP: gcs connect failed: Connection timed out
Sep 03 18:44:25 ip-172-31-4-41.eu-west-1.compute.internal mysqld[3622]: 2016-09-03 18:44:25 140619638335616 [ERROR] WSREP: wsrep::connect(gcomm://172.31.24.77,172.31.37.80,172.31.4.41) failed: 7
Sep 03 18:44:25 ip-172-31-4-41.eu-west-1.compute.internal mysqld[3622]: 2016-09-03 18:44:25 140619638335616 [ERROR] Aborting
Sep 03 18:44:26 ip-172-31-4-41.eu-west-1.compute.internal systemd[1]: mariadb.service: main process exited, code=exited, status=1/FAILURE
Sep 03 18:44:26 ip-172-31-4-41.eu-west-1.compute.internal systemd[1]: Failed to start MariaDB database server.
Sep 03 18:44:26 ip-172-31-4-41.eu-west-1.compute.internal systemd[1]: Unit mariadb.service entered failed state.
Sep 03 18:44:26 ip-172-31-4-41.eu-west-1.compute.internal systemd[1]: mariadb.service failed.



Is there any timeout parameter to let the node wait for a minute till the other nodes are available? I do not get what the problem is.
Any help really appreciated.

Regards


Philip Stoev

Sep 4, 2016, 2:32:24 AM
to Kernel Panic, codership
Hello,

If you shut down all nodes in the cluster, then you need to start them as if
you were creating a "new" cluster from scratch:

1. Find the node that was shut down last.
2. Start that node using "bootstrap" or the --wsrep-new-cluster option.
Keep my.cnf as it is. This will create a new, 1-node cluster.
3. Start the other nodes; they will join the new cluster, forming a
3-node cluster.
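
The three steps above can be sketched as a dry run that only prints the commands in the order they would be issued. This is a sketch under assumptions that are not stated in the thread: that the MariaDB 10.1 systemd packages provide the galera_new_cluster helper (a wrapper that starts the service with --wsrep-new-cluster), that the nodes are reachable over ssh, and that 172.31.4.41 happened to be the node that stopped last. The function name restart_plan is hypothetical.

```shell
#!/bin/sh
# Dry run: print the commands for a full-cluster restart, in order.
# Nothing is executed on any node.
# Assumptions: galera_new_cluster exists on the nodes (MariaDB 10.1 with
# systemd), the unit is called mariadb, and ssh access is available.
restart_plan() {
    last_stopped=$1
    shift
    # Step 2: bootstrap a new 1-node cluster on the node that stopped last.
    echo "ssh $last_stopped galera_new_cluster"
    # Step 3: the remaining nodes start normally and join, one at a time.
    for node in "$@"; do
        echo "ssh $node systemctl start mariadb"
    done
}

# Hypothetical ordering: assumes 172.31.4.41 was the last node to stop.
restart_plan 172.31.4.41 172.31.24.77 172.31.37.80
```

Printing the plan instead of running it leaves the decision about which node really stopped last to the operator, which is the one step that cannot be guessed safely.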

Philip Stoev

Net Warrior

Sep 4, 2016, 8:53:50 AM
to Philip Stoev, codership
Hi Philip

What I did was, on one node, set wsrep_cluster_address to
gcomm:// and start the service, then start the other two nodes. Then,
back on the first node, I added the IP addresses of the other nodes to
wsrep_cluster_address as it was before and restarted the service. The
nodes are up and running. Did I do it right?

show status like 'wsrep_%'; shows the three nodes ( wsrep_cluster_size=3)


The next time I stop the instances, what's the correct procedure?
Should I execute "service mysql stop" on every node? In any particular order?

Thank you very much
Regards

Philip Stoev

Sep 5, 2016, 2:50:50 AM
to codership, Net Warrior
Hello,

Yes, your procedure is also correct. Setting wsrep_cluster_address
temporarily to gcomm:// is a valid approach; however, you need to make
sure this value is not left behind in my.cnf but is replaced with the proper
list of IPs as soon as possible. Starting the server
with --wsrep-new-cluster provides the same behavior without the need to
modify my.cnf and then revert it, so it is generally safer against
human error.
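
One way to guard against a leftover gcomm:// is a small check that can be run before restarting a node. The sketch below is not part of Galera or MariaDB; the function name has_empty_gcomm is hypothetical, and the path is the server.cnf location shown earlier in the thread. It relies on the fact that a node whose active wsrep_cluster_address is an empty gcomm:// bootstraps a new cluster on every start instead of joining the existing one.

```shell
#!/bin/sh
# Exit status 0 if FILE has an active (uncommented) wsrep_cluster_address
# set to an empty gcomm://, non-zero otherwise. Commented-out lines such as
# "#wsrep_cluster_address='gcomm://'" are ignored.
has_empty_gcomm() {
    grep -Eq "^[[:space:]]*wsrep[_-]cluster[_-]address[[:space:]]*=[[:space:]]*['\"]?gcomm://['\"]?[[:space:]]*\$" "$1"
}

# Usage: warn if the bootstrap-only value was left in the config.
cfg=/etc/my.cnf.d/server.cnf
if [ -r "$cfg" ] && has_empty_gcomm "$cfg"; then
    echo "WARNING: empty gcomm:// left in $cfg"
fi
```

The check only looks at one file, so a setup that spreads Galera settings across several include files would need it run on each of them.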

When stopping the entire cluster, the order in which nodes are stopped does
not matter. You can use "service mysql stop" to stop the individual nodes.
However, starting the cluster must begin from the node that was shut down
last.

hunter86bg

Sep 5, 2016, 2:57:57 AM
to codership, philip...@galeracluster.com
In case your nodes were shut down ungracefully (power failure or something similar), you should do the following:

1. Recover the node position (repeat for all nodes):
  • #mysqld_safe --wsrep_recover
Expected output:
>mysqld_safe WSREP: Recovered position 47179976-e790-11e5-9434-a3dabf845eac:448718701
>mysqld_safe mysqld from pid file /share/mysql/node_name.pid ended
  • Check that the "grastate.dat" file has the same UUID and seqno; if not, edit them to look like this:

# GALERA saved state
version: 2.1
uuid: 47179976-e790-11e5-9434-a3dabf845eac
seqno: 448718701
cert_index:

2. Find which node has the biggest "seqno:" and start that one with "service mysql start --wsrep_new_cluster".
This option instructs the Galera node not to search for other members of the cluster (as it is the first one started, there are no others).
3. Start the rest of the nodes ONE BY ONE using:

  • #service mysql start
or
  • #systemctl start mysql
Note: the second command could time out if an SST is started. To make systemd show the actual status, you can repeat the command once you have access to the node.
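
Steps 1 and 2 above come down to collecting one recovered position per node and comparing the seqno values. A minimal sketch of that comparison follows. The helper names recovered_position and pick_bootstrap_node are hypothetical, as are the node names and seqno values in the second usage line. The sketch assumes every node reports the same UUID and does not verify it.

```shell
#!/bin/sh
# Read log text on stdin and print "uuid seqno" for each
# "Recovered position <uuid>:<seqno>" line found.
recovered_position() {
    sed -n 's/.*Recovered position \([0-9a-f-]*\):\(-\{0,1\}[0-9][0-9]*\).*/\1 \2/p'
}

# Read "node seqno" pairs on stdin (one per line, separated by a single
# space) and print the name of the node with the highest seqno.
pick_bootstrap_node() {
    sort -k2,2n | tail -n 1 | cut -d' ' -f1
}

# The log line is the sample output quoted in the post above.
echo 'mysqld_safe WSREP: Recovered position 47179976-e790-11e5-9434-a3dabf845eac:448718701' | recovered_position
# prints: 47179976-e790-11e5-9434-a3dabf845eac 448718701

# Hypothetical node names and seqno values, for illustration only.
printf '%s\n' 'node1 448718650' 'node2 448718701' 'node3 448718699' | pick_bootstrap_node
# prints: node2
```

If two nodes report the same highest seqno, either of them holds the latest state and can be used to start the new cluster.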