Galera crashes MariaDB after update from 10.1.24 to 10.1.28

Lammert Bies

Oct 10, 2017, 3:11:11 PM
to codership
Hi All,

I run a configuration of five MariaDB servers that use Galera Cluster for replication. The servers are located in separate data centers and run MariaDB 10.1.24. The environment for all servers is CentOS 7. MariaDB has been installed from the official repository with yum.

The configuration has run flawlessly for several months. I have added and deleted servers about once a month and state transfer always went automatically without errors.

Recently I used yum to update two of the servers to MariaDB 10.1.28.

Since the update, the two updated servers refuse to start, even when I try to start them as a new primary server. There is a somewhat unhelpful "ERROR" message when I run systemctl status mysql.service. The other three servers are still running the old version and doing fine, but I would like to add these two servers to the cluster again and upgrade the whole cluster to 10.1.28.

This is the error message from systemctl status mysql.service:

Oct 10 20:57:59 cluster001 systemd[1]: Starting LSB: start and stop MySQL...
Oct 10 20:57:59 cluster001 mysql[9773]: Starting MySQL.171010 20:57:59 mysqld_safe Logging to '/var/lib/mysql//cluster001.err'.
Oct 10 20:57:59 cluster001 mysql[9773]: 171010 20:57:59 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql/
Oct 10 20:58:03 cluster001 mysql[9773]: . ERROR!
Oct 10 20:58:03 cluster001 systemd[1]: mysql.service: control process exited, code=exited status=1
Oct 10 20:58:03 cluster001 systemd[1]: Failed to start LSB: start and stop MySQL.
Oct 10 20:58:03 cluster001 systemd[1]: Unit mysql.service entered failed state.
Oct 10 20:58:03 cluster001 systemd[1]: mysql.service failed.

The error file in /var/lib/mysql doesn't give much more information:

171010 20:57:59 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql/
171010 20:57:59 mysqld_safe WSREP: Running position recovery with --disable-log-error  --pid-file='/var/lib/mysql//cluster001-recover.pid'
2017-10-10 20:57:59 140432891656448 [Note] /usr/sbin/mysqld (mysqld 10.1.28-MariaDB) starting as process 9924 ...
2017-10-10 20:57:59 140432891656448 [Note] InnoDB: Using mutexes to ref count buffer pool pages
2017-10-10 20:57:59 140432891656448 [Note] InnoDB: The InnoDB memory heap is disabled
2017-10-10 20:57:59 140432891656448 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2017-10-10 20:57:59 140432891656448 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2017-10-10 20:57:59 140432891656448 [Note] InnoDB: Compressed tables use zlib 1.2.7
2017-10-10 20:57:59 140432891656448 [Note] InnoDB: Using Linux native AIO
2017-10-10 20:57:59 140432891656448 [Note] InnoDB: Using SSE crc32 instructions
2017-10-10 20:57:59 140432891656448 [Note] InnoDB: Initializing buffer pool, size = 200.0M
2017-10-10 20:57:59 140432891656448 [Note] InnoDB: Completed initialization of buffer pool
2017-10-10 20:57:59 140432891656448 [Note] InnoDB: Highest supported file format is Barracuda.
2017-10-10 20:57:59 140432891656448 [Note] InnoDB: 128 rollback segment(s) are active.
2017-10-10 20:57:59 140432891656448 [Note] InnoDB: Waiting for purge to start
2017-10-10 20:57:59 140432891656448 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.36-82.2 started; log sequence number 144493478647
2017-10-10 20:57:59 140432891656448 [Warning] InnoDB: Skipping buffer pool dump/restore during wsrep recovery.
2017-10-10 20:57:59 140432891656448 [Note] Plugin 'FEEDBACK' is disabled.
2017-10-10 20:58:00 140432891656448 [Note] Server socket created on IP: '::'.
2017-10-10 20:58:00 140432891656448 [Note] WSREP: Recovered position: 749f0e44-dfcb-11e6-bf8b-a33c517bf8f7:13575567

And then the log file stops without any visible error.
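
For reference, the last log line above is the output of the position recovery step that mysqld_safe announces at the top of the log. The following is a minimal sketch of how the recovered UUID and seqno could be pulled out of such a log; the function name recovered_position is hypothetical, and this is not the code that mysqld_safe itself uses.

```shell
#!/bin/sh
# Hypothetical helper (not part of MariaDB): read an error log on stdin
# and print the "uuid:seqno" from the last "WSREP: Recovered position:"
# line. Prints nothing if no such line is present.
recovered_position() {
    sed -n 's/.*WSREP: Recovered position:[[:space:]]*//p' | tail -n 1
}
```

It could be used as `recovered_position < /var/lib/mysql/cluster001.err`. An empty result would suggest that the recovery step never reported a position; a non-empty one, as in the log above, suggests the failure happens after that step.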

My server configuration for this instance is:

#
# These groups are read by MariaDB server.
# Use it for options that only the server (but not clients) should see
#
# See the examples of server my.cnf files in /usr/share/mysql/
#
# this is read by the standalone daemon and embedded servers
[server]
# this is only for the mysqld standalone daemon
[mysqld]
datadir=/var/lib/mysql/
ssl-ca=/etc/ssl/mysql/ca-cert.pem
ssl-key=/etc/ssl/mysql/server-key.pem
ssl-cert=/etc/ssl/mysql/server-cert.pem
innodb_buffer_pool_size=200M
#
# * Galera-related settings
#
[galera]
# Mandatory settings
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
# wsrep_cluster_address="gcomm://cluster001,cluster002,cluster003,cluster004,cluster005"
wsrep_cluster_address="gcomm://"
binlog_format=row
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
#
# Allow server to accept connections on all interfaces.
#
bind-address=::
#
# Optional setting
#wsrep_slave_threads=1
#innodb_flush_log_at_trx_commit=0
wsrep_provider_options="socket.ssl_key=/etc/ssl/mysql/server-key.pem;socket.ssl_cert=/etc/ssl/mysql/server-cert.pem;socket.ssl_ca=/etc/ssl/mysql/ca-cert.pem;gmcast.listen_addr=tcp://[::]:4567;pc.weight=100;"
wsrep_cluster_name="Linocomm_cluster"
wsrep_node_address=cluster001
wsrep_sst_method=xtrabackup-v2
wsrep_sst_auth="cluster:password"
[sst]
encrypt=3
tca=/etc/ssl/mysql/ca-cert.pem
tkey=/etc/ssl/mysql/server-key.pem
tcert=/etc/ssl/mysql/server-cert.pem
# sockopt=,pf=ip6
# this is only for embedded server
[embedded]
# This group is only read by MariaDB servers, not by MySQL.
# If you use the same .cnf file for MySQL and MariaDB,
# you can put MariaDB-only options here
[mariadb]
# This group is only read by MariaDB-10.1 servers.
# If you use the same .cnf file for MariaDB of different versions,
# use this group for options that older servers don't understand
[mariadb-10.1]

I tried switching to IPv4 by changing :: and [::] to 0.0.0.0, but that didn't have any effect.

Disabling replication with wsrep_on=OFF causes the servers to start without issues. This leads me to believe that the problem is somewhere in the Galera part of the setup, rather than in the newer version of MariaDB itself.

Any help debugging this issue would be appreciated.

alexey.y...@galeracluster.com

Oct 11, 2017, 3:01:06 AM
to Lammert Bies, codership
Try to start the server manually in the foreground (not through the
service command). If it starts fine, then you may want to check the
difference in startup scripts between 10.1.24 and 10.1.28.
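
One way to do the script comparison is sketched below. The function compare_startup_scripts is a hypothetical helper, and the paths in the usage note are assumptions: the sketch presumes that the 10.1.24 copy of the script has been fetched from one of the nodes that was not upgraded.

```shell
#!/bin/sh
# Hypothetical helper: print a unified diff between two versions of a
# startup script. Returns 0 if they are identical, 1 if they differ,
# and 2 if either file cannot be read (matching diff's exit codes).
compare_startup_scripts() {
    old_script=$1
    new_script=$2
    if [ ! -r "$old_script" ] || [ ! -r "$new_script" ]; then
        echo "cannot read $old_script or $new_script" >&2
        return 2
    fi
    diff -u "$old_script" "$new_script"
}
```

For example, `compare_startup_scripts /tmp/mysqld_safe.10.1.24 /usr/bin/mysqld_safe`, where the first path is an assumed location for the copied 10.1.24 script.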

Lammert Bies

Oct 11, 2017, 9:56:07 PM
to codership
Thanks for the suggestion, Alexey,

I will try to debug the situation by starting manually and report back here with the results.

JFD

Oct 13, 2017, 2:49:59 AM
to codership
See the following ticket:
mysqld_safe could not start Galera node after upgrade to 10.1.28 or 10.2.9

MariaDB 10.1.28 cannot start with Galera enabled. Patching mysqld_safe works as a workaround; otherwise, 10.1.29 will be released with a fix.
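
To check whether a given node is running one of the releases named in the ticket title, something like the following could be used. This is a minimal sketch: is_affected_version is a hypothetical helper, and it assumes that only the two versions listed in the title (10.1.28 and 10.2.9) are affected.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the version string names one
# of the two releases listed in the ticket title, with or without a
# suffix such as "-MariaDB"; fail (exit 1) for anything else.
is_affected_version() {
    case $1 in
        10.1.28|10.1.28-*|10.2.9|10.2.9-*) return 0 ;;
        *) return 1 ;;
    esac
}
```

The version string is the one printed in the error log, for example `is_affected_version 10.1.28-MariaDB && echo "affected"`.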

Hope this helps,
JF

Lammert Bies

Oct 15, 2017, 3:30:46 PM
to codership
Thanks for the additional info JFD,

The bug you linked to looks like the same issue I am having.
My servers are currently running after a manual start with nohup mysqld -u mysql & and I'll wait until 10.1.29 is released.

Lammert Bies

Nov 20, 2017, 6:36:35 PM
to codership
A follow-up on this issue:

I upgraded to 10.1.29 today and the problem disappeared. Thanks again, Alexey and JFD, for your assistance.