Error reading relay log event: slave SQL thread was killed

2,147 views

Skip to first unread message

Luiz Goncalves

unread,

Jun 28, 2016, 9:05:57 AM6/28/16

to PRM-discuss

Hi yves,

I'm facing an weird issue, which seems to be around for while.

I have a master/slave pacemaker cluster running version 5.5.49 ( the latest one available )

Percona-Server-client-55-5.5.49-rel37.9.el7.x86_64

Percona-Server-shared-55-5.5.49-rel37.9.el7.x86_64

Percona-Server-server-55-5.5.49-rel37.9.el7.x86_64

The cluster was working good with no issues. I was able to move the master from db1 to db2 and from db2 to db1 normally.

Then I realised that the replication was not working good anymore. (Attached goes the error logs for db1 and db2)

I've executed the cmd below on db2 (the current slave) in order to fix the replication issue. After executing this, the issue get's fixed according to SHOW SLAVE STATUS as you can see below on db2 as well.

STOP SLAVE;

SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1;

START SLAVE;

mysql> SHOW SLAVE STATUS\G

*************************** 1. row ***************************

Slave_IO_State: Waiting for master to send event

Master_Host: 10.1.51.14

Master_User: repl_user

Master_Port: 3306

Connect_Retry: 60

Master_Log_File: mysqld-bin.000081

Read_Master_Log_Pos: 92197903

Relay_Log_File: mysqld-relay-bin.000002

Relay_Log_Pos: 5899715

Relay_Master_Log_File: mysqld-bin.000081

Slave_IO_Running: Yes

Slave_SQL_Running: Yes

Replicate_Do_DB:

Replicate_Ignore_DB:

Replicate_Do_Table:

Replicate_Ignore_Table:

Replicate_Wild_Do_Table:

Replicate_Wild_Ignore_Table:

Last_Errno: 0

Last_Error:

Skip_Counter: 0

Exec_Master_Log_Pos: 92197903

Relay_Log_Space: 5899872

Until_Condition: None

Until_Log_File:

Until_Log_Pos: 0

Master_SSL_Allowed: No

Master_SSL_CA_File:

Master_SSL_CA_Path:

Master_SSL_Cert:

Master_SSL_Cipher:

Master_SSL_Key:

Seconds_Behind_Master: 0

Master_SSL_Verify_Server_Cert: No

Last_IO_Errno: 0

Last_IO_Error:

Last_SQL_Errno: 0

Last_SQL_Error:

Replicate_Ignore_Server_Ids:

Master_Server_Id: 1

1 row in set (0.00 sec)

#my config's

# crm configure show

node 1: mysql55db1.xyz.com \

attributes p_mysql_mysql_master_IP=10.1.51.14 \

attributes standby=off

node 2: mysql55db2.xyz.com \

attributes p_mysql_mysql_master_IP=10.1.51.15 \

attributes standby=off

primitive fencedb1 stonith:fence_rhevm \

params port=mysql_5.5_db1 ipaddr=10.1.51.1 login="admin@internal" inet4_only=1 passwd=“xxxx“ ssl=1 ssl_insecure=1 power_wait=5 power_timeout=25 delay=3 pcmk_host_list=mysql55db1.xyz.com pcmk_off_retries=1000 \

op monitor interval=60s \

meta

primitive p_mysql ocf:percona:mysql_prm \

params config="/etc/my.cnf" pid="/var/run/mysqld/mysqld.pid" socket="/var/lib/mysql/mysql.sock" replication_user=repl_user replication_passwd=xxxx test_user=test_user test_passwd=xxxx max_slave_lag=15 evict_outdated_slaves=false binary="/usr/sbin/mysqld" async_stop=1 prm_binlog_parser_path="/usr/local/bin/prm_binlog_parser" \

op monitor interval=5s role=Master timeout=30s OCF_CHECK_LEVEL=1 \

op monitor interval=2s role=Slave timeout=30s OCF_CHECK_LEVEL=1 \

op start interval=0 timeout=900s \

op stop interval=0 timeout=900s

primitive reader_vip_1 IPaddr2 \

params ip=10.1.51.103 nic=eth0 \

op monitor interval=10s \

meta target-role=Started

primitive writer_vip IPaddr2 \

params ip=10.1.51.102 nic=eth0 \

op monitor interval=10s \

meta target-role=Started

ms ms_MySQL p_mysql \

meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true globally-unique=false target-role=Started is-managed=true

location location-reader_vip_1-1 reader_vip_1 \

rule -inf: readable eq 0

order ms_MySQL_promote_before_vip inf: ms_MySQL:promote writer_vip:start

colocation reader_vip_1_dislike_writer_vip -200: reader_vip_1 writer_vip

colocation writer_vip_on_master inf: writer_vip ms_MySQL:Master

property cib-bootstrap-options: \

have-watchdog=false \

dc-version=1.1.13-10.el7_2.2-44eb2dd \

cluster-infrastructure=corosync \

cluster-name=mysql55cluster \

no-quorum-policy=ignore \

stonith-enabled=true \

last-lrm-refresh=1467003018

property mysql_replication: \

p_mysql_REPL_STATUS="mysqld-bin.000081|92702817|1073741824" \

p_mysql_REPL_INFO="10.1.51.14|mysqld-bin.000081|86298442" \

#crm status

Stack: corosync

Current DC: mysql55db1.xyz.com (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum

2 nodes and 6 resources configured

Online: [ mysql55db1.xyz.com mysql55db2.xyz.com ]

Full list of resources:

reader_vip_1 (ocf::heartbeat:IPaddr2): Started mysql55db2.xyz.com

writer_vip (ocf::heartbeat:IPaddr2): Started mysql55db1.xyz.com

Master/Slave Set: ms_MySQL [p_mysql]

Masters: [ mysql55db1.xyz.com ]

Slaves: [ mysql55db2.xyz.com ]

fencedb1 (stonith:fence_rhevm): Started mysql55db2.xyz.com

fencedb2 (stonith:fence_rhevm): Started mysql55db1.xyz.com

Failed Actions:

* p_mysql_promote_0 on mysql55db2.xyz.com 'unknown error' (1): call=276, status=Timed Out, exitreason='none',

last-rc-change='Tue Jun 28 10:14:00 2016', queued=0ms, exec=20002ms

^ This is the error I'm facing now!

When trying to move the master to db2 i've executed on db1 "crm node standby" and the p_mysql_promote_0 to db2 has failed as you can see above . So I've enabled again the db1 with "crm node online" and db1 move back to master normally. The status above is the result after the db1 get's back online.

So, what do you think it's the issue? Is it a timeout issue or what?

Also, I've decided to ask you, because I've realised some facts as you can see below. So, I'm confused.

The issue seems to be related to the issues https://github.com/percona/percona-pacemaker-agents/issues/9 and https://github.com/percona/percona-pacemaker-agents/issues/12. The symptoms are the almost the same, but the "fix" seems to be in place, since the rpm build was (39ec4f3 on Apr 1, 2015, according to github).

I've also realised that the version on master is not in sync with the rpm under the packages/build, as you can see below. I have installed this rpm (39ec4f3 on Apr 1, 2015)

So,

1 - Should I build another rpm from github source in order to apply the following changes on my production environment ?

2 - Is it related to the issue that I'm facing ?

--- /usr/lib/ocf/resource.d/percona/mysql_prm 2015-03-31 18:17:52.000000000 -0300 <--- version install by https://github.com/percona/percona-pacemaker-agents/tree/master/packages/build/percona-agents-1.0.0-1.x86_64.rpm (39ec4f3 on Apr 1, 2015)

+++ mysql_prm 2016-06-28 08:39:04.024000000 -0300 <--- version on master!

@@ -578,7 +578,7 @@

cp "$binlog_files" /tmp

let nlines=maxEntries*3+6 # below return 3 lines per Xid + need to skip last 2

- startAt=`mysqlbinlog -vvv --base64-output=DECODE-ROWS $1 | \

+ startAt=`$MYSQLBINLOG -vvv --base64-output=DECODE-ROWS $1 | \

grep 'Xid =' -A2 | grep -v '\-\-' | tail -n $nlines | \

egrep 'Xid|\# at [0-9]{2,10}' | tac | grep -A2 Xid | \

grep '# at ' | tail -n 1 | rev | cut -d' ' -f1 | rev`

@@ -879,10 +879,26 @@

# still use the seconds behind master value to set our

# master preference.

local master_pref

- master_pref=$((${OCF_RESKEY_max_slave_lag}-${secs_behind}))

- if [ $master_pref -lt 0 ]; then

- # Sanitize a below-zero preference to just zero

+ test $secs_behind -eq 0 2>/dev/null

+ if [ $? -eq 2 ]; then

+ # SBM is undefined or not an integer

master_pref=0

+ set_reader_attr 0

+ else

+ master_pref=$((${OCF_RESKEY_max_slave_lag}-${secs_behind}))

+ if [ $master_pref -lt 0 ]; then

+ # Sanitize a below-zero preference to just zero

+ master_pref=0

+ fi

+ if [ $secs_behind -gt $OCF_RESKEY_max_slave_lag ]; then

+ set_reader_attr 0

+ else

+ set_reader_attr 1

+ fi

+ #Edge case verification, check if on the right master

+ set_master nologging

# Is the datadir almost full

@@ -893,22 +909,6 @@

set_master_score -2147483640

- # is the slave ok to have a VIP on it

- test $secs_behind -eq 0 2>/dev/null

- if [ $? -eq 2 ]; then

- set_reader_attr 0

- else

- if [ $secs_behind -gt $OCF_RESKEY_max_slave_lag ]; then

- set_reader_attr 0

- else

- set_reader_attr 1

- #Edge case verification, check if on the right master

- set_master nologging

- fi

ocf_log debug "MySQL instance running as a replication slave"

else

# Instance produced an empty "SHOW SLAVE STATUS" output --

@@ -1356,7 +1356,8 @@

override_async_stop=0

- if [ -n $2 ]; then

+ test $2 -eq 0 2>/dev/null

+ if [ $? -ne 2 ]; then

override_async_stop=$2

@@ -1424,6 +1425,7 @@

if [ "$last_crash_ts" -ne "0" ]; then

if [ `date +%s` -lt "$((${last_crash_ts}+3600))" ]; then

# too soon, multiple crash, let's error out

+ set_master_score 0

return $OCF_NOT_RUNNING

@@ -1441,6 +1443,7 @@

else

$CRM_ATTR -l reboot --name ${INSTANCE_ATTR_NAME}_last_crash -v `date +%s`

# OCF_ERR_ARGS is a hard error, won't wait for restart

+ set_master_score 0

return $OCF_ERR_ARGS

@@ -2126,25 +2129,28 @@

demote_host=`echo $OCF_RESKEY_CRM_meta_notify_demote_uname|tr -d " "`

if [ $demote_host = ${HOSTNAME} ]; then

ocf_log info "post-demote notification for $demote_host"

- set_read_only on

- if [ $? -ne 0 ]; then

- ocf_log err "Failed to set read-only";

- return $OCF_ERR_GENERIC;

- fi

- # Must kill all existing user threads because they are still Read/write

- # in order for the slaves to complete the read of binlogs

- local tmpfile

- tmpfile=`mktemp ${HA_RSCTMP}/threads.${OCF_RESOURCE_INSTANCE}.XXXXXX`

- mysql_run -Q -sw -O $MYSQL $MYSQL_OPTIONS_REPL \

- -e "SHOW PROCESSLIST" > $tmpfile

- for thread in `awk '$0 !~ /Binlog Dump|system user|event_scheduler|SHOW PROCESSLIST/ && $0 ~ /^[0-9]/ {print $1}' $tmpfile`

+ while ! get_read_only

+ # This may timeout (5s) if there are long running queries (issue #44)

+ set_read_only on

+ # Must kill all existing user threads because they are still Read/write

+ # in order for the slaves to complete the read of binlogs

+ local tmpfile

+ tmpfile=`mktemp ${HA_RSCTMP}/threads.${OCF_RESOURCE_INSTANCE}.XXXXXX`

+ mysql_run -Q -sw -O $MYSQL $MYSQL_OPTIONS_REPL \

+ -e "SHOW PROCESSLIST" > $tmpfile

+ local killthreads=""

+ for thread in `awk '$0 !~ /Binlog Dump|system user|event_scheduler|SHOW PROCESSLIST/ && $0 ~ /^[0-9]/ {print $1}' $tmpfile`

+ do

+ killthreads="${killthreads} KILL ${thread};"

+ done

mysql_run -Q -sw $MYSQL $MYSQL_OPTIONS_REPL \

- -e "KILL ${thread}"

+ -e "${killthreads}"

+ rm -f $tmpfile

done

- rm -f $tmpfile

else

ocf_log info "Ignoring post-demote notification execpt for my own demotion."

db2.mysq.err

db1.mysql.err

Yves Trudeau

unread,

Jun 28, 2016, 2:58:14 PM6/28/16

to prm-d...@googlegroups.com

Hi Luis,

The most useful log here would be the agent trace file, to enable it:

https://github.com/percona/percona-pacemaker-agents/blob/master/doc/PRM-setup-guide.rst#enabling-trace-in-the-resource-agent

I have customers with the trace enabled all the time with logrotate handling the cleanup/rotation.

Also, let's review the obvious things:

- Do you have any swapping activity at that time? You can use sar to figure out.

- Do you have long running write transactions? These would block the read_only setting.

- You have many errors like: [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013), is this environment virtualized? Is the server-id settings different on all nodes?

I'll make sure the latest rpm contains the the latest code, I thought it was the case but I could have forgotten to generate a new one.

Regards,

Yves

--
You received this message because you are subscribed to the Google Groups "PRM-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prm-discuss...@googlegroups.com.
To post to this group, send email to prm-d...@googlegroups.com.
Visit this group at https://groups.google.com/group/prm-discuss.
For more options, visit https://groups.google.com/d/optout.

Luiz Goncalves

unread,

Jun 29, 2016, 12:48:27 AM6/29/16

to PRM-discuss

Hi yves

Answer's below in red!

Here are the debug logs

db1.log.debug -> https://drive.google.com/file/d/0B0hN67y8o50xTjhqV0hzdkFycW8/view?usp=sharing

db2.log.debub -> https://drive.google.com/file/d/0B0hN67y8o50xcm1SOUpJMHktcXM/view?usp=sharing

Let me know if you need any extra data...

Thanks

-Luiz

Em terça-feira, 28 de junho de 2016 15:58:14 UTC-3, yves escreveu:

Hi Luis,
The most useful log here would be the agent trace file, to enable it:

https://github.com/percona/percona-pacemaker-agents/blob/master/doc/PRM-setup-guide.rst#enabling-trace-in-the-resource-agent

I have customers with the trace enabled all the time with logrotate handling the cleanup/rotation.

Also, let's review the obvious things:

- Do you have any swapping activity at that time? You can use sar to figure out.

- Do you have long running write transactions? These would block the read_only setting.

No. It seems we are good. Chart attached

- You have many errors like: [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013), is this environment virtualized? Is the server-id settings different on all nodes?

It's virtualized, but do you see any problem with it? yes, I have distinct server-ids (1 and 2) as you can see below?

my.cnf on DB1

[mysqld]

# Remove leading # and set to the amount of RAM for the most important data

# cache in MySQL. Start at 70% of total RAM for dedicated server, else 10%.

# innodb_buffer_pool_size = 128M

# Remove leading # to turn on a very important data integrity option: logging

# changes to the binary log between backups.

# log_bin

# Remove leading # to set options mainly useful for reporting servers.

# The server defaults are faster for transactions and fast SELECTs.

# Adjust sizes as needed, experiment to find the optimal values.

# join_buffer_size = 128M

# sort_buffer_size = 2M

# read_rnd_buffer_size = 2M

#datadir=/var/lib/mysql

datadir=/encryptedfs/mysql

socket=/var/lib/mysql/mysql.sock

# Disabling symbolic-links is recommended to prevent assorted security risks

symbolic-links=0

log-bin

server-id=1

log-slave-updates

binlog_format=ROW

[mysqld_safe]

#log-error=/var/log/mysqld.log

log-error=/encryptedfs/log/mysql

pid-file=/var/run/mysqld/mysqld.pid

my.cnf on DB2

[mysqld]

# Remove leading # and set to the amount of RAM for the most important data

# cache in MySQL. Start at 70% of total RAM for dedicated server, else 10%.

# innodb_buffer_pool_size = 128M

# Remove leading # to turn on a very important data integrity option: logging

# changes to the binary log between backups.

# log_bin

# Remove leading # to set options mainly useful for reporting servers.

# The server defaults are faster for transactions and fast SELECTs.

# Adjust sizes as needed, experiment to find the optimal values.

# join_buffer_size = 128M

# sort_buffer_size = 2M

# read_rnd_buffer_size = 2M

#datadir=/var/lib/mysql

datadir=/encryptedfs/mysql

socket=/var/lib/mysql/mysql.sock

# Disabling symbolic-links is recommended to prevent assorted security risks

symbolic-links=0

log-bin

server-id=2

log-slave-updates

binlog_format=ROW

[mysqld_safe]

#log-error=/var/log/mysqld.log

log-error=/encryptedfs/log/mysql

pid-file=/var/run/mysqld/mysqld.pid

mysql_transaction_charts.png

Yves Trudeau

unread,

Jun 29, 2016, 7:39:57 PM6/29/16

to prm-d...@googlegroups.com

Hi Luiz,

apparently what timeout is the extraction of the last transactions from the binlog.

++ mysqlbinlog -vvv --base64-output=DECODE-ROWS /encryptedfs/mysql/mysqld-bin.000074

++ grep 'Xid =' -A2

++ grep -v '\-\-'

++ tail -n 9006

++ egrep 'Xid|\# at [0-9]{2,10}'

++ tac

++ grep -A2 Xid

++ rev

++ grep '# at '

++ tail -n 1

++ cut '-d ' -f1

++ rev

Wed Jun 29 03:26:10 UTC 2016

post - promoted

PRM tries to get the last ~9000 trx to help reconnect slaves if they didn't replicate some of the last trx. You are using encryptedfs and if your VM is low power, that may cause the operation to take too much time if the binlog file is too large. Can you try using a smaller value for max_binlog_size, the default is 1GB, you can try 100M or even 50M (even smaller if you don't generate too much binlog per day). I hope this will fix the issue.

Regards,

Yves

Reply all

Reply to author

Forward

0 new messages