Error reading relay log event: slave SQL thread was killed

2,122 views
Skip to first unread message

Luiz Goncalves

unread,
Jun 28, 2016, 9:05:57 AM6/28/16
to PRM-discuss
Hi yves, 

I'm facing an weird issue, which seems to be around for while. 

I have a master/slave pacemaker cluster running version 5.5.49 ( the latest one available )

Percona-Server-client-55-5.5.49-rel37.9.el7.x86_64

Percona-Server-shared-55-5.5.49-rel37.9.el7.x86_64

Percona-Server-server-55-5.5.49-rel37.9.el7.x86_64


The cluster was working good with no issues. I was able to move the master from db1 to db2 and from db2 to db1 normally. 




Then I realised that the replication was not working good anymore. (Attached goes the error logs for db1 and db2) 


I've executed the cmd below on db2 (the current slave) in order to fix the replication issue. After executing this, the issue get's fixed according to SHOW SLAVE STATUS as you can see below on db2 as well.


STOP SLAVE;

SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1;

START SLAVE;




mysql> SHOW SLAVE STATUS\G

*************************** 1. row ***************************

               Slave_IO_State: Waiting for master to send event

                  Master_Host: 10.1.51.14

                  Master_User: repl_user

                  Master_Port: 3306

                Connect_Retry: 60

              Master_Log_File: mysqld-bin.000081

          Read_Master_Log_Pos: 92197903

               Relay_Log_File: mysqld-relay-bin.000002

                Relay_Log_Pos: 5899715

        Relay_Master_Log_File: mysqld-bin.000081

             Slave_IO_Running: Yes

            Slave_SQL_Running: Yes

              Replicate_Do_DB: 

          Replicate_Ignore_DB: 

           Replicate_Do_Table: 

       Replicate_Ignore_Table: 

      Replicate_Wild_Do_Table: 

  Replicate_Wild_Ignore_Table: 

                   Last_Errno: 0

                   Last_Error: 

                 Skip_Counter: 0

          Exec_Master_Log_Pos: 92197903

              Relay_Log_Space: 5899872

              Until_Condition: None

               Until_Log_File: 

                Until_Log_Pos: 0

           Master_SSL_Allowed: No

           Master_SSL_CA_File: 

           Master_SSL_CA_Path: 

              Master_SSL_Cert: 

            Master_SSL_Cipher: 

               Master_SSL_Key: 

        Seconds_Behind_Master: 0

Master_SSL_Verify_Server_Cert: No

                Last_IO_Errno: 0

                Last_IO_Error: 

               Last_SQL_Errno: 0

               Last_SQL_Error: 

  Replicate_Ignore_Server_Ids: 

             Master_Server_Id: 1

1 row in set (0.00 sec)



#my config's 


#  crm configure show

node 1: mysql55db1.xyz.com \

attributes p_mysql_mysql_master_IP=10.1.51.14 \

attributes standby=off

node 2: mysql55db2.xyz.com \

attributes p_mysql_mysql_master_IP=10.1.51.15 \

attributes standby=off

primitive fencedb1 stonith:fence_rhevm \

params port=mysql_5.5_db1 ipaddr=10.1.51.1 login="admin@internal" inet4_only=1 passwd=“xxxx“ ssl=1 ssl_insecure=1 power_wait=5 power_timeout=25 delay=3 pcmk_host_list=mysql55db1.xyz.com pcmk_off_retries=1000 \

op monitor interval=60s \

meta

primitive fencedb2 stonith:fence_rhevm \

params port=mysql_5.5_db2 ipaddr=10.1.51.1 login="admin@internal" inet4_only=1 passwd=“xxxx“ ssl=1 ssl_insecure=1 power_wait=5 power_timeout=25 delay=3 pcmk_host_list=mysql55db2.xyz.com pcmk_off_retries=1000 \

op monitor interval=60s \

meta

primitive p_mysql ocf:percona:mysql_prm \

params config="/etc/my.cnf" pid="/var/run/mysqld/mysqld.pid" socket="/var/lib/mysql/mysql.sock" replication_user=repl_user replication_passwd=xxxx test_user=test_user test_passwd=xxxx max_slave_lag=15 evict_outdated_slaves=false binary="/usr/sbin/mysqld" async_stop=1 prm_binlog_parser_path="/usr/local/bin/prm_binlog_parser" \

op monitor interval=5s role=Master timeout=30s OCF_CHECK_LEVEL=1 \

op monitor interval=2s role=Slave timeout=30s OCF_CHECK_LEVEL=1 \

op start interval=0 timeout=900s \

op stop interval=0 timeout=900s

primitive reader_vip_1 IPaddr2 \

params ip=10.1.51.103 nic=eth0 \

op monitor interval=10s \

meta target-role=Started

primitive writer_vip IPaddr2 \

params ip=10.1.51.102 nic=eth0 \

op monitor interval=10s \

meta target-role=Started

ms ms_MySQL p_mysql \

meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true globally-unique=false target-role=Started is-managed=true

location location-reader_vip_1-1 reader_vip_1 \

rule -inf: readable eq 0

order ms_MySQL_promote_before_vip inf: ms_MySQL:promote writer_vip:start

colocation reader_vip_1_dislike_writer_vip -200: reader_vip_1 writer_vip

colocation writer_vip_on_master inf: writer_vip ms_MySQL:Master

property cib-bootstrap-options: \

have-watchdog=false \

dc-version=1.1.13-10.el7_2.2-44eb2dd \

cluster-infrastructure=corosync \

cluster-name=mysql55cluster \

no-quorum-policy=ignore \

stonith-enabled=true \

last-lrm-refresh=1467003018

property mysql_replication: \

p_mysql_REPL_STATUS="mysqld-bin.000081|92702817|1073741824" \

p_mysql_REPL_INFO="10.1.51.14|mysqld-bin.000081|86298442" \

p_mysql_NEW_MASTER_LAST_TRX="mysqld-bin.000076@|559,72f7599f34bd0740405130bba456560f|1331,532ccaeecfda12b19576e9a41b72cb7d|2125,6df2287eb07b88b9c66a1fe100476077|3074,8aea312135a4cdbfdf0e98a53785ac08|4052,1e3e0c21cd630f496a50dcda694f658b|4851,c6c712c7c9bf4737014b578c5ebd87b8|5241,38a1b0ea945925ea1c26c04f1b1f57cf|5992,10acce6f21bef7253cce44ae2ecba6b9|6977,ead460cf118b4c85d646dd44f5790aab|7369,2e3cc1e8f903e67e0a356642f5b3cd9e|8120,1cae5f0f3f1956548f2e1145bbd6e66f|8871,a7e1865a216914bbf651f768745d4586|9791,22b71c7b08c73cfd8fe23e05f578ed6b|10166,37f9992b5a6eb2a2d1ce6f4b7e2b67d6|11861,1dea2952f7a33444ad8b2e7d160c360a|12244,7980985c20f9128fbb8507e2c93cf1f7|15258,aa4a6195b203869f09c0cf018420b6c1|18877,4378faf7775a5cb4cf066419b9328004|20907,0a0c4617574106d2d9dfd0ac5e8ccc5b|21133,d42e367df50dcb2a8d7e30e90c92682c|21359,a9a58670b402a67d13f6581b8a2aa66a|22132,57e7dcf94077f216e0e28c6c9ac92322|23501,619c2e79dc588b686cd7e4cf6c8f0386|24314,15cce5ea2ad3eea8cf7bfd7057fd0d2f|24551,1e40a499d253430e23d4523382e92a9e|24788,aba64594727e918ebf6772d33a889190|25025,1480dabf3955abb232dfab83c7df9545|25251,2770a4ff22d9dd4bebba0f6b8b1add3e|25760,29db405916edba7bc21f34572066cdd4|26522,efc3598a7d01e5f17a92292baaacef2b|27880,905b7da2279e0d5e085818ed5ebc577e|28682,8b5c0295bd6186199f051a1477ed5d9f|@|"



#crm status


Stack: corosync

Current DC: mysql55db1.xyz.com (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum

2 nodes and 6 resources configured


Online: [ mysql55db1.xyz.com mysql55db2.xyz.com ]


Full list of resources:


 reader_vip_1 (ocf::heartbeat:IPaddr2): Started mysql55db2.xyz.com

 writer_vip (ocf::heartbeat:IPaddr2): Started mysql55db1.xyz.com

 Master/Slave Set: ms_MySQL [p_mysql]

     Masters: [ mysql55db1.xyz.com ]

     Slaves: [ mysql55db2.xyz.com ]

 fencedb1 (stonith:fence_rhevm): Started mysql55db2.xyz.com

 fencedb2 (stonith:fence_rhevm): Started mysql55db1.xyz.com


Failed Actions:

* p_mysql_promote_0 on mysql55db2.xyz.com 'unknown error' (1): call=276, status=Timed Out, exitreason='none',

    last-rc-change='Tue Jun 28 10:14:00 2016', queued=0ms, exec=20002ms

^ This is the error I'm facing now!


When trying to move the master to db2 i've executed on db1 "crm node standby" and the p_mysql_promote_0 to db2 has failed as you can see above . So I've enabled again the db1 with "crm node online" and db1 move back to master normally. The status above is the result after the db1 get's back online.


So, what do you think it's the issue?  Is it a timeout issue or what? 

Also, I've decided to ask you, because I've realised some facts as you can see below. So, I'm confused.

The issue seems to be related to the issues https://github.com/percona/percona-pacemaker-agents/issues/9 and https://github.com/percona/percona-pacemaker-agents/issues/12. The symptoms are the almost the same, but the "fix" seems to be in place, since the rpm build was (39ec4f3  on Apr 1, 2015, according to github).


I've also realised that the version on master is not in sync with the rpm under the packages/build, as you can see below. I have installed this rpm (39ec4f3  on Apr 1, 2015)

 So,

1 - Should I build another rpm from github source in order to apply the following changes on my production environment ?

2 - Is it related to the issue that I'm facing ?


--- /usr/lib/ocf/resource.d/percona/mysql_prm 2015-03-31 18:17:52.000000000 -0300 <--- version install by https://github.com/percona/percona-pacemaker-agents/tree/master/packages/build/percona-agents-1.0.0-1.x86_64.rpm  (39ec4f3  on Apr 1, 2015)
+++ mysql_prm 2016-06-28 08:39:04.024000000 -0300 <--- version on master!
@@ -578,7 +578,7 @@
    cp "$binlog_files" /tmp
 
    let nlines=maxEntries*3+6  # below return 3 lines per Xid + need to skip last 2
-   startAt=`mysqlbinlog -vvv --base64-output=DECODE-ROWS $1 | \
+   startAt=`$MYSQLBINLOG -vvv --base64-output=DECODE-ROWS $1 | \
          grep 'Xid =' -A2 | grep -v '\-\-' | tail -n $nlines | \
          egrep 'Xid|\# at [0-9]{2,10}' | tac | grep -A2 Xid | \
          grep '# at ' | tail -n 1 | rev | cut -d' ' -f1 | rev`
@@ -879,10 +879,26 @@
          # still use the seconds behind master value to set our
          # master preference.
          local master_pref
-         master_pref=$((${OCF_RESKEY_max_slave_lag}-${secs_behind}))
-         if [ $master_pref -lt 0 ]; then
-            # Sanitize a below-zero preference to just zero
+         test $secs_behind -eq 0 2>/dev/null
+         if [ $? -eq 2 ]; then
+            # SBM is undefined or not an integer            
             master_pref=0
+            set_reader_attr 0
+         else
+            master_pref=$((${OCF_RESKEY_max_slave_lag}-${secs_behind}))
+            if [ $master_pref -lt 0 ]; then
+               # Sanitize a below-zero preference to just zero
+               master_pref=0
+            fi
+            
+            if [ $secs_behind -gt $OCF_RESKEY_max_slave_lag ]; then
+               set_reader_attr 0
+            else
+               set_reader_attr 1
+            fi
+            
+            #Edge case verification, check if on the right master
+            set_master nologging
          fi
 
          # Is the datadir almost full
@@ -893,22 +909,6 @@
             set_master_score -2147483640
          fi
       fi
-      
-      # is the slave ok to have a VIP on it
-      test $secs_behind -eq 0 2>/dev/null
-      if [ $? -eq 2 ]; then
-         set_reader_attr 0
-      else
-         if [ $secs_behind -gt $OCF_RESKEY_max_slave_lag ]; then
-            set_reader_attr 0
-         else
-            set_reader_attr 1
-            
-            #Edge case verification, check if on the right master
-            set_master nologging
-         fi
-      fi
-      
       ocf_log debug "MySQL instance running as a replication slave"
    else
       # Instance produced an empty "SHOW SLAVE STATUS" output --
@@ -1356,7 +1356,8 @@
    
    override_async_stop=0
    
-   if [ -n $2 ]; then
+   test $2 -eq 0 2>/dev/null
+   if [ $? -ne 2 ]; then
       override_async_stop=$2
    fi
    
@@ -1424,6 +1425,7 @@
             if [ "$last_crash_ts" -ne "0" ]; then
                if [ `date +%s` -lt "$((${last_crash_ts}+3600))" ]; then
                   # too soon, multiple crash, let's error out
+                  set_master_score 0
                   return $OCF_NOT_RUNNING
                fi
             fi
@@ -1441,6 +1443,7 @@
          else
             $CRM_ATTR -l reboot --name ${INSTANCE_ATTR_NAME}_last_crash -v `date +%s`
             # OCF_ERR_ARGS is a hard error, won't wait for restart
+            set_master_score 0
             return $OCF_ERR_ARGS
          fi
       fi
@@ -2126,25 +2129,28 @@
          demote_host=`echo $OCF_RESKEY_CRM_meta_notify_demote_uname|tr -d " "`
          if [ $demote_host = ${HOSTNAME} ]; then
             ocf_log info "post-demote notification for $demote_host"
-            set_read_only on
-            if [ $? -ne 0 ]; then
-               ocf_log err "Failed to set read-only";
-               return $OCF_ERR_GENERIC;
-            fi
             
-            # Must kill all existing user threads because they are still Read/write
-            # in order for the slaves to complete the read of binlogs
-            local tmpfile
-            tmpfile=`mktemp ${HA_RSCTMP}/threads.${OCF_RESOURCE_INSTANCE}.XXXXXX`
-            mysql_run -Q -sw -O $MYSQL $MYSQL_OPTIONS_REPL \
-            -e "SHOW PROCESSLIST" > $tmpfile
-            
-            for thread in `awk '$0 !~ /Binlog Dump|system user|event_scheduler|SHOW PROCESSLIST/ && $0 ~ /^[0-9]/ {print $1}' $tmpfile`
+            while ! get_read_only
             do
+               # This may timeout (5s) if there are long running queries (issue #44)
+               set_read_only on
+               
+               # Must kill all existing user threads because they are still Read/write
+               # in order for the slaves to complete the read of binlogs
+               local tmpfile
+               tmpfile=`mktemp ${HA_RSCTMP}/threads.${OCF_RESOURCE_INSTANCE}.XXXXXX`
+               mysql_run -Q -sw -O $MYSQL $MYSQL_OPTIONS_REPL \
+               -e "SHOW PROCESSLIST" > $tmpfile
+               
+               local killthreads=""
+               for thread in `awk '$0 !~ /Binlog Dump|system user|event_scheduler|SHOW PROCESSLIST/ && $0 ~ /^[0-9]/ {print $1}' $tmpfile`
+               do
+                   killthreads="${killthreads} KILL ${thread};"
+               done
                mysql_run -Q -sw $MYSQL $MYSQL_OPTIONS_REPL \
-               -e "KILL ${thread}"
+                  -e "${killthreads}"
+               rm -f $tmpfile
             done
-            rm -f $tmpfile
          else
             ocf_log info "Ignoring post-demote notification execpt for my own demotion."
          fi
db2.mysq.err
db1.mysql.err

Yves Trudeau

unread,
Jun 28, 2016, 2:58:14 PM6/28/16
to prm-d...@googlegroups.com
Hi Luis,
   The most useful log here would be the agent trace file, to enable it:


I have customers with the trace enabled all the time with logrotate handling the cleanup/rotation. 

Also, let's review the obvious things:

- Do you have any swapping activity at that time?  You can use sar to figure out.
- Do you have long running write transactions?  These would block the read_only setting.
- You have many errors like: [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013),  is this environment virtualized?  Is the server-id settings different on all nodes?

I'll make sure the latest rpm contains the the latest code, I thought it was the case but I could have forgotten to generate a new one.

Regards,

Yves



--
You received this message because you are subscribed to the Google Groups "PRM-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prm-discuss...@googlegroups.com.
To post to this group, send email to prm-d...@googlegroups.com.
Visit this group at https://groups.google.com/group/prm-discuss.
For more options, visit https://groups.google.com/d/optout.

Luiz Goncalves

unread,
Jun 29, 2016, 12:48:27 AM6/29/16
to PRM-discuss
Hi yves

Answer's below in red! 

Here are the debug logs



Let me know if you need any extra data...

Thanks
-Luiz


Em terça-feira, 28 de junho de 2016 15:58:14 UTC-3, yves escreveu:
Hi Luis,
   The most useful log here would be the agent trace file, to enable it:


I have customers with the trace enabled all the time with logrotate handling the cleanup/rotation. 

Also, let's review the obvious things:

- Do you have any swapping activity at that time?  You can use sar to figure out.
No
 
- Do you have long running write transactions?  These would block the read_only setting.
No. It seems we are good.  Chart attached
- You have many errors like: [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013),  is this environment virtualized?  Is the server-id settings different on all nodes?

It's virtualized, but do you see any problem with it?  yes, I have distinct server-ids (1 and 2) as you can see below?

my.cnf on DB1

[mysqld]

#

# Remove leading # and set to the amount of RAM for the most important data

# cache in MySQL. Start at 70% of total RAM for dedicated server, else 10%.

# innodb_buffer_pool_size = 128M

#

# Remove leading # to turn on a very important data integrity option: logging

# changes to the binary log between backups.

# log_bin

#

# Remove leading # to set options mainly useful for reporting servers.

# The server defaults are faster for transactions and fast SELECTs.

# Adjust sizes as needed, experiment to find the optimal values.

# join_buffer_size = 128M

# sort_buffer_size = 2M

# read_rnd_buffer_size = 2M

#datadir=/var/lib/mysql

datadir=/encryptedfs/mysql

socket=/var/lib/mysql/mysql.sock


# Disabling symbolic-links is recommended to prevent assorted security risks

symbolic-links=0

log-bin

server-id=1

log-slave-updates

binlog_format=ROW


[mysqld_safe]

#log-error=/var/log/mysqld.log

log-error=/encryptedfs/log/mysql

pid-file=/var/run/mysqld/mysqld.pid


my.cnf on DB2


[mysqld]

#

# Remove leading # and set to the amount of RAM for the most important data

# cache in MySQL. Start at 70% of total RAM for dedicated server, else 10%.

# innodb_buffer_pool_size = 128M

#

# Remove leading # to turn on a very important data integrity option: logging

# changes to the binary log between backups.

# log_bin

#

# Remove leading # to set options mainly useful for reporting servers.

# The server defaults are faster for transactions and fast SELECTs.

# Adjust sizes as needed, experiment to find the optimal values.

# join_buffer_size = 128M

# sort_buffer_size = 2M

# read_rnd_buffer_size = 2M

#datadir=/var/lib/mysql

datadir=/encryptedfs/mysql

socket=/var/lib/mysql/mysql.sock


# Disabling symbolic-links is recommended to prevent assorted security risks

symbolic-links=0

log-bin

server-id=2

log-slave-updates

binlog_format=ROW


[mysqld_safe]

#log-error=/var/log/mysqld.log

log-error=/encryptedfs/log/mysql

pid-file=/var/run/mysqld/mysqld.pid








mysql_transaction_charts.png

Yves Trudeau

unread,
Jun 29, 2016, 7:39:57 PM6/29/16
to prm-d...@googlegroups.com
Hi Luiz,
   apparently what timeout is the extraction of the last transactions from the binlog. 

++ mysqlbinlog -vvv --base64-output=DECODE-ROWS /encryptedfs/mysql/mysqld-bin.000074
++ grep 'Xid =' -A2
++ grep -v '\-\-'
++ tail -n 9006
++ egrep 'Xid|\# at [0-9]{2,10}'
++ tac
++ grep -A2 Xid
++ rev
++ grep '# at '
++ tail -n 1
++ cut '-d ' -f1
++ rev
Wed Jun 29 03:26:10 UTC 2016
post - promoted

PRM tries to get the last ~9000 trx to help reconnect slaves if they didn't replicate some of the last trx. You are using encryptedfs and if your VM is low power, that may cause the operation to take too much time if the binlog file is too large.  Can you try using a smaller value for max_binlog_size, the default is 1GB, you can try 100M or even 50M (even smaller if you don't generate too much binlog per day).  I hope this will fix the issue.

Regards,

Yves
Reply all
Reply to author
Forward
0 new messages