MySQL Master Crash, Slave Replication


Kody Coste

Jun 12, 2024, 7:19:48 AM
to tiametiwor

I'm now trying to fix it, but I'm wondering: why can't the slave instance just automatically pick up where it left off when the master server was stopped (for whatever reason)? Why is it implemented in such a way that slave instances completely lose track of what's happening when the master server is restarted?




The relay-log.info file can become corrupted if the replica crashes. For this reason, MySQL 5.6 now offers the option to store the same information in a crash-safe InnoDB table instead of in a file.

The replica can be offline for a long time. If the master purges the binary log file that the replica was reading from when it stopped, the replica can't continue. The replica needs to read a contiguous series of events, or else it can't assure full data replication.
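
One mitigation is to keep binary logs on the master long enough for an offline replica to resume. A my.cnf sketch (the 7-day window is an assumption; size it to your longest expected replica outage):

```ini
# my.cnf on the master (MySQL 5.5/5.6-era option)
[mysqld]
expire_logs_days = 7   # keep binlogs this long before purging; pick your own window
```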

The master might corrupt its binary log files if it crashes. You can reduce this risk by enabling the sync_binlog configuration variable on the master, with the understanding that it decreases performance on the master.
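
A minimal my.cnf sketch of that trade-off (both variables are standard MySQL options; together they make a commit durable at the cost of an fsync each):

```ini
[mysqld]
sync_binlog = 1                     # fsync the binary log at every commit
innodb_flush_log_at_trx_commit = 1  # fsync the InnoDB redo log at every commit
```

Both settings cost throughput on every commit; that is the performance hit referred to above.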

I had this issue on a master-master replication setup from the beginning, on a clean system with clean VMs. It turned out that the replication user account on each server did not have the permissions needed to read the binlog index file on the other server. So the two servers got out of step every time the binlog index number changed, which on my system happens every time the MySQL service restarts or the host server reboots; replication then broke in one way or another depending on which server had rebooted.

Check whether the current active binlog matches what the other server is using to sync from it; you may find that one server is looking for binlog.000006 while the other is already writing a newer log like binlog.000008 or something.

To check the binlog on the slave or the master, run SHOW MASTER STATUS. The result shows the log file name and position that the partner server needs to be reading from. On the other server, run these commands...
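
The commands alluded to usually look like the following; the host address, log file name, and position below are placeholders you would take from SHOW MASTER STATUS on the partner server:

```sql
-- On the server currently acting as master:
SHOW MASTER STATUS;
-- e.g. File: binlog.000008, Position: 10734 (illustrative values)

-- On the other server, repoint replication at that file and position:
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = '192.0.2.10',        -- placeholder address
  MASTER_LOG_FILE = 'binlog.000008', -- File from SHOW MASTER STATUS
  MASTER_LOG_POS = 10734;            -- Position from SHOW MASTER STATUS
START SLAVE;
```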

Because replication is asynchronous, a transaction that has been committed on the master may not have left the master before the crash happens. So there can be an inconsistency between the original master and the slaves, one of which will become the new master.

If it looks like this, everything is fine.
* If Seconds_Behind_Master is > 0, the replica is catching up.
* If Slave_IO_Running is not running, you have an uncommon problem; check the error logs.
* If Slave_SQL_Running is not running, try to start it with "START SLAVE;". If that fails, check whether an error is mentioned in the "Last_Error" line.

Transactions are a problem, especially if you use a transaction-agnostic storage engine like MyISAM. Using InnoDB should work: AFAIK only completed transactions are written to the binlog and thus reach the replicas. This only applies if the storage engine is aware of your transactions.

With the commands given earlier you can check the state of all your slaves and promote the freshest slave (hint: Log_Pos) to master, maybe just temporarily, until all slaves are caught up again, and then promote the designated server.
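
The "freshest slave" comparison can be sketched in a few lines of Python. The input format is my assumption; in practice you would read Relay_Master_Log_File and Exec_Master_Log_Pos from SHOW SLAVE STATUS on each replica:

```python
def freshest_slave(slaves):
    """slaves maps replica name -> (master_log_file, exec_master_log_pos).

    Binlog file names sort lexicographically (binlog.000006 < binlog.000008),
    so a (file, pos) tuple compares correctly as-is.
    """
    return max(slaves, key=lambda name: slaves[name])

# Hypothetical replica names and positions, purely for illustration:
slaves = {
    "replica-a": ("binlog.000006", 120),
    "replica-b": ("binlog.000008", 98),
    "replica-c": ("binlog.000008", 4520),
}
print(freshest_slave(slaves))  # replica-c
```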

I would recommend avoiding MMM like the plague. It is very risky software and causes much more downtime than it prevents. I have extensive experience with it and my company tried to fix its problems, but it's unfixable. I'm not sure if it's appropriate to post a link to my blog post where I explain why this is true. The original author of MMM agrees, by the way, that it's a disaster.

The main tool I'd recommend is mmm. It handles circular replication, multiple slaves, failover and automatic promotion to master (and associated repointing of slaves), all transparent to clients via managed floating IPs, and it works beautifully (I had a primary DB server disappear just last night due to a dead switch, and my clients didn't even notice).

In conjunction with mmm I'd recommend xtrabackup as it can be used as a really fast and elegant way of setting up new slaves (perhaps to replace a machine that died), much faster than loading from an sql dump.

We are building a simple master/slave MySQL configuration using asynchronous replication, with MySQL Enterprise 5.5.17 on both servers and InnoDB-based tables. In case of a crash of the master server we would like to offer our users the possibility of recovering the master using the most up-to-date database contents of the slave. On the master server, the database and the binary logs are stored on different disk devices, for better reliability.

Here is the beauty of this tool: on a slave, it will read the relay logs, look for all queries that have a WHERE clause, convert each to a SELECT and execute it. That way, the InnoDB and MyISAM caches on the slave are essentially the same as on the master. The differences should be minor.
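
The idea can be sketched in Python. This naive regex version is my own illustration, not the actual tool (real prefetchers parse SQL properly); it handles only simple single-table statements:

```python
import re

def prefetch_select(statement):
    """Turn a replicated UPDATE/DELETE into a SELECT on the same WHERE
    clause, so the rows it touches are pulled into the buffer pool
    before the slave's SQL thread executes the real statement.
    Returns None for statements it does not understand."""
    m = re.match(
        r"(?is)^\s*(?:UPDATE\s+(\w+)\s+SET\s+.*?|DELETE\s+FROM\s+(\w+))"
        r"\s+WHERE\s+(.+?)\s*;?\s*$",
        statement,
    )
    if not m:
        return None
    table = m.group(1) or m.group(2)
    return f"SELECT * FROM {table} WHERE {m.group(3)}"

print(prefetch_select("UPDATE users SET name='x' WHERE id = 42;"))
# SELECT * FROM users WHERE id = 42
```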

Suppose M1 crashed, you failover to M2, and you bring up M1. Your goal is to reestablish Circular Replication. With a crash, there is the possibility of replication losing its place. Here is what to do:

Running MySQL 5.6 on Debian Wheezy, we began to experience MySQL replication breakages after abrupt shutdowns or sudden machine crashes. When systems came back up, more often than not, MySQL replication would stop due to corrupted slave relay logs.

master-info-repository=TABLE and relay-log-info-repository=TABLE instruct MySQL to store master and relay log information in the mysql database rather than in separate *.info files in the /var/lib/mysql folder.
This is important because in case of crashes, we would like to ensure that master/relay log information is subject to the same ACID properties that the database itself provides. Corollary: make sure the relevant meta tables have InnoDB as storage engine.
For example, a SHOW CREATE TABLE slave_master_info should say Engine=InnoDB.
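
Put together, the crash-safe settings from this section look like this in my.cnf (relay-log-recovery is a related standard option, commonly enabled alongside them):

```ini
[mysqld]
master-info-repository    = TABLE   # mysql.slave_master_info (InnoDB)
relay-log-info-repository = TABLE   # mysql.slave_relay_log_info (InnoDB)
relay-log-recovery        = 1       # discard and re-fetch relay logs after a crash
```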

What: Binary log group commit
When: MySQL 5.6
Why:
* Transactions being committed near the same time in InnoDB will merge together and be written to the transaction logs as one operation.
* This is important because running MySQL in durable mode no longer causes as large a performance drop.
* With the addition of changes like this (and fast SSDs), I recommend durability to a lot of users.

What: Global Transaction Identifiers (GTIDs)
When: MySQL 5.6
Why:
* Previously it was very difficult when you had a master fail with many slaves. You could not simply pick a new master from one of the slaves and reposition the other slaves under it.
* With GTIDs, failover and maintenance operations that change the topology are much easier.
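
With GTIDs, repointing a slave no longer needs a file/position pair; the hostname below is a placeholder:

```sql
-- Requires gtid_mode=ON and enforce-gtid-consistency on all servers.
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'new-master.example',  -- placeholder host
  MASTER_AUTO_POSITION = 1;            -- GTIDs locate the right position
START SLAVE;
```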

What: MySQL Utilities
When: Independent Release
Why:
* MySQL now releases an official set of MySQL Utilities that provide command-line utilities for maintaining and administering MySQL servers.
* Notably mysqlfailover provides a scripted way to failover when using Replication with GTIDs. MySQL Fabric (Not yet GA; Labs Release Only) provides sharding with High-Availability groups.


In a MySQL 5.7 master-slave setup that uses the default semisynchronous replication setting for rpl_semi_sync_master_wait_point, a crash of the master and failover to the slave is considered to be lossless. However, when the crashed master comes back, you may find that it has transactions that are not present in the current master (which was previously a slave). This behavior may be puzzling, given that semisynchronous replication is supposed to be lossless, but this is actually an expected behavior in MySQL. Why exactly this happens is explained in full detail in the blog post by Jean-François Gagné (JFG).
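
For reference, the setting in question (a my.cnf sketch; AFTER_SYNC is the 5.7 default the post describes):

```ini
[mysqld]
rpl_semi_sync_master_enabled    = 1
# AFTER_SYNC: the master waits for a slave ACK after writing and syncing
# its binlog but before committing to the storage engine. A master that
# crashes inside that window keeps transactions in its binlog that no
# slave ever received -- the "extra transactions" discussed below.
rpl_semi_sync_master_wait_point = AFTER_SYNC
```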

Given such a scenario, MySQL documentation recommends that the crashed master must be discarded and should not be restarted. However, discarding a server like this is expensive and inefficient. In this blog post, we will explain an approach to detect and fix transactions on the crashed MySQL master server in a semisynchronous replication setup, and how to re-slave it back into your master-slave setup.

Another scenario is when your app retries the transaction that had failed during the master crash. On the recovered MySQL master (which is now a slave), this transaction would actually exist and, again, result in a replication error.

In cases where the application does not retry the failed transaction and there are no primary key collisions in future, a replication error may not occur. As a result, the data inconsistency may go undetected.

Now, if we call the GTID_SUBSET function as GTID_SUBSET(GTID set of recovered master, GTID set of new master), the return value will be true only if the recovered master does not have any extra transactions. In our example above, since the recovered master has extra transactions 9691 to 9700, the result of the above query is false.

Based on the above step, it is possible to know if the recovered master has extra transactions, and what these transactions are using the GTID function: GTID_SUBTRACT(GTID set of recovered master, GTID set of new master).
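
What GTID_SUBTRACT computes can be sketched for a single source UUID; the interval encoding follows MySQL's GTID set syntax, and the numbers reuse the example above:

```python
def parse_intervals(gtid_range):
    """Parse 'uuid:1-9700' or 'uuid:1-5:7' into [(lo, hi), ...] intervals."""
    out = []
    for part in gtid_range.split(":")[1:]:
        lo, _, hi = part.partition("-")
        out.append((int(lo), int(hi or lo)))  # single id 'n' means (n, n)
    return out

def subtract(a, b):
    """Transaction ids present in interval list a but not in b
    (what GTID_SUBTRACT returns, for one source UUID)."""
    ids = set()
    for lo, hi in a:
        ids.update(range(lo, hi + 1))
    for lo, hi in b:
        ids.difference_update(range(lo, hi + 1))
    return sorted(ids)

recovered = parse_intervals("uuid:1-9700")  # recovered master, placeholder UUID
new       = parse_intervals("uuid:1-9690")  # new master
extra = subtract(recovered, new)
print(extra[0], extra[-1])  # 9691 9700
```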

It is also possible to extract these extra transactions from the binary logs and save them. It may be useful for your business team to later review these transactions to make sure we are not inadvertently losing any important business information, even though it was uncommitted. Once this is done, we need a way to get rid of these extra transactions so that the recovered master can be re-slaved without issues.
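
One way to extract them (a sketch; the server UUID, GTID range, and binlog file name are placeholders) is with mysqlbinlog's --include-gtids option, available since MySQL 5.6:

```sh
# Dump only the extra transactions for review; substitute the real
# server UUID and the GTID_SUBTRACT result from the previous step.
mysqlbinlog --include-gtids='<server_uuid>:9691-9700' \
    /var/lib/mysql/binlog.000042 > extra-transactions.sql
```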

If your current master crashes, ScaleGrid automates the failover process and promotes a suitable slave as the new master. The old master is then recovered, and we automatically detect if there are extra transactions on it. If any are found, the MySQL deployment is put in a degraded state, and we use automated tools to pull out the extra transactions and save them for your review. Our support team can then restore the old master to a good state and re-slave it back into your master-slave setup, so that you will have a healthy deployment!
