Issue 983 in tungsten-replicator: trepctl flush command can hang if heartbeat event is logged at end of MySQL binlog

tungsten-...@googlecode.com

Aug 8, 2014, 6:07:33 PM
to tungsten-repl...@googlegroups.com
Status: Accepted
Owner: robert.h...@continuent.com
Labels: Type-Defect Priority-High FoundIn-2.2.1

New issue 983 by robert.h...@continuent.com: trepctl flush command can hang
if heartbeat event is logged at end of MySQL binlog
http://code.google.com/p/tungsten-replicator/issues/detail?id=983

What steps will reproduce the problem?

1. Set up a MySQL server with a very small binlog size, e.g., 65K. Here's
the my.cnf setting:

max_binlog_size = 65K

2. Configure Tungsten master/slave replication with the aforesaid MySQL
server as the master.
3. Issue a series of flush commands using 'trepctl flush' or by calling the
flush() JMX API until the binlog turns over.

What is the expected output?

Flush commands should return the sequence number of the corresponding
heartbeat or a higher value.

What do you see instead?

A flush logged at the end of a binlog file results in the following error
when issued through the JMX API:

[junit] junit.framework.AssertionFailedError: failed to exception:
java.lang.Exception: Flush operation failed: State transition failed
causing emergency recovery: state=ONLINE transition=FLUSH event=FlushEvent

What is the possible cause?

It appears that the logic to wait on log position in the replicator
pipeline is flawed.

What is the proposed solution?

Fix it!

Additional information

This error is fairly reproducible in system tests for the replicator, as
they use flush commands extensively. It could cause planned failover to
hang and/or crash, which means it has impact when Tungsten is used for
clustering.


--
You received this message because this project is configured to send all
issue notifications to this address.
You may adjust your notification preferences at:
https://code.google.com/hosting/settings

tungsten-...@googlecode.com

May 29, 2015, 7:59:02 PM
to tungsten-repl...@googlegroups.com

Comment #1 on issue 983 by robert.h...@continuent.com: trepctl flush
command can hang if heartbeat event is logged at end of MySQL binlog
https://code.google.com/p/tungsten-replicator/issues/detail?id=983

The reason for this bug is as follows. When the log rotates, the binlog
position restarts about 120 bytes into the new file on MySQL 5.6, because
the server writes a header first. The flush logic proceeds as follows.

1.) Issue a heartbeat event, which updates the binlog position.
2.) Get the binlog position using SHOW MASTER STATUS.
3.) Wait for the replicator pipeline to see that binlog position.

The problem comes up if the log rotates after it writes the heartbeat
transaction in step #1. In that case the replicator will see the binlog
position from processing the heartbeat, which is in the old log file.
Meanwhile, when we do SHOW MASTER STATUS it will already have proceeded to
the new log file. Unless there is a new event after the heartbeat, the
replicator will never see an event with a binlog position equal to or
greater than the SHOW MASTER STATUS value, so the flush will time out.
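
The race can be illustrated with a small model. This is only a sketch in
Python, not the replicator's actual (Java) code: BinlogPosition,
show_master_status, flush_completes and the mysql-bin.NNNNNN file naming
are all hypothetical names chosen for illustration, and the 120-byte
header size is the approximate figure quoted above.

```python
from typing import NamedTuple

# Approximate size of the header MySQL 5.6 writes at the start of a new
# binlog file (figure taken from the comment above; an assumption here).
HEADER_BYTES = 120


class BinlogPosition(NamedTuple):
    """Binlog coordinates. Tuples compare by file name first, then offset,
    which works because the numeric suffix is zero-padded."""
    file: str
    offset: int


def show_master_status(heartbeat_end: BinlogPosition, rotated: bool) -> BinlogPosition:
    """Model of what SHOW MASTER STATUS reports after the heartbeat (step 2).

    If the log rotated right after the heartbeat was written, the server
    already points just past the header of the next file.
    """
    if not rotated:
        return heartbeat_end
    prefix, index = heartbeat_end.file.rsplit(".", 1)
    next_file = f"{prefix}.{int(index) + 1:0{len(index)}d}"
    return BinlogPosition(next_file, HEADER_BYTES)


def flush_completes(last_seen: BinlogPosition, wait_target: BinlogPosition) -> bool:
    """Model of step 3: the wait ends only once the pipeline has seen an
    event at or beyond the target position."""
    return last_seen >= wait_target
```

With a heartbeat ending at ("mysql-bin.000041", 65400) and no rotation,
the target equals the heartbeat position and the wait ends. With a
rotation, the target becomes ("mysql-bin.000042", 120), which the
heartbeat position never reaches; the wait ends only once some later
event is extracted from the new file.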

This bug can cause severe confusion in testing if you set a small
max_binlog_size value by accident, because tests rely on flush() operations
to decide when it is safe to compare tables. (Ask me how I know.)