There is no code difference from mk-table-sync to pt-table-sync in this
initial PT release, so I suspect something like a race condition is
happening. However, it is also probably a bug in the tool. To
diagnose the problem we would probably need to study the output of
MKDEBUG=1 pt-table-sync <options> very carefully. Can you look at that
and see if you can gain any insight?
--
Chief Performance Architect at Percona <http://www.percona.com/>
> There is no code difference from mk-table-sync to pt-table-sync in this
> initial PT release, so I suspect something like a race condition is
> happening. However, it is also probably a bug in the tool. To
> diagnose the problem we would probably need to study the output of
> MKDEBUG=1 pt-table-sync <options> very carefully. Can you look at that
> and see if you can gain any insight?
>
For clarification, this was a long time ago on an old release, with a traditional master-slave setup and no SSH tunnels.
Thanks for the tip about MKDEBUG=1. Now I can see what's going on!
The problem stems from the fact that we'd set MASTER_HOST="localhost" instead of MASTER_HOST="127.0.0.1".
The MySQL client library (libmysqlclient) implicitly treats "localhost" as a request for the local Unix socket (e.g. /var/lib/mysql/mysql.sock) and ignores any port that was given. So when pt-table-sync opens a connection to "localhost:3307" based on Master_Host and Master_Port, it is really using the local socket and never connects to the remote host.
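The socket-versus-TCP rule described above can be sketched roughly as follows. This is only an illustration of the behaviour we observed, not the library's actual code: the function names, the exact-match test on "localhost", and the default socket path are assumptions made for the example.

```python
# Hypothetical helpers illustrating the behaviour described above;
# none of these names come from libmysqlclient or pt-table-sync.

DEFAULT_SOCKET = "/var/lib/mysql/mysql.sock"  # assumed socket path


def effective_transport(host, port=3306, socket_path=DEFAULT_SOCKET):
    """Return (kind, target) for the connection a client would really open."""
    if host == "localhost":
        # "localhost" selects the local Unix socket; the port is ignored.
        return ("socket", socket_path)
    # Any other host, including 127.0.0.1, goes over TCP to host:port.
    return ("tcp", f"{host}:{port}")


def force_tcp_host(host):
    """Rewrite "localhost" to 127.0.0.1 so the port is actually honoured."""
    return "127.0.0.1" if host == "localhost" else host


if __name__ == "__main__":
    for h in ("localhost", "127.0.0.1"):
        print(h, 3307, "->", effective_transport(h, 3307))
```

With Master_Host set to "localhost" and Master_Port set to 3307, the sketch reports the local socket, which matches what we saw; rewriting the host to 127.0.0.1 makes the port take effect.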
So, for example, MASTER_POS_WAIT was stuck waiting for the slave to catch up to itself. Since the slave was obviously not replicating from itself, it would never catch up.
The error "Called not_in_right in state 0" was due to a deadlock: the tool was trying to lock a chunk of rows on the slave itself instead of on the true master.
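One way to catch this kind of mix-up early is to compare the server IDs reported by the two connections before syncing. The sketch below assumes Python DB-API style cursors (anything with execute() and fetchone()); the function names are made up for illustration, and it relies on the master and slave having distinct server_id values, as replication requires.

```python
# Hypothetical sanity check; not part of pt-table-sync.


def fetch_server_id(cursor):
    """Ask the server behind this cursor for its replication server ID."""
    cursor.execute("SELECT @@server_id")
    row = cursor.fetchone()
    return int(row[0])


def connected_to_same_server(slave_cursor, master_cursor):
    """True if the 'master' connection actually landed on the slave itself."""
    return fetch_server_id(slave_cursor) == fetch_server_id(master_cursor)
```

If connected_to_same_server() returns True, the supposed master connection has ended up on the slave, and waiting on MASTER_POS_WAIT or locking rows there cannot work.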
This also explains some other oddities we were seeing with the replication.
Thanks,
Erik Osterman