Galera MariaDB HAProxy asynchronous slave


Johnny Antonsen

Jun 25, 2014, 9:39:14 AM
to codersh...@googlegroups.com
Hi!

I'm working on a setup like this:

The problem I'm having is that the MariaDB slave keeps getting this message once HAProxy moves the connection to a new host (HAProxy is configured for round robin):
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 1-4-21377, which is not in the master's binlog'
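
For reference, the HAProxy listener in front of the cluster is a plain round-robin TCP proxy, roughly along these lines (backend names and addresses here are placeholders, not my actual config):

listen galera_cluster
    bind *:3306
    mode tcp
    balance roundrobin
    server node1 192.168.0.11:3306 check
    server node2 192.168.0.12:3306 check
    server node3 192.168.0.13:3306 check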

Could someone give me some pointers on why this is failing? As far as I can see the cluster is in sync and configured correctly, and after googling furiously for three days I'm starting to run out of options. Maybe it's not even possible to do it like this?

Jay Janssen

Jun 25, 2014, 9:51:17 AM
to Johnny Antonsen, codersh...@googlegroups.com
This is 5.5?   Your slave connects via the keepalived VIP?

There is nothing in Galera guaranteeing that all nodes have the same binlog names and positions, so this sort of failover won't work automatically.  You can manage a failover like this: http://www.mysqlperformanceblog.com/2013/06/21/changing-an-async-slave-of-a-pxc-cluster-to-a-new-master/ (this should work fine on MGC 5.5 as well).
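
The general shape of that manual move is to stop the slave, work out the matching binlog coordinates on the node you're failing over to, and re-point the slave by hand; very roughly (host, file and position below are placeholders, the linked post covers finding the real ones):

STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST='new-master-node',
  MASTER_USER='replicator',
  MASTER_PASSWORD='...',
  MASTER_LOG_FILE='mariadb-bin.000016',  -- coordinates on the *new* master
  MASTER_LOG_POS=360;                    -- matching what the slave has already executed
START SLAVE;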


In 5.6/10, Galera should ensure the async GTIDs are the same on all the nodes, but you would still need to use CHANGE MASTER TO MASTER_AUTO_POSITION=1, …;  I wouldn’t expect a VIP to handle the slave’s master failover for you automatically, sorry.
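
Spelled out, that would be roughly the following (host is a placeholder; this is the MySQL/PXC 5.6 syntax, I can't speak for what Maria accepts):

STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST='other-cluster-node',
  MASTER_AUTO_POSITION=1;
START SLAVE;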


Jay Janssen
Sign up now for Percona XtraDB Cluster/Galera training in San Jose, CA 


Johnny Antonsen

Jun 25, 2014, 10:00:47 AM
to codersh...@googlegroups.com, 3qua...@gmail.com
Yes, sorry, this is MariaDB 10.0.11 on Ubuntu 14.04. The slave connects to the keepalived VIP, and HAProxy decides which node the connection is directed to.

So running CHANGE MASTER TO MASTER_AUTO_POSITION=1 on the slave doesn't seem to work. Maybe because it isn't supported on MariaDB 10.0.11? The slave reports the following in SHOW SLAVE STATUS.

MariaDB [(none)]> show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: X.X.X.X
                  Master_User: replicator
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000016
          Read_Master_Log_Pos: 360
               Relay_Log_File: relay-bin.000002
                Relay_Log_Pos: 649
        Relay_Master_Log_File: mariadb-bin.000016
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: mysql,replicatetest
           Replicate_Do_Table: 
       Replicate_Ignore_Table: MA4_Data.EOD_FileCache
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 360
              Relay_Log_Space: 940
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
                   Using_Gtid: Current_Pos
                  Gtid_IO_Pos: 1-4-21377,4-4-6647,3-3-418,2-3-420


However, after running for a little while it makes its next connection to the master (I noticed this by seeing a change in the Master_Server_Id value, which was initially set to 1, 2 and 3 on each node but was later changed to 1 on all nodes per suggestions). When it reconnects, HAProxy redirects it to the next node. This is the output I get:

MariaDB [(none)]> show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State: 
                  Master_Host: X.X.X.X
                  Master_User: replicator
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000014
          Read_Master_Log_Pos: 360
               Relay_Log_File: relay-bin.000003
                Relay_Log_Pos: 688
        Relay_Master_Log_File: mariadb-bin.000012
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: mysql,replicatetest
           Replicate_Do_Table: 
       Replicate_Ignore_Table: MA4_Data.EOD_FileCache
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1008
                   Last_Error: Error 'Can't drop database 'benchmark'; database doesn't exist' on query. Default database: 'benchmark'. Query: 'drop database benchmark'
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 594
              Relay_Log_Space: 3099
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 1-4-21377, which is not in the master's binlog'
               Last_SQL_Errno: 1008
               Last_SQL_Error: Error 'Can't drop database 'benchmark'; database doesn't exist' on query. Default database: 'benchmark'. Query: 'drop database benchmark'
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
                   Using_Gtid: Current_Pos
                  Gtid_IO_Pos: 1-4-21377,4-4-6647,3-3-420,2-3-420

(You can ignore the drop database error, as that database no longer exists.)

Hope this gives some more information on the setup and maybe some ideas on how I could go about solving this. Any ideas are more than welcome.

- Johnny Antonsen

Jay Janssen

Jun 25, 2014, 10:52:47 AM
to Johnny Antonsen, codersh...@googlegroups.com
On Jun 25, 2014, at 10:00 AM, Johnny Antonsen <3qua...@gmail.com> wrote:

Yes, sorry, this is MariaDB 10.0.11 on Ubuntu 14.04. The slave connects to the keepalived VIP, and HAProxy decides which node the connection is directed to.

So running CHANGE MASTER TO MASTER_AUTO_POSITION=1 on the slave doesn't seem to work. Maybe because it isn't supported on MariaDB 10.0.11? The slave reports the following in SHOW SLAVE STATUS.

I can't vouch for Maria 10's GTIDs, which I now recall are different from MySQL 5.6's.  I know this works in PXC 5.6; some of the Maria guys will have to speak to this.

However, after running for a little while it makes its next connection to the master (I noticed this by seeing a change in the Master_Server_Id value, which was initially set to 1, 2 and 3 on each node but was later changed to 1 on all nodes per suggestions). When it reconnects, HAProxy redirects it to the next node.

I don't think HAProxy for a slave connection is a good idea, unless Maria's GTID implementation has some vastly different capability than MySQL's.


Hope this gives some more information on the setup and maybe some ideas on how I could go about solving this. Any ideas are more than welcome.

Frankly, my recommended strategy would be to point your slave directly at a single node and, if that node fails, to use MASTER_AUTO_POSITION (or the Maria equivalent) to move it manually.
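
As far as I understand it, the Maria-flavoured equivalent of that re-point would be something like the following (host is a placeholder; again, I can't vouch for how Maria 10 behaves here):

STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST='single-galera-node',
  MASTER_USE_GTID=slave_pos;
START SLAVE;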

The only time I’ve really seen a slave auto failover with a VIP is in a DRBD setup where the binlog is replicated along with everything else on the primary, so the secondary just effectively becomes the primary in every way.   Every other slave failover system I’ve seen does a CHANGE MASTER.  


alexey.y...@galeracluster.com

Jun 25, 2014, 11:00:00 AM
to codersh...@googlegroups.com
Just in case, do you have log_slave_updates=ON ?

On 2014-06-25 16:39, Johnny Antonsen wrote:
> Hi!
>
> I'm working on a setup like this:
>
> <https://lh4.googleusercontent.com/-pg0EvvlttzA/U6rOWvpncOI/AAAAAAAAdUc/7u1C-9beSXg/s1600/ClusterV2.jpg>

Johnny Antonsen

Jun 26, 2014, 8:44:54 AM
to codersh...@googlegroups.com


On Wednesday, 25 June 2014 17:00:00 UTC+2, Alexey Yurchenko wrote:
Just in case, do you have log_slave_updates=ON ?

Yes, log_slave_updates is enabled.

Johnny Antonsen

Jul 4, 2014, 4:14:54 AM
to codersh...@googlegroups.com
Hi!

Is it possible to get any new hints on this? I have tried contacting the MariaDB mailing lists, but so far no response, and the only response I have received here is that it's "probably" not possible. I would really appreciate a look at this from a Galera/MariaDB developer who can confirm my findings.

So far I've found the following:

The problem comes when we want to do standard master/slave replication from the cluster to an external slave. The slave is set up to connect to the VIP (I have also been testing with connecting directly to the HAProxy IP), and Using_gtid is set to Slave_pos.


However, after some time, once the connection changes to a different node through HAProxy, the following error occurs:

Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 3-1-422, which is not in the master's binlog'


And the Slave_IO_State shows that it's no longer in sync.


I have run SELECT @@GLOBAL.gtid_slave_pos; to check the current GTID on each node, and they all return 1-1-2145. However, sometimes when I add a lot of data, that value differs between some of the nodes, which is why I think the slave gets confused.
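
To compare positions I run something along these lines on every node (gtid_binlog_pos and gtid_current_pos are the other MariaDB GTID status variables; gtid_binlog_pos is what the node actually has in its own binlog, which is what a connecting slave asks for):

-- run on each Galera node and compare the results
SELECT @@GLOBAL.gtid_slave_pos, @@GLOBAL.gtid_binlog_pos, @@GLOBAL.gtid_current_pos;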


On the slave, with using_gtid=slave_pos activated, the following Gtid_IO_Pos appears: 1-1-2464,2-3-420,3-1-422


From what I have read, this should be somewhat correct, as the first value is the server id. However, in the config I have specified that node 1 has server id 1, node 2 has server id 2 and so on, and the same goes for gtid_domain_id. Is this the correct setup, or do the nodes need to have the same server-id or gtid_domain_id?


Surely there must be a good way to solve this? Is the system not built to handle an asynchronous slave replicating from one random node?


Hope to hear from someone soon.

erkan yanar

Jul 4, 2014, 5:00:50 AM
to Johnny Antonsen, codersh...@googlegroups.com
Hi Johnny,

I just answered on the MariaDB list.
Anyway, since not everyone reads the MariaDB mailing list:


On Fri, Jul 04, 2014 at 01:14:54AM -0700, Johnny Antonsen wrote:
>
> However, after some time, once the connection changes to a different node
> through haproxy, the following error occurs:
>
> Got fatal error 1236 from master when reading data from binary log: 'Error:
> connecting slave requested to start from GTID 3-1-422, which is not in the
> master's binlog'

So it is not there. As Alexey mentioned, you need log_bin and log_slave_updates enabled.
There is also no reason to use different domain_ids.
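
Roughly speaking, every Galera node wants something like this in its config (the values below are just an illustration):

[mysqld]
log_bin           = mariadb-bin   # binary logging on
log_slave_updates = ON            # also write the replicated/applied transactions to the local binlog
gtid_domain_id    = 1             # the same value on every node is fine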

>
>
> And the Slave_IO_State shows that it's no longer in sync.
>
>
> I have run SELECT @@GLOBAL.gtid_slave_pos; to check what the current GTID
> for each node is, and they all return: 1-1-2145, however, sometimes if I
> add a lot of data, that value is different on some nodes, which is why I
> think the slave gets confused.

What do you mean by node? Galera or the slave?

>
>
> On the slave, when activating using_gtid=slave_pos, the following
> gtid_IO_pos appear: 1-1-2464,2-3-420,3-1-422
>
>
> From what I have read, this should be somewhat correct, as the first value
> is the server id. However, in the config I have specified that node 1 has
> server id 1, node 2 has id 2 and so on, and that the same goes for
> gtid_domain_id. Is this the correct setup or do the nodes need to have the
> same server-id or gtid_domain_id?

You don't need different domain_ids.

>
>
> Surely there must be a good way to solve this? Is the system not built to
> handle an asynchronous slave replicating from one random node?
>

All I can tell you is that I'm using that kind of setup and haven't encountered the problems you did.

Regards
Erkan

--
above the borders, freedom must surely be cloudless

erkan yanar

Jul 28, 2014, 6:43:26 PM
to Johnny Antonsen, codersh...@googlegroups.com

I was wrong.
To be frank, having *one* GTID incremented by every node for every DML/DDL
is somewhat broken in MariaDB.
It would work in an ideal world.
But in an ideal world I most likely wouldn't need Galera *g*

http://linsenraum.de/erkules_int/2014/07/galera-cluster-using-gtid-mysql-vs-mariadb.html