Entire cluster breaks down when 'Error_code: 1062' occurs.

Galleria

Jun 26, 2014, 1:56:02 PM

to codersh...@googlegroups.com

Hello all. We are running into a scenario where our entire 3-node cluster becomes unusable. We don't use multi-master: host 01 takes all writes, while all three nodes are intended for reads. The problem starts when hosts 02 and 03 log the following:

[ERROR] Slave SQL: Could not execute Write_rows event on table mydb.mytbl; Duplicate entry '0' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 35362, Error_code: 1062
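For anyone scanning the error logs for this condition, here is a minimal Python sketch that pulls the table, duplicate value, key name and error code out of a line like the one above. The regular expression and the function name `parse_dup_key_error` are illustrative assumptions based only on this one log line, not on any documented log format.

```python
import re

# Pattern written against the single log line quoted in this thread (an assumption).
DUP_KEY_RE = re.compile(
    r"Could not execute (?P<event>\w+) event on table (?P<table>[\w.]+); "
    r"Duplicate entry '(?P<entry>[^']*)' for key '(?P<key>[^']*)', "
    r"Error_code: (?P<code>\d+)"
)

def parse_dup_key_error(line):
    """Return event type, table, duplicate value, key name and error code, or None."""
    m = DUP_KEY_RE.search(line)
    if m is None:
        return None
    info = m.groupdict()
    info["code"] = int(info["code"])
    return info

line = (
    "[ERROR] Slave SQL: Could not execute Write_rows event on table mydb.mytbl; "
    "Duplicate entry '0' for key 'PRIMARY', Error_code: 1062; "
    "handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, "
    "end_log_pos 35362, Error_code: 1062"
)
print(parse_dup_key_error(line))
```
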

These two instances then terminate, while the 'master' keeps running but logs 'Received NON-PRIMARY', after which any query against it returns 'unknown command', possibly due to losing quorum.
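The quorum suspicion can be checked with simple arithmetic. The sketch below is a simplification that assumes all nodes have equal weight and that the departing nodes abort rather than leave gracefully; `has_quorum` is a hypothetical helper, not a Galera API.

```python
def has_quorum(reachable_nodes, previous_component_size):
    """True if the reachable nodes are a strict majority of the previous component."""
    return 2 * reachable_nodes > previous_component_size

# 01 left alone after 02 and 03 abort at the same time.
print(has_quorum(1, 3))
# One node lost out of three: the remaining two still hold a majority.
print(has_quorum(2, 3))
# Second node lost afterwards: one of two is not a strict majority.
print(has_quorum(1, 2))
```

Under these assumptions 01 ends up without a majority whether 02 and 03 drop together or one after the other, which would match the NON-PRIMARY state described above.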

We're then forced to restart-bootstrap 01 prior to starting the fallen comrades. We're running the following on CentOS:

Percona-XtraDB-Cluster-server-56-5.6.15-25.5.759.rhel6.x86_64
Percona-XtraDB-Cluster-galera-2-2.8-1.157.rhel6.x86_64

The pertinent config file and detailed logs are attached. Any suggestions to deal with this scenario will be greatly appreciated. Thanks!

my.cnf

01-error.log

02-error.log

03-error.log

Daniel Black

Jun 26, 2014, 5:51:19 PM

to Galleria, codersh...@googlegroups.com

> [ERROR] Slave SQL: Could not execute Write_rows event on table mydb.mytbl;
> Duplicate entry '0' for key 'PRIMARY', Error_code: 1062;
> handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST,
> end_log_pos 35362, Error_code: 1062

It looks like 01 is inserting a row into mydb.mytbl while 02/03 already have a row with 0 as the primary key (which must be unique).

There wasn't a 0 primary key entry in the table on 01.

So you have two different database contents, or perhaps table definitions, between the 01 and 02/03 machines.
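One way to confirm this is to collect the same value from every node, for example the output of `CHECKSUM TABLE mydb.mytbl` or `SHOW CREATE TABLE mydb.mytbl`, and see which nodes agree. A minimal Python sketch of the comparison step follows; the function name and the checksum numbers are made-up placeholders, and how the values are fetched from each node is left out.

```python
from collections import defaultdict

def group_nodes_by_value(per_node):
    """Group node names by the value each node reported."""
    groups = defaultdict(list)
    for node, value in per_node.items():
        groups[value].append(node)
    return {value: sorted(nodes) for value, nodes in groups.items()}

# Placeholder checksums for illustration only, not taken from the attached logs.
reported = {"01": 1111, "02": 2222, "03": 2222}
groups = group_nodes_by_value(reported)
for value, nodes in groups.items():
    print(value, nodes)
if len(groups) > 1:
    print("nodes disagree on this table")
```

More than one group means the nodes have diverged on that table, either in content or in definition depending on which value was collected.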

--
Daniel Black, Engineer @ Open Query (http://openquery.com.au)
Remote expertise & maintenance for MySQL/MariaDB server environments.

Galleria

Jun 26, 2014, 6:25:23 PM

to codersh...@googlegroups.com, ba...@axelabs.com, daniel...@openquery.com

From what our development team says, it seems there is a transaction on the master which gets rolled back when that duplicate is encountered but somehow it is still being replicated to the other cluster members. We are looking at this bug possibly being related:

https://bugs.launchpad.net/codership-mysql/+bug/1299116

The shutdown of the write master was due to https://bugs.launchpad.net/galera/+bug/1217225, which we got around by updating to Percona-XtraDB-Cluster-galera-2-2.10-1.188.rhel6.x86_64.
