We migrated a series of 100G Databases into different Galera Instances.Its running fine so far. I do have 5-6 years experience with Galera.
The cluster has 5 Nodes and the Loadbalancers always targets 1 Node for Read/Write and only takes the second(third...) when the primary node is down (or reports wsrep_ready off).
Version is the same on all servers:
Server version: 5.6.37-82.2-56-log Percona XtraDB Cluster (GPL), Release rel82.2, Revision 114f2f2, WSREP version 26.21, wsrep_26.21
In the night, always roughly at the same time, all 4 non-active nodes drop-out of the cluster at the same time with this error:
2018-01-29 01:26:56 33304 [ERROR] Slave SQL: Could not execute Delete_rows event on table storedb.object; Cannot delete or update a parent row: a foreign key constraint fails (`storedb`.`object`, CONSTRAINT `fk_object_object` FOREIGN KEY (`parentid`) REFERENCES `object` (`objectid`)), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 250, Error_code: 1451
2018-01-29 01:26:56 33304 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 152, 3097067
2018-01-29 01:26:56 33304 [Warning] WSREP: Failed to apply app buffer: seqno: 3097067, status: 1
The table is looking like this:
| object | CREATE TABLE `object` (
`objectid` int(11) NOT NULL AUTO_INCREMENT,
`classid` int(11) NOT NULL,
`parentid` int(11) DEFAULT NULL,
`alias` varchar(255) COLLATE utf8_bin NOT NULL,
`siteid` int(11) NOT NULL,
PRIMARY KEY (`objectid`),
UNIQUE KEY `u_object_guid` (`guid`),
UNIQUE KEY `u_object_alias` (`parentid`,`alias`),
KEY `i_object_class` (`classid`),
KEY `i_object_siteid_classid_alias` (`siteid`,`classid`,`alias`),
CONSTRAINT `fk_object_object` FOREIGN KEY (`parentid`) REFERENCES `object` (`objectid`),
CONSTRAINT `fk_object_object_1` FOREIGN KEY (`siteid`) REFERENCES `object` (`objectid`)
) ENGINE=InnoDB AUTO_INCREMENT=<verylargenumber> DEFAULT CHARSET=utf8 COLLATE=utf8_bin |
The Query that is generating the error is *really* easy:
DELETE FROM `storedb`.`object` WHERE objectid='highnumber';
I saved the datadir from the moment before the cluster crashed (just kept one node out of the cluster after the crash and started it without wsrep provider to have a state to play with).
I can issue the query above with no problem. It deletes the row.
A corresponding GRA.log file is generated but mysqbinlog complains that it cant read the event:
ERROR: Error in Log_event::read_log_event(): 'Found invalid event in binary log', data_len: 105, event_type: 32
I am running weekly consistency checks (percona pt-table-checksum) and they report no inconsistency.
Any idea what is happening here?
Greetings
Aku