Cluster Crash: HA_ERR_ROW_IS_REFERENCED


Akulatraxas Drak

Jan 30, 2018, 9:22:47 AM
to codership
Hi,

We migrated a series of 100G databases into different Galera instances. It's running fine so far. I have 5-6 years of experience with Galera.
The cluster has 5 nodes, and the load balancer always targets one node for reads and writes; it only switches to the second (third, ...) node when the primary node is down (or reports wsrep_ready off).

Version is the same on all servers:
Server version: 5.6.37-82.2-56-log Percona XtraDB Cluster (GPL), Release rel82.2, Revision 114f2f2, WSREP version 26.21, wsrep_26.21

At night, always at roughly the same time, all 4 non-active nodes drop out of the cluster simultaneously with this error:
2018-01-29 01:26:56 33304 [ERROR] Slave SQL: Could not execute Delete_rows event on table storedb.object; Cannot delete or update a parent row: a foreign key constraint fails (`storedb`.`object`, CONSTRAINT `fk_object_object` FOREIGN KEY (`parentid`) REFERENCES `object` (`objectid`)), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 250, Error_code: 1451
2018-01-29 01:26:56 33304 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 152, 3097067
2018-01-29 01:26:56 33304 [Warning] WSREP: Failed to apply app buffer: seqno: 3097067, status: 1
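
For anyone triaging the same failure, here is a minimal Python sketch (not part of the original report) that pulls the failed event type, table, error code, and write-set seqno out of log lines like the three above, so the same seqno can be compared across the nodes that dropped out. The function name and both regular expressions are assumptions made for illustration; they were only matched against the lines quoted in this post.

```python
import re

# Applier error line: names the row event type, the table and the MySQL error code.
EVENT_RE = re.compile(r"Could not execute (\w+) event on table ([\w.]+);.*Error_code: (\d+)")
# WSREP warning line: names the seqno of the write set that could not be applied.
SEQNO_RE = re.compile(r"WSREP: Failed to apply app buffer: seqno: (\d+), status: (\d+)")

# The three log lines quoted above, verbatim.
SAMPLE = "\n".join([
    "2018-01-29 01:26:56 33304 [ERROR] Slave SQL: Could not execute Delete_rows event "
    "on table storedb.object; Cannot delete or update a parent row: a foreign key "
    "constraint fails (`storedb`.`object`, CONSTRAINT `fk_object_object` FOREIGN KEY "
    "(`parentid`) REFERENCES `object` (`objectid`)), Error_code: 1451; handler error "
    "HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 250, "
    "Error_code: 1451",
    "2018-01-29 01:26:56 33304 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 152, 3097067",
    "2018-01-29 01:26:56 33304 [Warning] WSREP: Failed to apply app buffer: seqno: 3097067, status: 1",
])


def summarize_apply_failures(log_text):
    """Return one dict per failed write set, in the order they appear in the log."""
    failures = []
    pending = None  # details of the most recent applier error not yet tied to a seqno
    for line in log_text.splitlines():
        event = EVENT_RE.search(line)
        if event:
            pending = {
                "event": event.group(1),
                "table": event.group(2),
                "error_code": int(event.group(3)),
            }
            continue
        seqno = SEQNO_RE.search(line)
        if seqno:
            entry = dict(pending or {})
            entry["seqno"] = int(seqno.group(1))
            entry["status"] = int(seqno.group(2))
            failures.append(entry)
            pending = None
    return failures


if __name__ == "__main__":
    for failure in summarize_apply_failures(SAMPLE):
        print(failure)
```

Running this against the error log of each node and comparing the seqno values is one way to confirm that all four nodes failed on the same write set.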


The table looks like this:
| object | CREATE TABLE `object` (
  `objectid` int(11) NOT NULL AUTO_INCREMENT,
  `classid` int(11) NOT NULL,
  `parentid` int(11) DEFAULT NULL,
  `alias` varchar(255) COLLATE utf8_bin NOT NULL,
  `siteid` int(11) NOT NULL,
  PRIMARY KEY (`objectid`),
  UNIQUE KEY `u_object_guid` (`guid`),
  UNIQUE KEY `u_object_alias` (`parentid`,`alias`),
  KEY `i_object_class` (`classid`),
  KEY `i_object_siteid_classid_alias` (`siteid`,`classid`,`alias`),
  CONSTRAINT `fk_object_object` FOREIGN KEY (`parentid`) REFERENCES `object` (`objectid`),
  CONSTRAINT `fk_object_object_1` FOREIGN KEY (`siteid`) REFERENCES `object` (`objectid`)
) ENGINE=InnoDB AUTO_INCREMENT=<verylargenumber> DEFAULT CHARSET=utf8 COLLATE=utf8_bin |


The query that generates the error is *really* simple:
DELETE FROM `storedb`.`object` WHERE objectid='highnumber';

I saved the datadir from the moment before the cluster crashed (I just kept one node out of the cluster after the crash and started it without the wsrep provider, to have a state to play with).
On that copy I can issue the query above with no problem. It deletes the row.
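
A check that may be worth running on such a saved copy is whether any other row still points at the row being deleted, through either `parentid` or `siteid`. The sketch below is only an illustration under stated assumptions: it uses Python's built-in SQLite with a cut-down `object` table and made-up rows rather than the real InnoDB data, and the helper names `make_demo_db` and `referencing_rows` are hypothetical. The same SELECT should carry over to MySQL with the placeholder syntax adjusted.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory stand-in for the `object` table (only three columns kept)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE object ("
        " objectid INTEGER PRIMARY KEY,"
        " parentid INTEGER,"
        " siteid INTEGER NOT NULL)"
    )
    # Made-up rows: 1 is a site root, 2 is a child of 1, 3 is a child of 2.
    conn.executemany(
        "INSERT INTO object (objectid, parentid, siteid) VALUES (?, ?, ?)",
        [(1, None, 1), (2, 1, 1), (3, 2, 1)],
    )
    return conn


def referencing_rows(conn, objectid):
    """Return (objectid, parentid, siteid) for rows that reference `objectid`.

    A row that points only at itself is skipped, on the assumption that it
    does not block its own delete.
    """
    cur = conn.execute(
        "SELECT objectid, parentid, siteid FROM object "
        "WHERE (parentid = ? OR siteid = ?) AND objectid <> ? "
        "ORDER BY objectid",
        (objectid, objectid, objectid),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = make_demo_db()
    for target in (1, 2, 3):
        print(target, referencing_rows(conn, target))
```

An empty result for the failing objectid on the saved copy would be consistent with the delete succeeding there, and would suggest looking at what state the other nodes were in when they tried to apply the same delete.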

A corresponding GRA.log file is generated, but mysqlbinlog complains that it can't read the event:
ERROR: Error in Log_event::read_log_event(): 'Found invalid event in binary log', data_len: 105, event_type: 32


I run weekly consistency checks (Percona pt-table-checksum) and they report no inconsistencies.

Any idea what is happening here?

Greetings
Aku

john danilson

Jan 30, 2018, 1:47:38 PM
to codership
The timing is interesting, but otherwise this is identical to what we were seeing in PXC 5.7.17. It appears to be a bug, as described here:
https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1692745
and supposedly fixed in 5.7.19. I say supposedly because we think we hit it again after the upgrade to 5.7.19; it only happened once, and I'm trying to trap debug information if it happens again.

Hope it's helpful

Akulatraxas Drak

Jan 31, 2018, 6:08:03 AM
to codership
Do you have an idea how to catch more debug information?
At the moment I am running into the problem once per day, and the bug report says I am running an already fixed version.

I can't believe it's not a common thing. To me it looks like something you'd encounter really often.

Akulatraxas Drak

Feb 1, 2018, 6:09:40 AM
to codership

Setting wsrep_slave_threads=1 "fixes" the problem.
No, it's not a real fix, as it slows down the cluster; it's just a good indicator that this is a bug. Should I report it somewhere?
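
One possible reading of why a single applier thread makes the error go away is that the deletes then reach each node strictly in commit order, whereas with several appliers a parent delete might be attempted before the delete of its child has been applied. The sketch below only illustrates that ordering effect and is not a reproduction of the Galera behavior: it uses Python's built-in SQLite with a two-column stand-in for the `object` table, and `make_db` and `apply_deletes` are hypothetical helper names.

```python
import sqlite3


def make_db():
    """In-memory table with the self-referencing foreign key on parentid."""
    # isolation_level=None puts the connection in autocommit mode, so each
    # DELETE is checked against the foreign key on its own.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.execute(
        "CREATE TABLE object ("
        " objectid INTEGER PRIMARY KEY,"
        " parentid INTEGER REFERENCES object(objectid))"
    )
    conn.execute("INSERT INTO object VALUES (1, NULL)")  # parent row
    conn.execute("INSERT INTO object VALUES (2, 1)")     # child row pointing at 1
    return conn


def apply_deletes(order):
    """Delete the given objectids one by one, in the given order, on a fresh table."""
    conn = make_db()
    try:
        for objectid in order:
            conn.execute("DELETE FROM object WHERE objectid = ?", (objectid,))
    except sqlite3.IntegrityError as exc:
        return "failed: {}".format(exc)
    finally:
        conn.close()
    return "ok"


if __name__ == "__main__":
    print("child then parent:", apply_deletes([2, 1]))
    print("parent then child:", apply_deletes([1, 2]))
```

Deleting the child first and then the parent goes through, while the reverse order is rejected by the foreign key, which is the same class of error as the 1451 reported by the appliers.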

john danilson

Feb 1, 2018, 7:44:45 AM
to codership
Yes, we also received that recommendation from Percona support when we had the initial problem under 5.7.17. That's not acceptable to us, as it slows us down too much. Meanwhile, the solution provided was to upgrade to 5.7.19, which we did. But a short time later the same error occurred, and since then I have had wsrep_debug=1 set, but the problem has not occurred again (mainly due to a change in how the application handles deletes). I suspect the bug is still there with some different edge case. We also tried, unsuccessfully, to change the conditions such that the child rows were deleted in one transaction and the parent row in another; at least under 5.7.17 this made no difference and the bug remained.