Repeated Galera Node Crash - Error_code: 1610


Rupert Perry

Aug 16, 2013, 7:20:24 AM
to codersh...@googlegroups.com
Hi Group,

We're running a 3-node Galera cluster (MariaDB Galera Cluster on Debian) with Percona XtraBackup as our SST method, and we are seeing repeated crashes on one node that reference error code 1610. That single node has crashed 11 times in the last 20 days or so. Each time, the MySQL error log reports a failure that always starts with Error_code 1610 and looks like this:

130815 14:57:14 [ERROR] Slave SQL: Could not read field 'updated' of table 'sbld_client.base_mark', Error_code: 1610
130815 14:57:14 [ERROR] Slave SQL: Could not read field 'updated' of table 'sbld_client.base_mark', Error_code: 1610
130815 14:57:14 [ERROR] Slave SQL: Could not execute Delete_rows event on table sbld_client.base_mark; Got error 1610 from storage engine, Error_code: 1030; handler error No Error!; the event's master log FIRST, end_log_pos 235051, Error_code: 1030
130815 14:57:14 [Warning] WSREP: RBR event 294 Delete_rows apply warning: 1610, 113702148
130815 14:57:14 [ERROR] WSREP: Failed to apply trx: source: c48d5c22-0328-11e3-0800-744acba70669 version: 2 local: 0 state: APPLYING flags: 1 conn_id: 446646 trx_id: 4827106792 seqnos (l: 321761, g: 113702148, s: 113702144, d: 113702129, ts: 1376575034617112759)
130815 14:57:14 [ERROR] WSREP: Failed to apply app buffer: seqno: 113702148, status: WSREP_FATAL
   at galera/src/replicator_smm.cpp:apply_wscoll():53
   at galera/src/replicator_smm.cpp:apply_trx_ws():120
130815 14:57:14 [ERROR] WSREP: Node consistency compromized, aborting...
130815 14:57:14 [Note] WSREP: Closing send monitor...
130815 14:57:14 [Note] WSREP: Closed send monitor.
130815 14:57:14 [Note] WSREP: gcomm: terminating thread
130815 14:57:14 [Note] WSREP: gcomm: joining thread
130815 14:57:14 [Note] WSREP: gcomm: closing backend

or this:

130812  9:57:06 [ERROR] Slave SQL: Could not read field 'updated' of table 'sbld_client.item_billing', Error_code: 1610
130812  9:57:06 [ERROR] Slave SQL: Could not read field 'updated' of table 'sbld_client.item_billing', Error_code: 1610
130812  9:57:06 [ERROR] mysqld: Can't find record in 'item_billing'
130812  9:57:06 [ERROR] Slave SQL: Could not execute Delete_rows event on table sbld_client.item_billing; Can't find record in 'item_billing', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 1390408, Error_code: 1032
130812  9:57:06 [Warning] WSREP: RBR event 1720 Delete_rows apply warning: 120, 107723031
130812  9:57:06 [ERROR] WSREP: Failed to apply trx: source: c48d5c22-0328-11e3-0800-744acba70669 version: 2 local: 0 state: APPLYING flags: 1 conn_id: 2576 trx_id: 4731916633 seqnos (l: 3521110, g: 107723031, s: 107723027, d: 107722998, ts: 1376297823995289349)
130812  9:57:06 [ERROR] WSREP: Failed to apply app buffer: seqno: 107723031, status: WSREP_FATAL
   at galera/src/replicator_smm.cpp:apply_wscoll():53
   at galera/src/replicator_smm.cpp:apply_trx_ws():120
130812  9:57:06 [ERROR] WSREP: Node consistency compromized, aborting...
130812  9:57:06 [Note] WSREP: Closing send monitor...
130812  9:57:06 [Note] WSREP: Closed send monitor.
130812  9:57:06 [Warning] WSREP: TO isolation failed for: 3, sql: /* loadreference(  ) */ -- TABLE NAME:link_confirmed
-- generated by pushClientData.pl on 20130812085531
-- select link_confirmed.* from sbl_core.link_confirmed WHERE 1
TRUNCATE TABLE sbl_core.link_confirmed. Check wsrep connection state and retry the query.
130812  9:57:06 [Note] WSREP: gcomm: terminating thread
130812  9:57:06 [Note] WSREP: gcomm: joining thread
130812  9:57:06 [Note] WSREP: gcomm: closing backend
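A rough way to summarize crashes like the two above across a whole error log is sketched below. It is only an illustration: the regular expressions assume the exact line shapes seen in these excerpts (including the log's own spelling "compromized"), and the names `FIELD_ERROR`, `ABORT`, and `summarize` are made up for this example.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpts above:
# 130815 14:57:14 [ERROR] Slave SQL: Could not read field 'updated'
#   of table 'sbld_client.base_mark', Error_code: 1610
FIELD_ERROR = re.compile(
    r"^(?P<date>\d{6})\s+(?P<time>\d{1,2}:\d{2}:\d{2}) \[ERROR\] Slave SQL: "
    r"Could not read field '(?P<field>[^']+)' of table '(?P<table>[^']+)', "
    r"Error_code: (?P<code>\d+)"
)

# The line the node logs just before it aborts (spelling as in the log).
ABORT = re.compile(
    r"^(?P<date>\d{6})\s+(?P<time>\d{1,2}:\d{2}:\d{2}) \[ERROR\] WSREP: "
    r"Node consistency compromized"
)


def summarize(lines):
    """Return (per_column, aborts) for an iterable of error-log lines.

    per_column counts incidents per (table, field, error code); lines that
    are repeated verbatim with the same timestamp are counted once.
    aborts lists the (date, time) strings of every abort line found.
    """
    seen = set()
    per_column = Counter()
    aborts = []
    for line in lines:
        line = line.rstrip("\n")
        m = FIELD_ERROR.match(line)
        if m:
            column = (m["table"], m["field"], int(m["code"]))
            key = (m["date"], m["time"]) + column
            if key not in seen:
                seen.add(key)
                per_column[column] += 1
            continue
        m = ABORT.match(line)
        if m:
            aborts.append((m["date"], m["time"]))
    return per_column, aborts
```

Feeding it the open log file (for example `summarize(open(path))`, where `path` is wherever the node writes its error log) gives a quick view of whether the failures always involve the same tables and columns.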


We are running MariaDB server version "5.5.29-MariaDB-mariadb1~wheezy-log" and wsrep provider version "23.2.4(r147)".

I think this is the same issue that was reported to MariaDB as MDEV-4404, which is currently unresolved. See here:

  https://mariadb.atlassian.net/browse/MDEV-4404

Can anyone suggest what might be wrong or what other information I can supply to help fix the problem?

Thanks,

Rupert.

Alex Yurchenko

Aug 18, 2013, 9:58:56 AM
to codersh...@googlegroups.com
Hi,

The symptoms (persistent failures on one particular node) pretty much
rule out a Galera bug.

1) I'd first look into how that node is different from the others.

2) Specifically, I'd check whether another SST method makes a difference.

3) I'd also check the datadir filesystem for errors...
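
For point 1, one possible starting place is to compare server settings between the nodes. The sketch below is only an illustration under stated assumptions: it expects each node's settings to have already been collected into a plain dictionary (for instance from the output of `SHOW GLOBAL VARIABLES` on each node), and the function name `diff_settings`, the node labels, and the default ignore list are hypothetical choices made for this example.

```python
def diff_settings(nodes, ignore=("hostname", "wsrep_node_name", "wsrep_node_address")):
    """Return the settings whose values are not identical on every node.

    nodes  -- dict mapping a node label to a dict of {variable name: value}
    ignore -- variables expected to differ per node (an assumption; adjust
              to your setup)

    The result maps each differing variable to {node label: value}, with
    None for nodes where the variable is missing.
    """
    names = set()
    for variables in nodes.values():
        names.update(variables)

    differing = {}
    for name in sorted(names - set(ignore)):
        values = {node: variables.get(name) for node, variables in nodes.items()}
        if len(set(values.values())) > 1:
            differing[name] = values
    return differing
```

An empty result would mean the compared settings match, in which case the difference is more likely to be in the data, hardware, or filesystem than in the configuration.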

Regards,
Alex
--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011