WAN implementation

Jorge Oliveira

Oct 16, 2017, 8:10:22 AM
to codership
We have a 5-node cluster: two datacenters with 2 nodes each and one datacenter with a single node.

Sometimes an insert on one node takes down all the other nodes with a "Node consistency compromized" error, bringing down the cluster.
Looking at the log, we can see this error:

[ERROR] Slave SQL: Could not execute Write_rows event on table main.documents; Duplicate entry '3729-3882600-01P2017-17040' for key 'sequence', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 455, Error_code: 1062

It looks as if nodes were accepting the same statement more than once (from different nodes) in the milliseconds between certification and commit, so that applying the second copy of the transaction raised the error.

We've already changed the server configuration as recommended: defining segments, tuning the network settings, using explicit transactions instead of autocommitted statements...

[server]

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_slow_start_after_idle = 0

[my.cnf]

innodb_flush_log_at_trx_commit = 2
binlog_row_image = minimal
wsrep_sync_wait = 2

wsrep_provider_options = '
gmcast.segment=1|2|3;
gcache.size=1G;
evs.user_send_window=256;
evs.send_window=512;
evs.suspect_timeout=PT15S;
evs.inactive_timeout=PT45S;
gcs.max_packet_size=1048576'
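
The `gmcast.segment=1|2|3` line above reads as shorthand: each node is given a single segment number, and nodes in the same datacenter share it. A minimal sketch of rendering the per-node option string, assuming hypothetical datacenter names `dc1`, `dc2`, `dc3` and a hypothetical helper `provider_options` (neither is from the actual setup):

```python
# Hypothetical datacenter-to-segment mapping; the names are placeholders.
SEGMENTS = {"dc1": 1, "dc2": 2, "dc3": 3}

# Options shared by every node, copied from the configuration above.
COMMON_OPTIONS = {
    "gcache.size": "1G",
    "evs.user_send_window": "256",
    "evs.send_window": "512",
    "evs.suspect_timeout": "PT15S",
    "evs.inactive_timeout": "PT45S",
    "gcs.max_packet_size": "1048576",
}


def provider_options(datacenter: str) -> str:
    """Build the wsrep_provider_options value for a node in `datacenter`."""
    # The segment comes first, followed by the shared options in order.
    options = {"gmcast.segment": str(SEGMENTS[datacenter])}
    options.update(COMMON_OPTIONS)
    return ";".join(f"{key}={value}" for key, value in options.items())


if __name__ == "__main__":
    for dc in SEGMENTS:
        print(f"{dc}: wsrep_provider_options = '{provider_options(dc)}'")
```

Each node would then carry the string for its own datacenter in its my.cnf.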

--

Despite that, we keep getting this kind of error...

Any thoughts?

Thank you.

alexey.y...@galeracluster.com

Oct 30, 2017, 7:45:28 AM
to Jorge Oliveira, codership
Hi,

It is very unlikely that this is an issue with the replication layer
(i.e. WAN/segments). If that were the case, Galera would be widely
unusable: since it does not know what it actually replicates, you would
see such behaviour in every cluster out there.

Much more likely it is a genuine DB inconsistency. If there are any
foreign keys involved, you're advised to upgrade to the latest release.

Regards,
Alex

On 2017-10-16 12:27, Jorge Oliveira wrote:
> We have a 5 nodes cluster: 2 datacenters with 2 nodes each and 1
> datacenter
> with only 1 node.
>
> Sometimes an insert in a node would take down all nodes (except itself)
> with a "Node consistency compromized", bringing down the cluster.
> Looking at the log we could see the error:
>
> [ERROR] Slave SQL: Could not execute Write_rows event on table
> main.documents; Duplicate entry '3729-3882600-01P2017-17040' for key
> 'sequence', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the
> event's master log FIRST, end_log_pos 455, Error_code: 1062
>
> It seemed like nodes were accepting the same statement more than once
> (from
> different nodes) in that ms between certification and commit, and when
> applying the transaction by the second, it would raise the error.

No. It is just that for some reason the master node for that transaction
didn't have that row while the rest of the nodes did.
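
One way to check for this kind of divergence is to collect the values of the unique key (here `sequence`) from each node and compare the sets. A minimal sketch, assuming the key values have already been fetched per node by some other means; the function name `missing_by_node` and the node names are hypothetical:

```python
def missing_by_node(keys_per_node):
    """Return, for each node, the key values it lacks that exist on any other node.

    `keys_per_node` maps a node name to the set of unique-key values found on it.
    Nodes that hold every known key are left out of the result.
    """
    # Union of every key value seen on any node.
    all_keys = set().union(*keys_per_node.values())
    return {
        node: sorted(all_keys - keys)
        for node, keys in keys_per_node.items()
        if all_keys - keys
    }


if __name__ == "__main__":
    # Made-up snapshot: node1 lacks a row that the other nodes already have.
    snapshot = {
        "node1": {"3729-3882600-01P2017-17039"},
        "node2": {"3729-3882600-01P2017-17039", "3729-3882600-01P2017-17040"},
        "node3": {"3729-3882600-01P2017-17039", "3729-3882600-01P2017-17040"},
    }
    print(missing_by_node(snapshot))
```

A node reported here as missing a row would accept a local insert of that key, and the resulting write set would then fail with a duplicate-key error on the nodes that already hold the row.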