Skip failing statement in galera multi-master: Duplicate Column, wsrep_local_state: 2


Chris

Aug 14, 2015, 8:41:28 AM8/14/15
to codership

After a schema change during which I forgot to take care of replication, I now have one server that no longer syncs: it tries to alter a table that has already been changed (Servers 1 and 2 have the additional column and are fine; Server 3 tries to add it again, errors out, and is therefore stuck in wsrep_local_state: 2).


The error log is quite clear about it:

150728 15:19:29 [ERROR] Slave SQL: Error 'Duplicate column name 'somecolumn'' on query. Default database: 'my_db'. Query: 'ALTER TABLE tablename ADD somecolumn DATETIME NULL', Internal MariaDB error code: 1060
150728 15:19:29 [Warning] WSREP: RBR event 1 Query apply warning: 1, 195651391
150728 15:19:29 [Warning] WSREP: Ignoring error for TO isolated action: source: f7715036-20ca-11e5-a90f-f23b6463dd16 version: 3 local: 0 state: APPLYING flags: 65 conn_id: 2348812 trx_id: -1 seqnos (l: 14350890, g: 195651391, s: 195651390, d: 195651390, ts: 20916266579392088)
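When matching a failed event against wsrep_last_committed, the seqnos in the WSREP warning can be pulled out programmatically. A minimal Python sketch; the regex only covers the "seqnos (l: ..., g: ...)" format of the line above, so treat it as illustrative rather than a general log parser:

```python
import re

# The warning line as it appears in the error log above.
LOG_LINE = (
    "150728 15:19:29 [Warning] WSREP: Ignoring error for TO isolated action: "
    "source: f7715036-20ca-11e5-a90f-f23b6463dd16 version: 3 local: 0 "
    "state: APPLYING flags: 65 conn_id: 2348812 trx_id: -1 "
    "seqnos (l: 14350890, g: 195651391, s: 195651390, d: 195651390, "
    "ts: 20916266579392088)"
)

def parse_seqnos(line):
    """Extract the local (l) and global (g) seqnos from a WSREP log line."""
    m = re.search(r"seqnos \(l: (\d+), g: (\d+)", line)
    if not m:
        return None
    return {"local": int(m.group(1)), "global": int(m.group(2))}

seqnos = parse_seqnos(LOG_LINE)
print(seqnos)  # {'local': 14350890, 'global': 195651391}
```

The global seqno (g: 195651391) is the cluster-wide position of the failed event, which can be compared against wsrep_last_committed on each node.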


As this transaction can obviously be skipped, I would like to do so - but I can't find anywhere how to. I only found instructions for a master-slave setup using "SQL_SLAVE_SKIP_COUNTER=1", which won't work on a Galera multi-master.


Thanks in advance,

Regards Chris

Philip Stoev

Aug 14, 2015, 8:53:58 AM8/14/15
to Chris, codersh...@googlegroups.com
Hello,

DDL errors during replication are ignored by Galera, so replication should be
able to continue even after such errors. Hence the log entry that says
"Ignoring error".

Why do you think the node is no longer syncing with the cluster? Can you
provide the output of SHOW STATUS LIKE 'wsrep%' and all entries from the
error log?

Philip Stoev

Chris

Aug 14, 2015, 9:17:32 AM8/14/15
to codership, ch...@feel-free.co.nz
Hi Philip,
the node stays in Donor mode, and the Donor role bounces to the next node whenever I restart the service.
wsrep% variables of the node:
+------------------------------+--------------------------------------+
| Variable_name                | Value                                |
+------------------------------+--------------------------------------+
| wsrep_local_state_uuid       | a7033a86-5474-11e4-a149-e72b659df522 |
| wsrep_protocol_version       | 5                                    |
| wsrep_last_committed         | 210441502                            |
| wsrep_replicated             | 4834425                              |
| wsrep_replicated_bytes       | 3471989787                           |
| wsrep_repl_keys              | 14506291                             |
| wsrep_repl_keys_bytes        | 227242154                            |
| wsrep_repl_data_bytes        | 2935344433                           |
| wsrep_repl_other_bytes       | 0                                    |
| wsrep_received               | 24424048                             |
| wsrep_received_bytes         | 17397462999                          |
| wsrep_local_commits          | 4834425                              |
| wsrep_local_cert_failures    | 0                                    |
| wsrep_local_replays          | 0                                    |
| wsrep_local_send_queue       | 0                                    |
| wsrep_local_send_queue_avg   | 0.000002                             |
| wsrep_local_recv_queue       | 0                                    |
| wsrep_local_recv_queue_avg   | 0.026222                             |
| wsrep_local_cached_downto    | 210262815                            |
| wsrep_flow_control_paused_ns | 60398791947                          |
| wsrep_flow_control_paused    | 0.000016                             |
| wsrep_flow_control_sent      | 8                                    |
| wsrep_flow_control_recv      | 1570                                 |
| wsrep_cert_deps_distance     | 64.825815                            |
| wsrep_apply_oooe             | 0.000265                             |
| wsrep_apply_oool             | 0.000089                             |
| wsrep_apply_window           | 1.000357                             |
| wsrep_commit_oooe            | 0.000000                             |
| wsrep_commit_oool            | 0.000000                             |
| wsrep_commit_window          | 1.000087                             |
| wsrep_local_state            | 2                                    |
| wsrep_local_state_comment    | Donor/Desynced                       |
| wsrep_cert_index_size        | 66                                   |
| wsrep_causal_reads           | 0                                    |
| wsrep_cert_interval          | 0.002410                             |
| wsrep_incoming_addresses     | p-db03:3306,p-db02:3306,p-db01:3306  |
| wsrep_cluster_conf_id        | 402                                  |
| wsrep_cluster_size           | 3                                    |
| wsrep_cluster_state_uuid     | a7033a86-5474-11e4-a149-e72b659df522 |
| wsrep_cluster_status         | Primary                              |
| wsrep_connected              | ON                                   |
| wsrep_local_bf_aborts        | 0                                    |
| wsrep_local_index            | 0                                    |
| wsrep_provider_name          | Galera                               |
| wsrep_provider_vendor        | Codership Oy <info_at_codership.com> |
| wsrep_provider_version       | 25.3.5(rXXXX)                        |
| wsrep_ready                  | ON                                   |
| wsrep_thread_count           | 2                                    |
+------------------------------+--------------------------------------+

Logs are as described:
150702 17:00:52 [Note] WSREP: Tables flushed.
150702 17:01:01 [Note] WSREP: resuming provider at 26716
150702 17:01:01 [Note] WSREP: Provider resumed.
150702 17:01:18 [Note] WSREP: 2.0 (p-db01): State transfer from 0.0 (p-db03) complete.
150702 17:01:18 [Note] WSREP: Member 2.0 (p-db01) synced with group.
150728 15:19:29 [ERROR] Slave SQL: Error 'Duplicate column name 'somecolumn'' on query. Default database: 'my_db'. Query: 'ALTER TABLE sometable ADD somecolumn DATETIME NULL', Internal MariaDB error code: 1060

150728 15:19:29 [Warning] WSREP: RBR event 1 Query apply warning: 1, 195651391
150728 15:19:29 [Warning] WSREP: Ignoring error for TO isolated action: source: f7715036-20ca-11e5-a90f-f23b6463dd16 version: 3 local: 0 state: APPLYING flags: 65 conn_id: 2348812 trx_id: -1 seqnos (l: 14350890, g: 195651391, s: 195651390, d: 195651390, ts: 20916266579392088)
150812 18:21:04 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--log-basename=#' or '--relay-log=p-db03-relay-bin' to avoid this problem.


Thanks in advance,
Regards Chris

Philip Stoev

Aug 14, 2015, 9:40:32 AM8/14/15
to Chris, codersh...@googlegroups.com
The fact that this node is in Donor status means that some other node is
attempting to join or rejoin the cluster, and that node's
wsrep_local_state_comment has a value of "Joining". If you can find that
node, its error log may tell you if there is a problem.

Chris

Aug 14, 2015, 9:58:58 AM8/14/15
to codership, ch...@feel-free.co.nz
Both of the other nodes are in state 4 (Synced).
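For anyone following along, the numeric wsrep_local_state codes map to the state comments (as seen in wsrep_local_state_comment) like this. A small Python sketch of the lookup, with values per the Galera status variable documentation:

```python
# wsrep_local_state numeric codes and the corresponding
# wsrep_local_state_comment values, per the Galera documentation.
WSREP_STATES = {
    1: "Joining",
    2: "Donor/Desynced",
    3: "Joined",
    4: "Synced",
}

def describe_state(code):
    """Translate a wsrep_local_state code into its human-readable comment."""
    return WSREP_STATES.get(code, "Unknown")

print(describe_state(2))  # Donor/Desynced  (the stuck node in this thread)
print(describe_state(4))  # Synced         (the two healthy nodes)
```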

Here's the config of the node:
[mariadb-10.0]

binlog_format=ROW
default-storage-engine=innodb
query_cache_size=0
query_cache_type=0
bind-address=0.0.0.0
datadir=/var/lib/mysql
# INNODB settings
innodb_autoinc_lock_mode=2
innodb_locks_unsafe_for_binlog=1
innodb_buffer_pool_size=5591M
innodb_additional_mem_pool_size=20M
innodb_log_file_size=512M
innodb_log_buffer_size=64M
innodb_file_per_table
innodb_flush_log_at_trx_commit=2
innodb_buffer_pool_instances=2
innodb_log_files_in_group=2
innodb_thread_concurrency=0
innodb_flush_method = O_DIRECT
## avoid statistics update when doing e.g show tables
innodb_stats_on_metadata=0
# WSREP stuff
server_id=3
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_provider_options="gcache.size=128M"
wsrep_cluster_address="gcomm://p-db01,p-db02,p-db03"
wsrep_cluster_name='prod_cluster'
wsrep_node_address='p-db03'
wsrep_node_name='p-db03'
wsrep_sst_method=rsync
wsrep_sst_auth=sst_user:PASSWORD
# Generate fake primary keys for non-PK tables (required for multi-master
# and parallel applying operation)
wsrep_certify_nonPK=1
wsrep_max_ws_rows=131072
wsrep_max_ws_size=1073741824
# to enable debug level logging, set this to 1
wsrep_debug=0

I'm not certain why it wouldn't come back to Synced on its own - or would I have to "force" it into thinking it's OK to ignore this error?

Cheers,
Chris

alexey.y...@galeracluster.com

Aug 18, 2015, 5:01:11 AM8/18/15
to Chris, codership
On 2015-08-14 15:24, Chris wrote:
> After a Schema-Change on which I have missed out to take care about
> replication, I now have one Server not synching anymore, as it tries to
> alter the table which already has been taken care of (Server 1&2 have
> the
> additional column and are fine, Server 3 tries to insert it again -
> errors
> and therewith is in *wsrep_local_state: 2*

Server 3 does not try to do that - it has no brains for that. A cluster
client tries to do that, or a replication event from an async master does
(which can be skipped as usual). So look elsewhere.

Also, errors don't send the server to state 2. Either a joiner or a client
request does that.

> The error log is quite clear with that:
>
> 150728 15:19:29 [ERROR] Slave SQL: Error 'Duplicate column name
> 'somecolumn'' on query. Default database: 'my_db'. Query: 'ALTER TABLE
> tablename ADD somecolumn DATETIME NULL', Internal MariaDB error code:
> 1060

This error clearly states that this node already has this column. Is
this message from Server 3?

Chris

Aug 18, 2015, 10:42:02 AM8/18/15
to codership, ch...@feel-free.co.nz
The issue is, generally, resolved.
I restarted the other servers to make sure there were no error messages, then restarted "Server 3" (the one not syncing).


> and therewith is in *wsrep_local_state: 2*

Server 3 does not try to do that - it has no brains for that. A cluster
client tries to do that. Or replication event from async master (which
can be skipped as usual). So look elsewhere.

Also, errors don't send the server to state 2. Either joiner or client
request does that.
...this must have happened during re-synchronisation of the servers.

> The error log is quite clear with that:
>
> 150728 15:19:29 [ERROR] Slave SQL: Error 'Duplicate column name
> 'somecolumn'' on query. Default database: 'my_db'. Query: 'ALTER TABLE
> tablename ADD somecolumn DATETIME NULL', Internal MariaDB error code:
> 1060

This error clearly states that this node already has this column. Is
this message from Server 3?

Yes, this is from Server 3. As stated, the error doesn't come up anymore after restarting all of the servers, one by one.
I hope this error won't show up again - I can't keep restarting nodes :)

Is there any other documentation on what to do in similar cases (duplicate IDs etc)?

Thanks in advance for the comments,
Regards Chris

alexey.y...@galeracluster.com

Aug 18, 2015, 11:03:38 AM8/18/15
to Chris, codership
On 2015-08-18 17:42, Chris wrote:
> Is there any other documentation on what to do in similar cases
> (duplicate
> IDs etc)?

There are two types of such errors:
1) those that happen with DDLs - they may happen consistently on all nodes
and are therefore only logged and ignored;
2) those that happen with transactions - these indicate an inconsistency
between the "master" node where the transaction originates and the "slave"
node. In that case the affected node is aborted. It is up to the user to
resolve the inconsistency (choose the node that has the "more correct"
data). This may be caused by either a bug or a human error.

There is no "skipping" of events and such.