Delete_rows_log_event::find_row() on node2 preventing writes on node1

Freejack

Oct 20, 2012, 12:04:50 PM10/20/12
to codersh...@googlegroups.com
Hi all.  Just deployed a second galera cluster to our production environment and we're seeing an issue where replicating a large delete query to the secondary node will prevent writes on the first node.  node2 seems to be stuck running Delete_rows_log_event::find_row(137577) while the write queries are piling up on node1 to the point where all connections are used up and the db is no longer accessible.

Based on another similar thread, this may be due to my InnoDB settings, although I can't rule out Galera at this point.  Please see below.  Any suggestions/tips would be hugely appreciated!  Thanks!

+-------------------------------------------+------------------------+
| Variable_name                             | Value                  |
+-------------------------------------------+------------------------+
| have_innodb                               | YES                    |
| ignore_builtin_innodb                     | OFF                    |
| innodb_adaptive_flushing                  | ON                     |
| innodb_adaptive_flushing_method           | estimate               |
| innodb_adaptive_hash_index                | ON                     |
| innodb_adaptive_hash_index_partitions     | 1                      |
| innodb_additional_mem_pool_size           | 8388608                |
| innodb_autoextend_increment               | 8                      |
| innodb_autoinc_lock_mode                  | 2                      |
| innodb_blocking_buffer_pool_restore       | OFF                    |
| innodb_buffer_pool_instances              | 1                      |
| innodb_buffer_pool_restore_at_startup     | 0                      |
| innodb_buffer_pool_shm_checksum           | ON                     |
| innodb_buffer_pool_shm_key                | 0                      |
| innodb_buffer_pool_size                   | 134217728              |
| innodb_change_buffering                   | all                    |
| innodb_checkpoint_age_target              | 0                      |
| innodb_checksums                          | ON                     |
| innodb_commit_concurrency                 | 0                      |
| innodb_concurrency_tickets                | 500                    |
| innodb_corrupt_table_action               | assert                 |
| innodb_data_file_path                     | ibdata1:10M:autoextend |
| innodb_data_home_dir                      |                        |
| innodb_dict_size_limit                    | 0                      |
| innodb_disallow_writes                    | OFF                    |
| innodb_doublewrite                        | ON                     |
| innodb_doublewrite_file                   |                        |
| innodb_fake_changes                       | OFF                    |
| innodb_fast_checksum                      | OFF                    |
| innodb_fast_shutdown                      | 1                      |
| innodb_file_format                        | Barracuda              |
| innodb_file_format_check                  | ON                     |
| innodb_file_format_max                    | Barracuda              |
| innodb_file_per_table                     | ON                     |
| innodb_flush_log_at_trx_commit            | 1                      |
| innodb_flush_method                       |                        |
| innodb_flush_neighbor_pages               | area                   |
| innodb_force_load_corrupted               | OFF                    |
| innodb_force_recovery                     | 0                      |
| innodb_ibuf_accel_rate                    | 100                    |
| innodb_ibuf_active_contract               | 1                      |
| innodb_ibuf_max_size                      | 67092480               |
| innodb_import_table_from_xtrabackup       | 0                      |
| innodb_io_capacity                        | 5000                   |
| innodb_kill_idle_transaction              | 0                      |
| innodb_large_prefix                       | OFF                    |
| innodb_lazy_drop_table                    | 0                      |
| innodb_lock_wait_timeout                  | 50                     |
| innodb_locks_unsafe_for_binlog            | ON                     |
| innodb_log_block_size                     | 512                    |
| innodb_log_buffer_size                    | 8388608                |
| innodb_log_file_size                      | 5242880                |
| innodb_log_files_in_group                 | 2                      |
| innodb_log_group_home_dir                 | ./                     |
| innodb_max_dirty_pages_pct                | 75                     |
| innodb_max_purge_lag                      | 0                      |
| innodb_merge_sort_block_size              | 1048576                |
| innodb_mirrored_log_groups                | 1                      |
| innodb_old_blocks_pct                     | 37                     |
| innodb_old_blocks_time                    | 0                      |
| innodb_open_files                         | 300                    |
| innodb_page_size                          | 16384                  |
| innodb_purge_batch_size                   | 20                     |
| innodb_purge_threads                      | 1                      |
| innodb_random_read_ahead                  | OFF                    |
| innodb_read_ahead                         | linear                 |
| innodb_read_ahead_threshold               | 56                     |
| innodb_read_io_threads                    | 64                     |
| innodb_recovery_stats                     | OFF                    |
| innodb_recovery_update_relay_log          | OFF                    |
| innodb_replication_delay                  | 0                      |
| innodb_rollback_on_timeout                | OFF                    |
| innodb_rollback_segments                  | 128                    |
| innodb_show_locks_held                    | 10                     |
| innodb_show_verbose_locks                 | 0                      |
| innodb_spin_wait_delay                    | 6                      |
| innodb_stats_auto_update                  | 1                      |
| innodb_stats_method                       | nulls_equal            |
| innodb_stats_on_metadata                  | ON                     |
| innodb_stats_sample_pages                 | 8                      |
| innodb_stats_update_need_lock             | 1                      |
| innodb_strict_mode                        | OFF                    |
| innodb_support_xa                         | ON                     |
| innodb_sync_spin_loops                    | 30                     |
| innodb_table_locks                        | ON                     |
| innodb_thread_concurrency                 | 0                      |
| innodb_thread_concurrency_timer_based     | OFF                    |
| innodb_thread_sleep_delay                 | 10000                  |
| innodb_use_global_flush_log_at_trx_commit | ON                     |
| innodb_use_native_aio                     | ON                     |
| innodb_use_sys_malloc                     | ON                     |
| innodb_use_sys_stats_table                | OFF                    |
| innodb_version                            | 1.1.8-rel28.1          |
| innodb_write_io_threads                   | 64                     |
+-------------------------------------------+------------------------+

Freejack

Oct 20, 2012, 12:07:05 PM10/20/12
to codersh...@googlegroups.com
Forgot to mention that only node1 is accepting write queries.  node2 is set as a standby and does not process queries.

Freejack

Oct 20, 2012, 12:42:18 PM10/20/12
to codersh...@googlegroups.com
Here are the wsrep settings:

mysql> show global variables like '%wsrep%'\G;
*************************** 1. row ***************************
Variable_name: wsrep_OSU_method
        Value: TOI
*************************** 2. row ***************************
Variable_name: wsrep_auto_increment_control
        Value: ON
*************************** 3. row ***************************
Variable_name: wsrep_causal_reads
        Value: OFF
*************************** 4. row ***************************
Variable_name: wsrep_certify_nonPK
        Value: ON
*************************** 5. row ***************************
Variable_name: wsrep_cluster_address
        Value: gcomm://10.7.25.110
*************************** 6. row ***************************
Variable_name: wsrep_cluster_name
        Value: maindb
*************************** 7. row ***************************
Variable_name: wsrep_convert_LOCK_to_trx
        Value: OFF
*************************** 8. row ***************************
Variable_name: wsrep_data_home_dir
        Value: /var/lib/mysql/
*************************** 9. row ***************************
Variable_name: wsrep_dbug_option
        Value: 
*************************** 10. row ***************************
Variable_name: wsrep_debug
        Value: OFF
*************************** 11. row ***************************
Variable_name: wsrep_drupal_282555_workaround
        Value: OFF
*************************** 12. row ***************************
Variable_name: wsrep_forced_binlog_format
        Value: NONE
*************************** 13. row ***************************
Variable_name: wsrep_max_ws_rows
        Value: 131072
*************************** 14. row ***************************
Variable_name: wsrep_max_ws_size
        Value: 1073741824
*************************** 15. row ***************************
Variable_name: wsrep_node_address
        Value: 
*************************** 16. row ***************************
Variable_name: wsrep_node_incoming_address
        Value: 10.7.25.111:3306
*************************** 17. row ***************************
Variable_name: wsrep_node_name
        Value: maindb-node0
*************************** 18. row ***************************
Variable_name: wsrep_notify_cmd
        Value: 
*************************** 19. row ***************************
Variable_name: wsrep_on
        Value: ON
*************************** 20. row ***************************
Variable_name: wsrep_provider
        Value: /usr/lib64/libgalera_smm.so
*************************** 21. row ***************************
Variable_name: wsrep_provider_options
        Value: base_host = 10.7.25.111; base_port = 4567; evs.debug_log_mask = 0x1; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.info_log_mask = 0; evs.install_timeout = PT15S; evs.join_retrans_period = PT0.3S; evs.keepalive_period = PT1S; evs.max_install_timeouts = 1; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.use_aggregate = true; evs.user_send_window = 2; evs.version = 0; evs.view_forget_timeout = PT5M; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 0.5; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; gmcast.listen_addr = tcp://0.0.0.0:4567; gmcast.mcast_addr = ; gmcast.mcast_ttl = 1; gmcast.peer_timeout = PT3S; gmcast.time_wait = PT5S; gmcast.version = 0; ist.recv_addr = 10.7.25.111; pc.checksum = true; pc.ignore_quorum = false; pc.ignore_sb = false; pc.linger = PT2S; pc.npvo = false; pc.version = 0; protonet.backend = asio; protonet.version = 0; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
*************************** 22. row ***************************
Variable_name: wsrep_recover
        Value: OFF
*************************** 23. row ***************************
Variable_name: wsrep_replicate_myisam
        Value: OFF
*************************** 24. row ***************************
Variable_name: wsrep_retry_autocommit
        Value: 1
*************************** 25. row ***************************
Variable_name: wsrep_slave_threads
        Value: 1
*************************** 26. row ***************************
Variable_name: wsrep_sst_auth
        Value: ********
*************************** 27. row ***************************
Variable_name: wsrep_sst_donor
        Value: 
*************************** 28. row ***************************
Variable_name: wsrep_sst_donor_rejects_queries
        Value: OFF
*************************** 29. row ***************************
Variable_name: wsrep_sst_method
        Value: mysqldump
*************************** 30. row ***************************
Variable_name: wsrep_sst_receive_address
        Value: 10.7.25.111
*************************** 31. row ***************************
Variable_name: wsrep_start_position
        Value: 00000000-0000-0000-0000-000000000000:-1

Henrik Ingo

Oct 20, 2012, 3:30:21 PM10/20/12
to Freejack, codersh...@googlegroups.com
Actually, both Mikkel and Jay have asked about a similar issue. Based on those:

Does the table in question have a primary key?
How many rows will the query delete? Is there any way to delete in
smaller chunks btw?

Jay had the theory that keeping wsrep_slave_threads=1 will fix such
issues, but you seem to have it at =1 and experience it nevertheless.

henrik



--
henri...@avoinelama.fi
+358-40-8211286 skype: henrik.ingo irc: hingo
www.openlife.cc

My LinkedIn profile: http://www.linkedin.com/profile/view?id=9522559

Freejack

Oct 20, 2012, 5:46:03 PM10/20/12
to codersh...@googlegroups.com, Freejack, henri...@avoinelama.fi
Hi.  The table does not have a primary key, but it does have an index on a date/time field.  This is the query that is run:

delete from messages where left(insert_time,10) <> '2012-10-18';

The insert_time field is the index.  This will delete about 90% of the rows in the table.  I don't think deleting in chunks will help in this case.

Thanks for your response.

Freejack

Oct 20, 2012, 6:49:15 PM10/20/12
to codersh...@googlegroups.com, Freejack, henri...@avoinelama.fi
In another thread it was recommended to set binlog_format to STATEMENT.  When I tried this I got the following:

121020 18:46:02 [ERROR] WSREP: Only binlog_format = 'ROW' is currently supported. Configured value: 'STATEMENT'. Please adjust your configuration.
121020 18:46:02 [ERROR] Aborting

Is there any way to override this?

Alex Yurchenko

Oct 20, 2012, 11:30:11 PM10/20/12
to codersh...@googlegroups.com
Hi,

1) You should be able to set it at runtime.

2) 'STATEMENT' format may help if you have memory issues with DELETE.
It is unlikely that it will help with speed.
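As a rough sketch of point 1 (depending on the build, the server may still warn about or refuse non-ROW formats while wsrep is on, and already-open sessions keep their old session value):

mysql> SET GLOBAL binlog_format = 'STATEMENT';
mysql> SET SESSION binlog_format = 'STATEMENT';  -- for the session that will run the DELETE

Remember to set it back to 'ROW' once the cleanup is done.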

Now, what is happening here:

Ranged DELETE is a heavy operation in general. Lack of primary keys
makes it even heavier for wsrep replication since it needs to generate
row IDs on its own. What you have there is that
- at first DELETE runs on a "master" node, which may take minutes, but
you don't notice it much because you have other sessions running in
parallel.
- then the result of DELETE is replicated (copied) to "slave". This is
where 'STATEMENT' format may help, since DELETE statement obviously
takes much less space than the binary representation of modified rows.
But it helps only if you have a very slow link or memory issues.
- then the "slave" starts to apply the event. In my experience, deleting
3M rows on the "slave" takes considerably less time than on the "master"
when using 'ROW' format. With 'STATEMENT' format I expect it to take
equally long.

Still, if it took 3 minutes on the "master", it may take something like
2 minutes on the "slave". Note that at this moment all the "master" has
to do is commit the changes to storage, which is relatively fast, while
the "slave" has yet to *apply* them first - and this is a time-consuming
operation. So "master" and "slave" are in very uneven positions here,
with the "master" ready to process new requests, while the "slave" still
has most of the work ahead of it.

Now, new replication events from "master" start to pile up on "slave" -
because its *only* slave thread is busy processing the DELETE. Since
we're trying to maintain some appearance of actual synchrony, "master"
blocks waiting for "slave" to finish its work and catch up.
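(You can usually watch this happening in the wsrep status counters, e.g.:

mysql> SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';

a growing receive queue on the "slave" together with a rising flow
control pause ratio on the "master" is the signature of this situation.)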

Had you had several slave threads configured, that issue could have
been mitigated to at least some extent, although for that period the
"slave" would be in an "eventually consistent" state. But you don't have
a primary key on that table, and it seems like deletes on tables without
PKs don't work that well with parallel applying.

Another option is to increase the slave lag tolerance. This is done on
the slave by increasing the flow control limits. E.g.

mysql> SET GLOBAL wsrep_provider_options="gcs.fc_limit=1000000; gcs.fc_factor=1.0";

This will allow the slave to fall behind by up to 1000000 transactions -
but it also means that most writes to that node will fail certification.

Another thing is that you can probably use the DELETE ... LIMIT ...
syntax to break the DELETE into smaller chunks.
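For example, something along these lines (the table and date are taken
from earlier in the thread; the batch size is only illustrative):

mysql> DELETE FROM messages WHERE LEFT(insert_time,10) <> '2012-10-18' LIMIT 10000;

and repeat it (or loop it in a script) until it reports 0 rows affected.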

Regards,
Alex


--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011

Henrik Ingo

Oct 21, 2012, 7:56:53 AM10/21/12
to Freejack, codersh...@googlegroups.com
That's the problem then. This happens on deletes to tables without a
primary key (or at least a unique key).

If there is no column in the table that you could make a primary key,
then you could simply add an "id" field with auto_increment. This
should make the problem go away.
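Concretely, that could look something like this (a sketch only - the
column type is an assumption, and see the caveat below before running
it on a live cluster):

mysql> ALTER TABLE messages ADD COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;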

However, note that we have just discovered another problem here. With
a default Galera installation, doing:

ALTER TABLE ... ADD COLUMN id ... auto_increment...

has surprising side effects. For existing rows the new column will be
populated with id values; however, they will be different on each node.
This of course has very bad consequences!

If you can afford some planned downtime, then the easiest way to avoid
this problem is to shut down all nodes but one, then do the ALTER
TABLE (on all MySQL versions prior to 5.6 this will block other
queries to that table).

When ALTER TABLE is finished, bring back the other nodes like this:
- delete /var/lib/mysql/grastate.dat
- join the nodes normally
- note that unless you use the xtrabackup SST, the first joining node
will again cause the online node to be blocked for the duration of the
snapshot.


There is a way to do the ALTER TABLE ... ADD ... auto_increment in a
way that doesn't cause any downtime. Explaining that takes a bit
longer though. I will have to do it another time...

henrik

Henrik Ingo

unread,
Oct 21, 2012, 8:02:37 AM10/21/12
to Freejack, codersh...@googlegroups.com
Oh, I forgot to address the most obvious part:

On Sun, Oct 21, 2012 at 12:46 AM, Freejack <free...@gmail.com> wrote:
> delete from messages where left(insert_time,10) <> '2012-10-18';
>
> The insert_time field is the index. This will delete about 90% of the rows
> in the table. I don't think deleting in chunks will help in this case.

It looks like you are simply deleting some old rows that are no longer
used. Why can't you split this into smaller chunks? Just add LIMIT 20
to the end of the query and run it several times.

This might very well help as a workaround, until you are able to add
the primary key.

henrik

Freejack

Oct 21, 2012, 3:35:30 PM10/21/12
to codersh...@googlegroups.com
Thanks for all the really useful feedback guys.  Will work on this tonight and let you know how it goes.

Jay Janssen

Oct 21, 2012, 4:45:02 PM10/21/12
to henri...@avoinelama.fi, Freejack, codersh...@googlegroups.com
On Oct 20, 2012, at 2:30 PM, Henrik Ingo <henri...@avoinelama.fi> wrote:

> Jay had the theory that keeping wsrep_slave_threads=1 will fix such
> issues, but you seem to have it at =1 and experience it nevertheless.

My issue resulted in inconsistent cluster nodes (and a forced SST), not
the particular symptoms he is describing.


Jay Janssen, Senior MySQL Consultant, Percona Inc.
Percona Live in London, UK Dec 3-4th: http://www.percona.com/live/london-2012/

Jay Janssen

Oct 21, 2012, 4:46:26 PM10/21/12
to henri...@avoinelama.fi, Freejack, codersh...@googlegroups.com
On Oct 21, 2012, at 6:56 AM, Henrik Ingo <henri...@avoinelama.fi> wrote:

> ALTER TABLE ... ADD COLUMN id ... auto_increment...
>
> has surprising side effects. For existing rows the new column will be
> populated with id values; however, they will be different on each node.
> This of course has very bad consequences!

Henrik,
  I presume this was simply wsrep_auto_increment_control biting you?

Alex Yurchenko

Oct 22, 2012, 12:46:17 PM10/22/12
to codersh...@googlegroups.com
Hi everybody,

Just thought that I should elaborate on this a bit more.

This problem is not Galera specific. Huge transactions, including long
ranged UPDATE/INSERT/DELETE autocommit statements, are a bane of any
"bottomware" (as opposed to "middleware") replication, including
native MySQL async replication, and are a major source of slave lag
there. This is because the slave can start processing the transaction
only *after* the master has completed it. Splitting such big events
into smaller ones is a *generally* good practice that allows the slave
to run more concurrently with the master.

Regards,
Alex

Freejack

Oct 22, 2012, 5:16:55 PM10/22/12
to codersh...@googlegroups.com
Adding a PK to the table resolved the issue for us.  Whereas before it would take 6 hours for the replication to complete, now it only takes 5 seconds.  We will look into breaking up the queries for even better performance.  Thanks for your suggestions.