Hi all. Just deployed a second galera cluster to our production environment and we're seeing an issue where replicating a large delete query to the secondary node will prevent writes on the first node. node2 seems to be stuck running Delete_rows_log_event::find_row(137577) while the write queries are piling up on node1 to the point where all connections are used up and the db is no longer accessible.
Based on another similar thread, this may possibly be due to my innodb settings although I can't rule out galera at this point. Please see below. Any suggestions/tips would be hugely appreciated! Thanks!
On Sat, Oct 20, 2012 at 7:04 PM, Freejack <freej...@gmail.com> wrote:
> Hi all. Just deployed a second galera cluster to our production environment
> and we're seeing an issue where replicating a large delete query to the
> secondary node will prevent writes on the first node. node2 seems to be
> stuck running Delete_rows_log_event::find_row(137577) while the write
> queries are piling up on node1 to the point where all connections are used
> up and the db is no longer accessible.
> Based on another similar thread, this may possibly be due to my innodb
> settings although I can't rule out galera at this point. Please see below.
> Any suggestions/tips would be hugely appreciated! Thanks!
On Saturday, October 20, 2012 3:30:21 PM UTC-4, Henrik Ingo wrote:
> Actually, both Mikkel and Jay have asked about similar issue. Based on > those:
> Does the table in question have a primary key? > How many rows will the query delete? Is there any way to delete in > smaller chunks btw?
> Jay had the theory that keeping wsrep_slave_threads=1 will fix such > issues, but you seem to have it at =1 and experience it nevertheless.
> henrik
> On Sat, Oct 20, 2012 at 7:04 PM, Freejack <free...@gmail.com <javascript:>> > wrote: > > Hi all. Just deployed a second galera cluster to our production > environment > > and we're seeing an issue where replicating a large delete query to the > > secondary node will prevent writes on the first node. node2 seems to be > > stuck running Delete_rows_log_event::find_row(137577) while the write > > queries are piling up on node1 to the point where all connections are > used > > up and the db is no longer accessible.
> > Based on another similar thread, this may possibly be due to my innodb > > settings although I can't rule out galera at this point. Please see > below. > > Any suggestions/tips would be hugely appreciated! Thanks!
> In another thread it was recommended to set binlog_format to > STATEMENT.
> When I tried this I got the following:
> 121020 18:46:02 [ERROR] WSREP: Only binlog_format = 'ROW' is > currently
> supported. Configured value: 'STATEMENT'. Please adjust your > configuration.
> 121020 18:46:02 [ERROR] Aborting
> Is there anyway to override this?
Hi,
1) You should be able to set it in runtime.
2) 'STATEMENT' format may help if you have memory issues with DELETE. It is unlikely that it will help with speed.
Now, what is happening here:
Ranged DELETE is a heavy operation in general. Lack of primary keys makes it even heavier for wsrep replication since it needs to generate row IDs on its own. What you have there is that
- at first DELETE runs on a "master" node, which may take minutes, but you don't notice it much because you have other sessions running in parallel.
- then the result of DELETE is replicated (copied) to "slave". This is where 'STATEMENT' format may help, since DELETE statement obviously takes much less space than the binary representation of modified rows. But it helps only if you have a very slow link or memory issues.
- then the "slave" starts to apply the event. In my experience deleting 3M rows on "slave" takes considerably shorter than on "master" when using 'ROW' format. With 'STATEMENT' format I expect it to take equally long.
Still, if it took 3 minutes on "master", it may take like 2 minutes on "slave". Note that at this moment all the "master" has to do is just commit the changes to storage, which is relatively fast, while slave has yet to *apply* them first - and this is a time consuming operation. So "master" and "slave" are in a very uneven positions here, with "master" ready to process new requests, while slave still has most of the work ahead of it.
Now, new replication events from "master" start to pile up on "slave" - because its *only* slave thread is busy processing the DELETE. Since we're trying to maintain some appearance of actual synchrony, "master" blocks waiting for "slave" to finish its work and catch up.
Had you have several slave threads configured that issue could have been mitigated to at least some extent, although for that period the "slave" would be in "eventually consistent" state. But you don't have primary keys on that table, and it seems like deletes on tables without PKs don't work that well with parallel applying.
Another option is to increase slave lag tolerance. This is done on slave by increasing flow control limits. E.g.
mysql> SET GLOBAL wsrep_provider_options="gcs.fc_limit=1000000; gc.fc_factor=1.0";
This will allow slave to fall behind by 1000000 transactions - but it also means that most of writes to that node will fail certification.
Another thing is that you can probably use DELETE...LIMIT... syntax to make DELETE less chunky.
Regards,
Alex
-- Alexey Yurchenko,
Codership Oy, www.codership.com Skype: alexey.yurchenko, Phone: +358-400-516-011
That's the problem then. This happens on deletes to tables without
primary key (or at least a unique key).
If there is no column in the table that you could make a primary key,
then you could simply add an "id" field with auto_increment. This
should help the problem go away.
However, note that we have just discovered another problem here. With
a default Galera installation, doing:
ALTER TABLE ... ADD COLUMN id ... auto_increment...
has surprising side effects. For existing rows the new column will be
populated with id values, however they will be different on each node.
This of course has very bad consequences!
If you can afford some planned downtime, then the easiest way to avoid
this problem is to shut down all nodes but one, then do the ALTER
TABLE (on all MySQL versions prior to 5.6 this will block other
queries to that table).
When ALTER TABLE is finished, bring back the other nodes like this:
- delete /var/lib/mysql/grastate.dat
- join the nodes normally
- note that unless you use xtrabackup sst, the first joining node
will again cause the online node to be block for the time of the
snapshot.
There is a way to do the ALTER TABLE ... ADD ... auto_increment in a
way that doesn't cause any downtime. Explaining that takes a bit
longer though. I will have to do it another time...
On Sun, Oct 21, 2012 at 12:46 AM, Freejack <freej...@gmail.com> wrote:
> Hi. The table does not have a primary key but it does have an index which
> is a date/time field. This is the query that is run:
> delete from messages where left(insert_time,10) <> '2012-10-18';
> The insert_time field is the index. This will delete about 90% of the rows
> in the table. I don't think deleting in chunks will help in this case.
> Thanks for your response.
> On Saturday, October 20, 2012 3:30:21 PM UTC-4, Henrik Ingo wrote:
>> Actually, both Mikkel and Jay have asked about similar issue. Based on
>> those:
>> Does the table in question have a primary key?
>> How many rows will the query delete? Is there any way to delete in
>> smaller chunks btw?
>> Jay had the theory that keeping wsrep_slave_threads=1 will fix such
>> issues, but you seem to have it at =1 and experience it nevertheless.
>> henrik
>> On Sat, Oct 20, 2012 at 7:04 PM, Freejack <free...@gmail.com> wrote:
>> > Hi all. Just deployed a second galera cluster to our production
>> > environment
>> > and we're seeing an issue where replicating a large delete query to the
>> > secondary node will prevent writes on the first node. node2 seems to be
>> > stuck running Delete_rows_log_event::find_row(137577) while the write
>> > queries are piling up on node1 to the point where all connections are
>> > used
>> > up and the db is no longer accessible.
>> > Based on another similar thread, this may possibly be due to my innodb
>> > settings although I can't rule out galera at this point. Please see
>> > below.
>> > Any suggestions/tips would be hugely appreciated! Thanks!
On Sun, Oct 21, 2012 at 12:46 AM, Freejack <freej...@gmail.com> wrote:
> delete from messages where left(insert_time,10) <> '2012-10-18';
> The insert_time field is the index. This will delete about 90% of the rows
> in the table. I don't think deleting in chunks will help in this case.
It looks like you are simply deleting some old rows that are no longer
used. Why can't you split this into smaller chunks? Just add LIMIT 20
to the end of the query and run it several times.
This might very well help as a workaround, until you are able to add
the primary key.
On Oct 21, 2012, at 6:56 AM, Henrik Ingo <henrik.i...@avoinelama.fi> wrote:
> ALTER TABLE ... ADD COLUMN id ... auto_increment...
> has surprising side effects. For existing rows the new column will be
> populated with id values, however they will be different on each node.
> This of course has very bad consequences!
Henrik,
I presume this was simply wsrep_autoincrement_control biting you?
Just thought that I should elaborate on this a bit more.
This problem is not Galera specific. Huge transactions, including long ranged UPDATE/INSERT/DELETE autocommit statements, is a bane of any "bottomware" (as opposed to "middleware") replication, including the native MySQL async replication and is a major source of slave lag there. It is due to slave being able to start processing the transaction only *after* master has completed it. Splitting such big events into smaller ones is a *generally* good practice that allows slave to run more concurrently with master.
> On 2012-10-21 01:49, Freejack wrote:
>> In another thread it was recommended to set binlog_format to >> STATEMENT.
>> When I tried this I got the following:
>> 121020 18:46:02 [ERROR] WSREP: Only binlog_format = 'ROW' is >> currently
>> supported. Configured value: 'STATEMENT'. Please adjust your >> configuration.
>> 121020 18:46:02 [ERROR] Aborting
>> Is there anyway to override this?
> Hi,
> 1) You should be able to set it in runtime.
> 2) 'STATEMENT' format may help if you have memory issues with DELETE.
> It is unlikely that it will help with speed.
> Now, what is happening here:
> Ranged DELETE is a heavy operation in general. Lack of primary keys
> makes it even heavier for wsrep replication since it needs to > generate
> row IDs on its own. What you have there is that
> - at first DELETE runs on a "master" node, which may take minutes,
> but you don't notice it much because you have other sessions running
> in parallel.
> - then the result of DELETE is replicated (copied) to "slave". This
> is where 'STATEMENT' format may help, since DELETE statement > obviously
> takes much less space than the binary representation of modified > rows.
> But it helps only if you have a very slow link or memory issues.
> - then the "slave" starts to apply the event. In my experience
> deleting 3M rows on "slave" takes considerably shorter than on
> "master" when using 'ROW' format. With 'STATEMENT' format I expect it
> to take equally long.
> Still, if it took 3 minutes on "master", it may take like 2 minutes
> on "slave". Note that at this moment all the "master" has to do is
> just commit the changes to storage, which is relatively fast, while
> slave has yet to *apply* them first - and this is a time consuming
> operation. So "master" and "slave" are in a very uneven positions
> here, with "master" ready to process new requests, while slave still
> has most of the work ahead of it.
> Now, new replication events from "master" start to pile up on "slave"
> - because its *only* slave thread is busy processing the DELETE. > Since
> we're trying to maintain some appearance of actual synchrony, > "master"
> blocks waiting for "slave" to finish its work and catch up.
> Had you have several slave threads configured that issue could have
> been mitigated to at least some extent, although for that period the
> "slave" would be in "eventually consistent" state. But you don't have
> primary keys on that table, and it seems like deletes on tables
> without PKs don't work that well with parallel applying.
> Another option is to increase slave lag tolerance. This is done on
> slave by increasing flow control limits. E.g.
> mysql> SET GLOBAL wsrep_provider_options="gcs.fc_limit=1000000;
> gc.fc_factor=1.0";
> This will allow slave to fall behind by 1000000 transactions - but it
> also means that most of writes to that node will fail certification.
> Another thing is that you can probably use DELETE...LIMIT... syntax
> to make DELETE less chunky.
> Regards,
> Alex
-- Alexey Yurchenko,
Codership Oy, www.codership.com Skype: alexey.yurchenko, Phone: +358-400-516-011
Adding a PK to the table resolved the issue for us. Whereas before it would take 6 hours for the replication to complete, now it only take 5 seconds. We will look into breaking up the queries for even better performance. Thanks for your suggestion.
On Monday, October 22, 2012 12:46:19 PM UTC-4, Alexey Yurchenko wrote:
> Hi everybody,
> Just thought that I should elaborate on this a bit more.
> This problem is not Galera specific. Huge transactions, including long > ranged UPDATE/INSERT/DELETE autocommit statements, is a bane of any > "bottomware" (as opposed to "middleware") replication, including the > native MySQL async replication and is a major source of slave lag there. > It is due to slave being able to start processing the transaction only > *after* master has completed it. Splitting such big events into smaller > ones is a *generally* good practice that allows slave to run more > concurrently with master.
> Regards, > Alex
> On 2012-10-21 06:30, Alex Yurchenko wrote: > > On 2012-10-21 01:49, Freejack wrote: > >> In another thread it was recommended to set binlog_format to > >> STATEMENT. > >> When I tried this I got the following:
> >> 121020 18:46:02 [ERROR] WSREP: Only binlog_format = 'ROW' is > >> currently > >> supported. Configured value: 'STATEMENT'. Please adjust your > >> configuration. > >> 121020 18:46:02 [ERROR] Aborting
> >> Is there anyway to override this?
> > Hi,
> > 1) You should be able to set it in runtime.
> > 2) 'STATEMENT' format may help if you have memory issues with DELETE. > > It is unlikely that it will help with speed.
> > Now, what is happening here:
> > Ranged DELETE is a heavy operation in general. Lack of primary keys > > makes it even heavier for wsrep replication since it needs to > > generate > > row IDs on its own. What you have there is that > > - at first DELETE runs on a "master" node, which may take minutes, > > but you don't notice it much because you have other sessions running > > in parallel. > > - then the result of DELETE is replicated (copied) to "slave". This > > is where 'STATEMENT' format may help, since DELETE statement > > obviously > > takes much less space than the binary representation of modified > > rows. > > But it helps only if you have a very slow link or memory issues. > > - then the "slave" starts to apply the event. In my experience > > deleting 3M rows on "slave" takes considerably shorter than on > > "master" when using 'ROW' format. With 'STATEMENT' format I expect it > > to take equally long.
> > Still, if it took 3 minutes on "master", it may take like 2 minutes > > on "slave". Note that at this moment all the "master" has to do is > > just commit the changes to storage, which is relatively fast, while > > slave has yet to *apply* them first - and this is a time consuming > > operation. So "master" and "slave" are in a very uneven positions > > here, with "master" ready to process new requests, while slave still > > has most of the work ahead of it.
> > Now, new replication events from "master" start to pile up on "slave" > > - because its *only* slave thread is busy processing the DELETE. > > Since > > we're trying to maintain some appearance of actual synchrony, > > "master" > > blocks waiting for "slave" to finish its work and catch up.
> > Had you have several slave threads configured that issue could have > > been mitigated to at least some extent, although for that period the > > "slave" would be in "eventually consistent" state. But you don't have > > primary keys on that table, and it seems like deletes on tables > > without PKs don't work that well with parallel applying.
> > Another option is to increase slave lag tolerance. This is done on > > slave by increasing flow control limits. E.g.
> > mysql> SET GLOBAL wsrep_provider_options="gcs.fc_limit=1000000; > > gc.fc_factor=1.0";
> > This will allow slave to fall behind by 1000000 transactions - but it > > also means that most of writes to that node will fail certification.
> > Another thing is that you can probably use DELETE...LIMIT... syntax > > to make DELETE less chunky.