queue retry policy

7 views

Skip to first unread message

Andrei Matei

unread,

Aug 31, 2018, 3:02:47 PM8/31/18

to Alex Robinson, CockroachDB

Hey Alex, all,

Is anyone qualified to speak to the retry policy that our queues (and the split queue in particular) employ?
I debugged this thing recently where a few ranges couldn't be split (old issue, a large abort span is likely generating a command to large for Raft). Those ranges (~ as many as the queue's concurrency) were the only ones attempted by the split queue, over and over again, so all the others were starving.

What I'd like to understand is if/to what extent any thought and/or code has ever been put into these retries. What would we like to happen in these situations?
I know that the queues have a notion of priority per range (which in the case of the split one is presumably tied to a range's size (in my case, presumably these ranges had high priorities). I also think I know that, if the scanner runs into a range that is currently being processed by a queue, it sets a "reque" bit on it, which makes it re-enter the queue immediately on error. I think in my case this might have been playing a role, since the processing of one of these range was taking minutes each time (so perhaps enough for the scanner to get to the range again).
Separately, I know we also have a "purgatory" about which I'm not entirely clear - its role is to speed up even more the reprocessing on some select errors? But I guess it is separate from the queue's concurrency limit, and so it's not supposed to starve it?

Can anyone describe our policy with some clarity or offer an opinion about what, if anything, is to blame for this apparent starvation scenario?

Thanks!

Nikhil Benesch

unread,

Aug 31, 2018, 11:52:33 PM8/31/18

to Andrei Matei, Alex Robinson, CockroachDB

Purgatory was originally introduced for the replicate queue to prevent a one-node cluster from continually trying and failing to upreplicate every single range. Ranges in the replicate queue's purgatory are only eligible for re-processing when new store information is gossiped, indicating that a new replication target is likely available.

In the split queue, purgatory ranges are instead eligible for retry every minute. It's possible that, if the failing splits are taking a full minute to fail, we end up spinning hot on those failing splits and never make progress on anything else. This is unfortunate. Putting these unsplittable ranges into purgatory was actually supposed to make the situation better. (When a split on a given range is destined to fail but fails in, say, 1s instead of 1m, putting that range into purgatory actually improves the situation since it decreases the frequency of retries.)

It's possible we should better tune this logic to prevent starvation—but minutes-long splits are horrible in their own right. Don't they block all traffic to the ranges while they're executing?

--
You received this message because you are subscribed to the Google Groups "CockroachDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cockroach-db+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cockroach-db/CAPqkKgm%2BsQRuWnNUdEKUJ0u4VwSLfZ5DmXU3JeAmUg6FTcTL1A%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Peter Mattis

unread,

Sep 1, 2018, 12:55:43 PM9/1/18

to Nikhil Benesch, Andrei Matei, Alex Robinson, CockroachDB

For the most part, are retry policy for queues is simplistic. Yes, we have purgatory errors for when we know the current cluster configuration is unable to handle an operation (i.e. up-replication in a single-node cluster), but I think there are a lot of other errors, especially in queues other than the replicate queue, where we retry on error too quickly. The split queue places ranges in purgatory if they can't be split due to size, though I wonder what problems that might be causing as I think some other queues don't operate on a range which requires a split (see `AcceptsUnsplitRanges`). My guess is that if you suspect there is an error condition that is not being handled properly by the queues, you are probably right.

To view this discussion on the web visit https://groups.google.com/d/msgid/cockroach-db/CAA-gNq%2BR1NJajTWoRib_h4GW30mJTBahnWUvRuZ95W3eRxjmyw%40mail.gmail.com.

Peter Mattis

unread,

Sep 1, 2018, 1:00:04 PM9/1/18

to Nikhil Benesch, Andrei Matei, Alex Robinson, CockroachDB

As an example of badness with accepts-unsplit-ranges, both the replicate queue and the GC queue do not process ranges which require splitting. If we can't split a range for some reason, we'll allow MVCC garbage to build up slowing down operations. If we can't split a range we won't up-replicate it if one of its replicas is on a dead node. There are good reasons behind these decisions. My point here is that not being able to split a range is a serious condition that deserves to be brought to the operators attention.

Nathan VanBenschoten

unread,

Sep 1, 2018, 2:32:02 PM9/1/18

to Peter Mattis, Nikhil Benesch, Andrei Matei, Alex Robinson, CockroachDB

We have an open issue to allow the replicate queue to process ranges even if they could use a split: https://github.com/cockroachdb/cockroach/issues/25047.

We should push to eliminate these dependencies between queues as much as possible because we have seen in the past (e.g. https://github.com/cockroachdb/cockroach/pull/20589) that they introduce subtle deadlock scenarios which are difficult to track down. We're also in a better position than we were at the time that most of the accepts-unsplit-ranges checks were added because write backpressure should ensure that ranges won't grow indefinitely even when splits are delayed.

On Sat, Sep 1, 2018 at 1:00 PM Peter Mattis <pe...@cockroachlabs.com> wrote:

As an example of badness with accepts-unsplit-ranges, both the replicate queue and the GC queue do not process ranges which require splitting. If we can't split a range for some reason, we'll allow MVCC garbage to build up slowing down operations. If we can't split a range we won't up-replicate it if one of its replicas is on a dead node. There are good reasons behind these decisions. My point here is that not being able to split a range is a serious condition that deserves to be brought to the operators attention.

On Sat, Sep 1, 2018 at 12:55 PM, Peter Mattis <pe...@cockroachlabs.com> wrote:

For the most part, are retry policy for queues is simplistic. Yes, we have purgatory errors for when we know the current cluster configuration is unable to handle an operation (i.e. up-replication in a single-node cluster), but I think there are a lot of other errors, especially in queues other than the replicate queue, where we retry on error too quickly. The split queue places ranges in purgatory if they can't be split due to size, though I wonder what problems that might be causing as I think some other queues don't operate on a range which requires a split (see `AcceptsUnsplitRanges`). My guess is that if you suspect there is an error condition that is not being handled properly by the queues, you are probably right.

On Fri, Aug 31, 2018 at 11:51 PM, Nikhil Benesch <ben...@cockroachlabs.com> wrote:

Purgatory was originally introduced for the replicate queue to prevent a one-node cluster from continually trying and failing to upreplicate every single range. Ranges in the replicate queue's purgatory are only eligible for re-processing when new store information is gossiped, indicating that a new replication target is likely available.

In the split queue, purgatory ranges are instead eligible for retry every minute. It's possible that, if the failing splits are taking a full minute to fail, we end up spinning hot on those failing splits and never make progress on anything else. This is unfortunate. Putting these unsplittable ranges into purgatory was actually supposed to make the situation better. (When a split on a given range is destined to fail but fails in, say, 1s instead of 1m, putting that range into purgatory actually improves the situation since it decreases the frequency of retries.)

It's possible we should better tune this logic to prevent starvation—but minutes-long splits are horrible in their own right. Don't they block all traffic to the ranges while they're executing?

On Fri, Aug 31, 2018 at 3:02 PM, Andrei Matei <and...@cockroachlabs.com> wrote:

Hey Alex, all,

Is anyone qualified to speak to the retry policy that our queues (and the split queue in particular) employ?
I debugged this thing recently where a few ranges couldn't be split (old issue, a large abort span is likely generating a command to large for Raft). Those ranges (~ as many as the queue's concurrency) were the only ones attempted by the split queue, over and over again, so all the others were starving.
What I'd like to understand is if/to what extent any thought and/or code has ever been put into these retries. What would we like to happen in these situations?
I know that the queues have a notion of priority per range (which in the case of the split one is presumably tied to a range's size (in my case, presumably these ranges had high priorities). I also think I know that, if the scanner runs into a range that is currently being processed by a queue, it sets a "reque" bit on it, which makes it re-enter the queue immediately on error. I think in my case this might have been playing a role, since the processing of one of these range was taking minutes each time (so perhaps enough for the scanner to get to the range again).
Separately, I know we also have a "purgatory" about which I'm not entirely clear - its role is to speed up even more the reprocessing on some select errors? But I guess it is separate from the queue's concurrency limit, and so it's not supposed to starve it?

Can anyone describe our policy with some clarity or offer an opinion about what, if anything, is to blame for this apparent starvation scenario?

Thanks!

--
You received this message because you are subscribed to the Google Groups "CockroachDB" group.

To unsubscribe from this group and stop receiving emails from it, send an email to cockroach-db...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/cockroach-db/CAPqkKgm%2BsQRuWnNUdEKUJ0u4VwSLfZ5DmXU3JeAmUg6FTcTL1A%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "CockroachDB" group.

To unsubscribe from this group and stop receiving emails from it, send an email to cockroach-db...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/cockroach-db/CAA-gNq%2BR1NJajTWoRib_h4GW30mJTBahnWUvRuZ95W3eRxjmyw%40mail.gmail.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "CockroachDB" group.

To unsubscribe from this group and stop receiving emails from it, send an email to cockroach-db...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cockroach-db/CANKgOKh3Dr3WWBu35E5LFXCgwuQ%2BtsRiAu1OiKS_2aCiSMjLCA%40mail.gmail.com.

Andrei Matei

unread,

Sep 5, 2018, 5:31:52 PM9/5/18

to Nathan VanBenschoten, Peter Mattis, Nikhil Benesch, Alex Robinson, CockroachDB

> The split queue places ranges in purgatory if they can't be split due to size

Citation needed pls, as I don't think this is what I was seeing (at least in 2.0.2). To make sure we understand each other, when you say "can't be split due to size", you mean what I mean (the split Raft command ends up being over 64MB), right?

Nathan VanBenschoten

unread,

Sep 5, 2018, 5:38:10 PM9/5/18

to Andrei Matei, Peter Mattis, Nikhil Benesch, Alex Robinson, CockroachDB

This has only been true since https://github.com/cockroachdb/cockroach/pull/25245, which is not present in 2.0.

To make sure we understand each other, when you say "can't be split due to size", you mean what I mean (the split Raft command ends up being over 64MB), right?

No, ranges are placed in the split queue purgatory if they are "unsplittable", which for all intents and purposes means that they consist of only a single SQL row (with possible many MVCC versions).

Reply all

Reply to author

Forward

0 new messages