let's close the door to deferring WriteTooOld errors


Andrei Matei

Oct 30, 2019, 6:50:21 PM
to CockroachDB, KV, Spencer Kimball
TL;DR CRDB today never defers returning WriteTooOld errors, but we maintain a bunch of code for allowing us to start deferring them again. I'd like to get rid of this code. 

I'd like to get rid of the Transaction.write_too_old flag and I want to see if anyone has any opinions.

Cockroach historically has generally taken the position that, upon conflicts, a transaction should go on for as long as possible and lay down as many intents as possible before restarting. The idea is that, by laying down a bunch of intents, the transaction has a higher probability of being able to commit after a restart because other conflicting transactions will have been blocked by those intents. This meant we'd defer returning/handling various serializability errors - timestamp pushes and "write too old" conditions. We'd keep track of conflicts and handle them at EndTransaction time (generally by generating a retriable error).

Over time we've dented this position. For example:
  • in 16874 we've made transactions that can still auto-retry do so eagerly on pushes
  • in 18858 we've made transactions that don't use `savepoint cockroach_restart` get push errors eagerly since as far as the database is concerned, they're not going to be retried
  • in 22234 we've made CPuts return WriteTooOld errors eagerly to give the refresh mechanism a chance to swallow the error (later it'd be too late for refreshes since the CPut implies a read on the recently-written key)
  • in 38668 we've made Puts work like CPuts and return WriteTooOld errors eagerly, this time in order to give auto-retries a chance
The effect of #38668 was that now we never defer WriteTooOldErrors. However, we kept a bunch of code around to allow us to do so again in the future; the PR in question introduced an option for the client to tell the server whether the client wants deferred WTOE or not. The idea was that SQL would indirectly set that if it can still do automatic retries (*). This improvement never happened and so that field is never set - resulting in the never-deferred WTO status quo.

I'd like to propose that we embrace the status quo and simplify a bunch of code around the nonexistent deferred WTO errors - in particular around the Transaction.write_too_old flag. That flag currently appears never to be set, either on a response from the server to the client or in a request from the client to the server. Which means that its main reason for existing - the server returning a retriable error at EndTransaction time - is also dead code. So I'd like to get rid of the flag. It would simplify a bunch of code, and everything we can take out of our transactions' state frees up some of our stretched complexity budget.
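To make the dead code concrete, here is a toy Go sketch of the deferred-handling shape that would go away (hypothetical names and simplified types, not the actual roachpb/batcheval code): since nothing sets the flag any more, the EndTransaction-time branch can never fire.

package main

import "fmt"

// Txn is a stand-in for the transaction proto; WriteTooOld mirrors the flag
// discussed above (a simplification, not the real roachpb.Transaction).
type Txn struct {
	WriteTooOld bool
}

// endTxn sketches the deferred-WTO path: with deferral, the retriable error
// would only surface here, at commit time. Since the flag is never set today,
// this branch is dead code.
func endTxn(txn *Txn) error {
	if txn.WriteTooOld {
		return fmt.Errorf("TransactionRetryError: RETRY_WRITE_TOO_OLD (deferred)")
	}
	return nil // commit proceeds
}

func main() {
	fmt.Println(endTxn(&Txn{WriteTooOld: false})) // always <nil> given today's behavior
}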

In doing so, we'd be giving up the ability to easily return to the "let's lay down as many intents as possible" behavior. When this came up once before, Spencer opposed it on the argument that, at least for Puts, not deferring the WTO errors can result in quadratic behavior whereby a transaction has to refresh once for each conflict, and each refresh needs to read more and more data. I think this behavior is theoretical. At the moment, not only is there no deferment and the world has not collapsed, but in fact we don't even lay down a single intent on a WTO (**), which would help prevent starvation in the case where the transaction is forced to restart.
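For what it's worth, here's the back-of-the-envelope version of that quadratic argument as a toy Go snippet (purely illustrative, no real CRDB code involved): if each of n writes hits a conflict and every conflict forces an eager refresh that re-validates everything read so far, the total re-read work grows as n*(n-1)/2.

package main

import "fmt"

// totalReReads counts key re-reads if every one of n conflicting writes
// triggers an eager refresh that re-validates all the reads performed before it.
func totalReReads(n int) int {
	total := 0
	for i := 0; i < n; i++ {
		total += i // the i-th refresh re-reads the i spans read so far
	}
	return total
}

func main() {
	for _, n := range []int{10, 100, 1000} {
		fmt.Printf("n=%d conflicts -> %d re-reads (~ n^2/2)\n", n, totalReReads(n))
	}
}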

The question then becomes, I guess, whether we should continue deferring the handling of pushes given that we no longer like deferring WTO. An argument Ben has made for continuing to defer those is that pushes are more common than WTO, and in particular it's more common for a txn to be pushed multiple times over its lifetime, whereas if you order your writes somewhat (which you should generally try to do for deadlock avoidance), multiple WTOs are less common. So I'd leave deferment of pushes intact.

Thanks for reading,

- a_m


(*) I think (and I think that I've also convinced everybody) that this was the wrong way to go. Instead of having the client customize the server's behavior, it's better to put more code in the client to decide when to retry eagerly and when not to (and we already have such code, albeit not perfect).
(**) This is a result of #38668 and it's unclear if it was intentional. I intend to fix it so that we leave behind the intents from the batch encountering the conflict.

Ben Darnell

Oct 30, 2019, 7:08:08 PM
to Andrei Matei, CockroachDB, KV, Spencer Kimball
On Wed, Oct 30, 2019 at 6:50 PM Andrei Matei <and...@cockroachlabs.com> wrote:
At the moment, not only is there no deferment and the world has not collapsed

I generally agree with your conclusions here, but the fact that the world has not yet collapsed has limited weight. #38668 didn't go in until 19.2, which no one but us is running, so this could just be a blind spot in our testing. I'd like to see evidence that the world continues not to collapse after the release of 19.2 before committing to this path. 

-Ben 

Tobias Grieger

Oct 31, 2019, 5:03:40 AM
to Ben Darnell, Andrei Matei, CockroachDB, KV, Spencer Kimball
I'd also wait a bit until we've seen 19.2 perform in the wild. Down the line, to truly enable global transactions, we're thinking about introducing read locks, which, when held, could significantly cheapen the refresh (though a latency cost stays). But even in 20.1, it generally seems appropriate to refresh early by default and opt out of that behavior through SELECT FOR UPDATE, which should be able to address the pathological cases we're worried about here.
Come to think of it though, I wonder what the efficient way to use SELECT FOR UPDATE will be. For example, if the transaction carries out three CPuts each of which hits a different table, and each of them will catch a WriteTooOld, would we recommend to our customers something like this:

SELECT v FROM c WHERE id = 1 FOR UPDATE UNION ALL SELECT v FROM b WHERE id = 2 FOR UPDATE UNION ALL SELECT v FROM a WHERE id = 3 FOR UPDATE;

(I assume there's a benefit to keeping this all in the same transaction, namely avoiding uncertainty restarts or the like). It is somewhat awkward that SFU forces us to actually return rows to the client; it would be nicer to think of it as a blind write, which of course should never be able to get WriteTooOld.

Andrei Matei

Oct 31, 2019, 11:38:40 AM
to Tobias Grieger, Ben Darnell, CockroachDB, KV, Spencer Kimball
On Thu, Oct 31, 2019 at 5:03 AM Tobias Grieger <tob...@cockroachlabs.com> wrote:
I'd also wait a bit until we've seen 19.2 perform in the wild. Down the line, to truly enable global transactions, we're thinking about introducing read locks, which, when held, could significantly cheapen the refresh (though a latency cost stays). But even in 20.1, it generally seems appropriate to refresh early by default and opt out of that behavior through SELECT FOR UPDATE, which should be able to address the pathological cases we're worried about here.
Come to think of it though, I wonder what the efficient way to use SELECT FOR UPDATE will be. For example, if the transaction carries out three CPuts each of which hits a different table, and each of them will catch a WriteTooOld, would we recommend to our customers something like this:

SELECT v FROM c WHERE id = 1 FOR UPDATE UNION ALL SELECT v FROM b WHERE id = 2 FOR UPDATE UNION ALL SELECT v FROM a WHERE id = 3 FOR UPDATE;

In this scenario the client intends to INSERT into a, b, and c? Or where are the CPuts coming from?
If it's about upcoming INSERTs, mind spelling out for me what the point of the SFU is exactly (as opposed to just INSERTing from the get-go)?

Btw, I wasn't exactly imagining that the union you wrote would be executed particularly efficiently. There's no refreshing at statement boundaries; currently we only refresh when we get a BatchResponse. But there could be...

 

(I assume there's a benefit to keeping this all in the same transaction, namely avoiding uncertainty restarts or the like). It is somewhat awkward that SFU forces us to actually return rows to the client; it would be nicer to think of it as a blind write, which of course should never be able to get WriteTooOld.

Well but usually I think you do want your SFUs to return data - you're going to use that data for your upcoming UPDATE. I think the only blind writes we currently do are for UPSERTs. So in what case would a blind-write version of SFU help?

Nathan VanBenschoten

Oct 31, 2019, 12:24:06 PM
to Ben Darnell, Andrei Matei, CockroachDB, KV, Spencer Kimball
#38668 didn't go in until 19.2, which no one but us is running, so this could just be a blind spot in our testing. I'd like to see evidence that the world continues not to collapse after the release of 19.2 before committing to this path. 

This change also made it into 19.1.4 in https://github.com/cockroachdb/cockroach/pull/39087, so it has seen some testing in the wild.

I also agree with most of what Andrei said in the original thread. The introduction of transaction refreshing changed the equation and decisions made before we could refresh transactions likely no longer hold. That said, I have always found it difficult to reason about the performance impact of eager vs. deferred handling of these errors. What kinds of tests could we run to empirically test which approach is better?

The other thing to point out is that it's not even clear why we have the WriteTooOld flag anymore. For serializable transactions, I don't believe there's any difference between a transaction trying to write under a write and a transaction trying to write under a read. If we got rid of the WriteTooOld flag I think everything would still work because the EndTransaction would return a retry error when it notices a pushed timestamp. That lends some weight towards unifying the two concepts and getting rid of the concept of WriteTooOld entirely.

Tobias Grieger

Nov 1, 2019, 6:11:28 AM
to Nathan VanBenschoten, Ben Darnell, Andrei Matei, CockroachDB, KV, Spencer Kimball
I agree that the WriteTooOld flag is not needed any more. But I think we should generally move away from communicating transaction restarts through errors and instead trigger them based on introspection of the transaction state. So in a sense I agree with your stance to retry eagerly, but I'd rather get there by getting rid of WriteTooOldError and checking whether the txn got pushed instead.
To illustrate this, consider a deployment with a WAN RTT between regions of 100ms and a closed timestamp target duration of 25ms. This is extreme for the purpose of illustration, but we want our transactions to work in that setting because it allows us to read optimistically from followers (our only option to make global txns truly work).
The example transaction tries to do a few CPuts, all on remote leaseholders. This doesn't go smoothly:

- txn timestamp is 0
- txn contacts leaseholder: 50ms
- leaseholder evaluates CPut, but t=25 is already closed out, so txn gets pushed
- txn eagerly refreshes (works because no results returned yet)
- retries CPut, but runs into same problem

We basically need Nathan's idea from https://github.com/cockroachdb/cockroach/issues/36478#issuecomment-521903404 which says that once you've laid down an intent, you get to ignore closed timestamps. We can't lay down an intent AND return an error, so what we should do is lay down the intent and trigger the restart proactively by observing that the txn will need one.
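To spell out what "observing that the txn will need one" could look like at the gateway, here's a toy Go sketch (simplified timestamps and made-up names, not the actual kv client code):

package main

import "fmt"

// txnState is a toy stand-in for the transaction state the gateway already tracks.
type txnState struct {
	readTS  int64 // timestamp the txn's reads are at
	writeTS int64 // timestamp the txn's writes have been pushed to
}

// needsRefreshOrRestart: no error needed - the gateway can tell from state it
// already has that a refresh (or restart) is required before results can be
// returned to the client or the txn can commit.
func needsRefreshOrRestart(t txnState) bool {
	return t.writeTS > t.readTS
}

func main() {
	t := txnState{readTS: 0, writeTS: 25} // pushed by the closed timestamp, as in the example
	if needsRefreshOrRestart(t) {
		fmt.Println("refresh reads up to writeTS (or restart); the intents already laid down stay behind as locks")
	}
}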

We can (and should) take this a step further and make the intent that we're laying down useful. For example, a CPut that gets the WriteTooOld flag will today lay down its intent basing the CPut on the original read timestamp. It should instead base it on the latest write, at which point the transaction *must* refresh (or restart) before it returns the result to the client (otherwise, it's reading from an inconsistent snapshot). This saves a round of consensus because once the refresh (which does not need to refresh the CPut itself) is done, the result of the CPut can be given to the client. The above closed timestamp example would become

- txn timestamp is 0
- contact leaseholder: 50ms
- CPut gets evaluated at and pushed to t=25
- txn timestamp is 25, nothing to refresh
- contact leaseholder for second CPut: 50ms
- CPut gets eval'ed at and pushed to <some timestamp, doesn't matter>
- txn refreshes first CPut, can proceed to third CPut
- ...

As I mentioned earlier, we think we'll get to a place in which we can make these refreshes cheaper in the sense that they'll have to read very little data (because they can simply check a read lock) and where we can potentially even avoid some of them completely with some extra work.

Tobias Grieger

Nov 1, 2019, 6:20:42 AM
to Andrei Matei, Ben Darnell, CockroachDB, KV, Spencer Kimball
Yes, the clients insert into a, b, c. SELECT FOR UPDATE has a hard time avoiding restarts because it's... fundamentally also a SELECT? If you remove the select part it just becomes "throw a few writes in here, don't care about the result, push me to the final timestamp". This means to do this optimally, we want to continue past transaction retry errors (to get all of the locks in place), and ideally we're even placing the locks in parallel or the latency cost could be significant. In lieu of that, we need to put all SFUs into a single statement at the beginning of the txn, so that it at least will retry until it's got them all. At which point it becomes more cumbersome to actually use the results if the app's txn logic is non-trivial, especially if you're just using some option in your ORM.
This all ties in with my doubt that we *really* understand how SFU in CRDB will help avoid retries. I still think that it will, but we need to do our legwork to make sure this is reasonably straightforward. (I haven't caught up on the SFU RFC, so apologies if I'm beating a dead horse. I trust that you and Raphael will think this all through and have answers in the RFC).

Raphael 'kena' Poss

Nov 1, 2019, 8:29:19 AM
to Tobias Grieger, Andrei Matei, Ben Darnell, CockroachDB, KV, Spencer Kimball
On 01-11-2019 11:20, Tobias Grieger wrote:
(I haven't caught up on the SFU RFC, so apologies if I'm beating a dead horse. I trust that you and Raphael will think this all through and have answers in the RFC).

wut?

I don't know of an SFU RFC, much less that I was involved in one. When is this due?

(My hands are a bit full with savepoints at this moment.)

In the context of savepoints I am currently merely drawing up an inventory of the various kinds of errors and what txn state can be restored during a savepoint rollback (partial txn rollback).

There will be "interesting" questions when SFU work starts _after_ savepoints are a thing, but I hope I can keep that out of scope at least for one more month.


-- 
Raphael 'kena' Poss

Tobias Grieger

Nov 1, 2019, 8:43:00 AM
to Raphael 'kena' Poss, Andrei Matei, Ben Darnell, CockroachDB, KV, Spencer Kimball
Sorry Raphael, interesting mix-up on my end. Understanding SFU is Nathan and myself.

Sumeer Bhola

Nov 3, 2019, 1:49:25 PM
to Tobias Grieger, Nathan VanBenschoten, Ben Darnell, Andrei Matei, CockroachDB, KV, Spencer Kimball
This thread reminded me of a related question that I had forgotten to ask, about what timestamp we use when updating the timestamp cache.

First, making sure my mental model is not broken, wrt advancing the timestamp of a transaction eagerly (and read refreshing or retrying) or deferring it:
2 kinds of timestamp events that cause txn timestamp to need to advance:
- WriteTooOld errors are because a write exists at a higher timestamp. Typically the txn has done a read of that key previously, so eagerly retry since the previous read is invalid at the higher timestamp, and eager retry possibly allows for auto retry (and not a client-side retry).
- Pushes caused by trying to write below the timestamp in the timestamp cache: we defer handling of pushes since these pushes happen more often, and if we push eagerly we have to refresh all preceding reads which is potentially O(n^2) with n reads in a txn. So we hope for the best (in the absence of read locks) that read refreshes will succeed at commit time.
Is that correct?

Say the txn tried to write to key k1, at timestamp t1, and needs to advance to t2 to do the write (WriteTooOld or push), the following may have happened previously in this txn:
- reads of other keys (k2): If hold read lock we know a read refresh will succeed. If don't have read locks we could update timestamp caches to t2 for these reads to make it more likely that the read refresh will succeed in the future (this is a best-effort update that can happen concurrently with the transaction continuing to proceed).
- write of k2: we have a write intent preventing later writes so doesn't require eager handling.
- read of k1: if there is a write by some other transaction in [t1, t2] the txn should eagerly retry since the read refresh will fail. The closed timestamp example above seems to be one where < t2 is closed but there isn't a write in the [t1, t2] interval, so it is unclear to me why this requires eager handling. Also, I don't quite understand how read locks work with follower reads, so I am going to pretend that read refreshes are not cheap, so would like to defer them unless necessary.

If we deferred the read refreshes (and we don't have read locks), for future reads this transaction could do the following:
- continue reading at t1 but confirm that there was no write by another transaction in [t1, t2]. If the latter is not true, it needs to be restarted. Based on reading mvcc.go and related stuff, I don't think we look for the absence of writes in [t1, t2], but I could very well be mistaken (reads are using Transaction.MaxTimestamp and Transaction.OrigTimestamp, but there is also some use of Transaction.RefreshedTimestamp which I did not fully understand).
- if the previous step is ok, advance timestamp cache to t2 (not t1). I couldn't figure out by looking at the tsCache.Cache.Add() calls whether we do this.


Tobias Grieger

Nov 4, 2019, 5:07:25 AM
to Sumeer Bhola, Nathan VanBenschoten, Ben Darnell, Andrei Matei, CockroachDB, KV, Spencer Kimball
> This thread reminded me of a related question that I had forgotten to ask, about what timestamp we use when updating the timestamp cache.
> First, making sure my mental model is not broken, wrt advancing the timestamp of a transaction eagerly (and read refreshing or retrying) or deferring it:

Reality in today's code is ...muddled. For example, we use the *write* timestamp to update the read timestamp cache, where it seems like we should instead use the read timestamp[1].
There's also ReadWithinUncertaintyIntervalError, which in practice works out like a WriteTooOldError in the sense that the read timestamp needs to move up.


> - WriteTooOld errors are because a write exists at a higher timestamp. Typically the txn has done a read of that key previously, so eagerly retry since the previous read is invalid at the higher timestamp, and eager retry possibly allows for auto retry (and not a client-side retry).

Correct. We have a few options:
1a) lay down an intent (read from the original read timestamp, won't be able to commit this intent) and restart at the end (i.e. the thing we generally don't seem to like any more)
1b) error out right away, i.e. don't even lay down an intent
1c) lay down an intent, but then restart immediately
2) evaluate the command with an updated read (and write) timestamp so that it sees the latest write. If the value changes in the process, the caller has to refresh before returning the result to the client.

It's worth pointing out in 1a) that if the key is never read by the txn, the intent we're laying down can commit, and we don't have to force the refresh right there; we could go optimistic and refresh once at the end. I *think* (but am not sure) that SQL uses `CPut` most of the time though, where this is never possible (since it's a read also). If we find that we're doing "pure" writes frequently, we should be able to optimize this - we track the refresh spans, which can give us false positives on whether we've previously read a given key, but never false negatives.
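As a concrete illustration of using the tracked refresh spans to make that call, here's a toy Go sketch (a linear scan over simplified spans; the real client keeps roachpb.Span bookkeeping, so treat the names as hypothetical):

package main

import "fmt"

// span is a simplified [start, end) key span standing in for the refresh spans
// the client tracks.
type span struct{ start, end string }

// maybeRead reports whether the txn may have read key: an overlap can be a
// false positive (a ranged read that skipped the key), but a miss is never a
// false negative.
func maybeRead(readSpans []span, key string) bool {
	for _, s := range readSpans {
		if key >= s.start && key < s.end {
			return true
		}
	}
	return false
}

func main() {
	spans := []span{{"a", "c"}, {"k1", "k1\x00"}} // a ranged read plus a point read of k1
	// A WriteTooOld on "k1" dooms the refresh (we read it); one on "x" is a pure write.
	fmt.Println(maybeRead(spans, "k1"), maybeRead(spans, "x")) // true false
}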

I am apprehensive about choosing 1b) since that will lead to starvation. Andrei's suggestion is really 1c) which works for me.


> - Pushes caused by trying to write below the timestamp in the timestamp cache: we defer handling of pushes since these pushes happen more often, and if we push eagerly we have to refresh all preceding reads which is potentially O(n^2) with n reads in a txn. So we hope for the best (in the absence of read locks) that read refreshes will succeed at commit time.
> Is that correct?

I think this is correct. The refresh handler activates on transaction retry errors[2], so unless we're turning pushed write timestamps eagerly into those at the replica level (which I don't recall us doing), we're deferring to commit time. Though I'm not sure if we're maybe going to the leaseholder to find out about the error (thus wasting a round trip), instead of refreshing eagerly?


> Say the txn tried to write to key k1, at timestamp t1, and needs to advance to t2 to do the write (WriteTooOld or push), the following may have happened previously in this txn:
> - reads of other keys (k2): If hold read lock we know a read refresh will succeed. If don't have read locks we could update timestamp caches to t2 for these reads to make it more likely that the read refresh will succeed in the future (this is a best-effort update that can happen concurrently with the transaction continuing to proceed).

A refresh updates the timestamp cache, so if we're sending async timestamp cache updates, we might as well do the refresh eagerly (but in the background). The main reason we haven't explored this yet is because listening for the results of async requests requires changes to the transaction gateway, and because we haven't convinced ourselves that it's overall a net win to add it. But it's certainly an option.

I got a little confused reading through your example but overall I think the gist is right.


> - write of k2: we have a write intent preventing later writes so doesn't require eager handling.

The intent also prevents later reads, which means that we'll be able to refresh this key.


> - read of k1: if there is a write by some other transaction in [t1, t2] the txn should eagerly retry since the read refresh will fail.

Correct.


> The closed timestamp example above seems to be one where < t2 is closed but there isn't a write in the [t1, t2] interval, so it is unclear to me why this requires eager handling.

It does not require eager handling; we can do it eagerly or not. I don't think we have a good reason to believe one to be better than the other (in the general case), though there's a strong reason to defer these when we optimize for global transactions.


> Also, I don't quite understand how read locks work with follower reads, so I am going to pretend that read refreshes are not cheap, so would like to defer them unless necessary.

I didn't mean to imply that refreshes become cheap as in free via read locks. They become cheaper because they only check a read lock as opposed to scanning all data in the refresh window. But the network latency stays the same, and that's where the real cost lies.
Nobody quite knows how read locks are going to work, though Nathan and I have a good idea. For the purposes of semantics, it's reasonable to think of them as a ranged intent that is ignored by read operations, which is sent to the leaseholder for replication async (unless the leaseholder is contacted directly anyway), similar to a pipelined write.


> If we deferred the read refreshes (and we don't have read locks), for future reads this transaction could do the following:
> - continue reading at t1 but confirm that there was no write by another transaction in [t1, t2]. If the latter is not true, it needs to be restarted. Based on reading mvcc.go and related stuff, I don't think we look for the absence of writes in [t1, t2], but I could very well be mistaken (reads are using Transaction.MaxTimestamp and Transaction.OrigTimestamp, but there is also some use of Transaction.RefreshedTimestamp which I did not fully understand).
> - if the previous step is ok, advance timestamp cache to t2 (not t1). I couldn't figure out by looking at the tsCache.Cache.Add() calls whether we do this.

Yes, we could do this. On writes, WriteTooOldError does exactly this (it assumes there's a read, which isn't always true), and it's not needed for correctness either (it used to be, under SNAPSHOT, but we don't have that any more); similarly for ReadWithinUncertaintyIntervalError, which protects `[OrigTimestamp, MaxTimestamp]`. I don't know if it's beneficial overall, though. If that check never does anything we've added sync work that could be parallelized before the commit. But we're also saving extra RPCs (i.e. removing work from the system), so it could be a net win either way.

[1]: https://github.com/cockroachdb/cockroach/blob/2c0e2ab3d2f2faa664c4cc809c9b990c2519a773/pkg/storage/replica_tscache.go#L50-L56
[2]: https://github.com/cockroachdb/cockroach/blob/82f0c4f8c9313663b9d84c7a157c8eb9ed851feb/pkg/roachpb/data.go#L1276-L1280

Tobias Grieger

Nov 4, 2019, 5:30:07 AM
to Sumeer Bhola, Nathan VanBenschoten, Ben Darnell, Andrei Matei, CockroachDB, KV, Spencer Kimball
Andrei, if you remove WriteTooOld but continue laying down an intent before the restart to avoid starvation, what is your plan for doing that? It seems unsatisfying to return WriteTooOldError but to lay down the intent at the same time because then we'll have to distinguish between errors that do and those that don't throughout the Replica code. Similar to points I've made earlier and the footnote in your email, communicating transaction state through errors seems awkward there.
When you take a step back, the WriteTooOld flag really is "you will fail the refresh". A ReadWithinUncertaintyIntervalError communicates the same. It seems more natural to remove WriteTooOldError and ReadWithinUncertaintyIntervalError instead of removing the WriteTooOld flag (which we'll rename instead).

We want the gateway to run its transaction, and with each response it can decide what to do (before showing the result to the client). Your suggestion is "keep running until we learn that we'll fail a refresh", which it can easily implement by looking at the read and write timestamps as well as the flag; deciding instead to run the transaction all the way through is also simple to implement. (The code in EndTransaction goes away anyway).

To add a devil's advocate argument, we're returning one batch of results the client won't use. (This last argument will fall apart to some extent as we get into optimistic reads and/or writes; it'll necessarily be more common to do extra work in the unhappy case). One other argument for keeping the errors is that they can be counted and shown to clients, to maybe explain their transaction to them. (Though this is better done by keeping a log at the gateway that interprets the state changes.)

Andrei Matei

Nov 4, 2019, 11:01:28 AM
to Tobias Grieger, Sumeer Bhola, Nathan VanBenschoten, Ben Darnell, CockroachDB, KV, Spencer Kimball
Andrei, if you remove WriteTooOld but continue laying down an intent before the restart to avoid starvation, what is your plan for doing that? It seems unsatisfying to return WriteTooOldError but to lay down the intent at the same time because then we'll have to distinguish between errors that do and those that don't throughout the Replica code.

My plan, to the extent I had any, was to modify the code at the highest level of evaluation, around here, to recognize WriteTooOldError and not reset the Replicated eval result (so that the intent gets written). I hope at that level I can then both return an error and a non-trivial evaluation result.

 
Similar to points I've made earlier and the footnote in your email, communicating transaction state through errors seems awkward there.
When you take a step back, the WriteTooOld flag really is "you will fail the refresh".

Well, not exactly. In fact, currently we do refresh on WriteTooOldErrors, although I had added a note about the refresh being likely to fail. If the write was a blind write from the client txn's perspective, the refresh can succeed. So another, orthogonal, improvement I'm intending to make is using the client's data about previous reads to determine when a refresh caused by a wto condition is doomed to fail.
 
A ReadWithinUncertaintyIntervalError communicates the same. It seems more natural to remove WriteTooOldError and ReadWithinUncertaintyIntervalError instead of removing the WriteTooOld flag (which we'll rename instead). We want the gateway to run its transaction, and with each response it can decide what to do (before showing the result to the client). Your suggestion is "keep running until we learn that we'll fail a refresh", which it can easily implement by looking at the read and write timestamps as well as the flag; deciding instead to run the transaction all the way through is also simple to implement. (The code in EndTransaction goes away anyway). To add a devil's advocate argument, we're returning one batch of results the client won't use. (This last argument will fall apart to some extent as we get into optimistic reads and/or writes; it'll necessarily be more common to do extra work in the unhappy case). One other argument for keeping the errors is that they can be counted and shown to clients, to maybe explain their transaction to them. (Though this is better done by keeping a log at the gateway that interprets the state changes.)

I'll think about this some more. You say communicating txn state updates through errors is awkward. The way I see it, the alternative - communicating them through an updated txn record - is even more awkward. That txn proto, as it stands today, is a big bag of fields and it's unclear which ones are updated by the client, by the server, by both, etc. But we also already have the Timestamp field which is somewhat redundant with the WriteTooOld one (as Nathan also points out), so... I dunno, I'll think more about what I think is cleanest.

Tobias Grieger

Nov 4, 2019, 11:49:52 AM
to Andrei Matei, Sumeer Bhola, Nathan VanBenschoten, Ben Darnell, CockroachDB, KV, Spencer Kimball
> The way I see it, the alternative - communicating them through an updated txn record - is even more awkward

I 100% agree with that. You've pushed for returning a "delta" instead in the past and that's what I would prefer in the medium term.
I'm also not sure I have all of the answers here. There are multiple concerns:

- communicate "unstructured" errors (context canceled, ...)
- communicate txn state changes (ReadWithinUncertaintyInterval, BatchTimestampBeforeGCThresholdError, WriteTooOldError)
- communicate KV read results (ConditionFailedError)
- don't do extra DistSender work on some errors (right now: all errors), but also:
- keep refresh spans from partial execution (my favorite pet peeve with the txn refresher)

It seems that a lot of confusion and FUD would go away if we had clear distinctions between classes of information exchanges. I'd hazard a guess that Raphael would agree that the status quo is quite confusing.

> I hope at that level I can then both return an error and a non-trivial evaluation result.

I hope this turns out better than it sounds.

Unrelated performance nugget that Andrei and I just chatted about privately: option 2), where it's possible, has an advantage because by not rewriting the intent we avoid pipeline stalling ourselves. This can help improve performance in contending workloads. Naively, each write would go through two rounds of consensus on the contended key (on top of its other refresh latencies) and the second round stalls on the first. If we write only once in the right place, we avoid this double stall.

Andrei Matei

Nov 23, 2019, 4:26:02 PM
to Tobias Grieger, Sumeer Bhola, Nathan VanBenschoten, Ben Darnell, CockroachDB, KV, Spencer Kimball
I'm back here, and I think my thinking has evolved.
Tobi, at some point above you were rationalizing, and enumerated some options around dealing 

Correct. We have a few options:
1a) lay down an intent (read from the original read timestamp, won't be able to commit this intent) and restart at the end (i.e. the thing we generally don't seem to like any more)
1b) error out right away, i.e. don't even lay down an intent
1c) lay down an intent, but then restart immediately
2) evaluate the command with an updated read (and write) timestamp so that it sees the latest write. If the value changes in the process, the caller has to refresh before returning the result to the client.

Andrei Matei

Nov 23, 2019, 6:39:30 PM
to Tobias Grieger, Sumeer Bhola, Nathan VanBenschoten, Ben Darnell, CockroachDB, KV, Spencer Kimball
(trying the last message again)

I'm back here, and my thinking has evolved. I'm finding it quite challenging to think about eager vs deferred refresh/restart in a structured way, particularly when the thinking intersects with CPut behavior. But this is my attempt.

Tobi, at some point above you were rationalizing, and enumerated some options for dealing with WriteTooOld conditions, either in general or in some specific circumstances. Let's unpack it.

Correct. We have a few options:
1a) lay down an intent (read from the original read timestamp, won't be able to commit this intent) and restart at the end (i.e. the thing we generally don't seem to like any more)
1b) error out right away, i.e. don't even lay down an intent
1c) lay down an intent, but then restart immediately
2) evaluate the command with an updated read (and write) timestamp so that it sees the latest write. If the value changes in the process, the caller has to refresh before returning the result to the client.

I'm now thinking that option 1c) is sane sometimes, but other times option 2) is the only sane thing to do.

Let's discuss 3 distinct scenarios:

Scenario A: a write of key k (Put) that encounters a later write (i.e. "WriteTooOld"), and the key k had not been previously read by the txn. This is truly a blind write.
In this case, the WriteTooOld condition is not proof that a refresh will fail later. So I now think that this should be handled like the other situations in which the write timestamp is advanced (i.e. by the read timestamp cache) - we should continue the transaction. Why wouldn't we? (With an intent in place, naturally - if we end up committing then we need that intent to commit, and if we end up restarting later this intent acts as a lock.)

Scenario B: a write of key k (Put) that encounters a later write (i.e. "WriteTooOld"), and the key k had been previously read by the txn. This is morally not a blind write.
In this case, the WriteTooOld condition is proof that a refresh will fail later. Resistance is futile. We should restart the transaction immediately. If we're lucky, we will be able to auto-retry (if we haven't streamed any results for the txn to the client yet). If the client will auto-retry (i.e. in case of SQL transactions, if we're likely to do auto-retries or if the client is using the client-directed retries protocol (SAVEPOINT cockroach_restart)), then we should leave a lock behind (currently, an intent since intents are the only types of locks that we have; perhaps tomorrow we'll have other types of locks where the "value" of the lock doesn't matter - SFU style).
Leaving an intent behind means that, upon restart, we'll pipeline-stall ourselves. But I hope that, in the context of SFU, we can come up with a design for these locks such that we do not stall ourselves...

Scenario C: a CPut on key k that encounters a later write (more precisely, a write above the CPut's read timestamp). This is a read+write.
It seems to me that you can see scenario C as either scenario A or scenario B, depending on whether k had been read during the transaction before this request (let's call CPut + k not read before scenario C1, and CPut + k read before scenario C2). Regardless, one thing that now seems clear to me is that we should evaluate the request at an advanced read timestamp. Evaluating the request at the old timestamp (as we do now) is probably a bad idea. Let's say that the CPut's read ts is ts1 and it encounters a write at ts2 > ts1.
- if we evaluate at ts1, the result of the evaluation cannot be committed. However, returning such a result would allow the transaction to continue without an immediate refresh and to restart later, since it hasn't broken the "consistent snapshot" that the txn is reading. It would force us to either retry the CPut (if a refresh succeeds) or to restart the txn at some later point, however. And when we do either of these, we'll pipeline-stall ourselves. So, particularly if the refresh succeeds, the CPut is retried immediately and the pipeline stall is guaranteed.
- if we evaluate at ts2, then the result of the evaluation can be used in conjunction with a successful refresh. A successful refresh is not possible if k has been read before by the txn. So, my proposal is to discriminate for CPut between C1 and C2 (which we can do because we have the txn's read set). For C1 (refresh possible), we evaluate the CPut at ts2 and force the client to refresh immediately. If it's C2 (refresh not possible), then whether we evaluate at ts1 or ts2 doesn't matter. We should leave a lock behind and restart the txn immediately.

In scenario C1, we want the server to return to the client both the result of evaluating the request, and information about the need to refresh immediately. In other words, we want to return a speculative result. Which suggests that my quest to get rid of the WriteTooOld flag was misguided all along. What we want to get rid of instead is the WriteTooOldError (and the ReadWithinUncertaintyIntervalError). I now think we should introduce the idea of speculative results as first-class citizens. The exact way in which a result is to be marked as speculative I'm not sure yet, but it seems that using any sort of error for that is the wrong way to go.
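To sketch what a first-class speculative result might look like on the gateway side (all names here are hypothetical, none of this is the actual roachpb API):

package main

import "fmt"

// batchResult is a toy response carrying both the evaluated rows and a marker
// saying the txn must refresh before the result may be surfaced.
type batchResult struct {
	rows            []string
	evaluatedAtTS   int64 // timestamp the request was actually evaluated at
	requiresRefresh bool  // set when evaluatedAtTS is above the txn's read timestamp
}

// handleResult is the gateway-side reaction: refresh first, and only surface
// the speculative rows if the refresh succeeds; otherwise restart the txn.
func handleResult(r batchResult, refresh func(toTS int64) bool) ([]string, error) {
	if r.requiresRefresh && !refresh(r.evaluatedAtTS) {
		return nil, fmt.Errorf("refresh to ts=%d failed: restart the transaction", r.evaluatedAtTS)
	}
	return r.rows, nil
}

func main() {
	r := batchResult{rows: []string{"v@ts2"}, evaluatedAtTS: 2, requiresRefresh: true}
	rows, err := handleResult(r, func(toTS int64) bool { return true /* pretend the refresh succeeds */ })
	fmt.Println(rows, err)
}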

To deal with the chaos in this thread full of tradeoffs between optimism and pessimism, here are the principles I propose (a rough Go sketch of the KV-side decision follows the lists):

At the KV level:
  1. Greedily try to avoid txn restarts. Meaning, if there's a chance that by refreshing eagerly we avoid a guaranteed later restart, let's take that option.
  2. If we don't have particular reason to believe that a refresh will fail, let's assume that it will succeed.
    This leads to some decisions:
    a) Don't refresh eagerly in the case of a write running into a timestamp cache or a blind-write (scenario A). We can refresh later after we possibly get pushed some more, and hopefully we'll succeed just the same.
    b) (in conjunction with point 1) above) Force a refresh in scenario C1. Meaning, execute the CPut speculatively at ts2, which forces us to refresh immediately. If the refresh is successful, the CPut doesn't need to be retried.
        Note that, if we were to think that refreshes mostly fail, we'd be better off not doing speculative execution, instead preferring to retry the CPut if the refresh succeeds and keeping the door open to continuing the txn if the refresh fails.
    c) Return speculative results and force immediate refreshes for ReadWithinUncertaintyInterval conditions. If the refresh is successful, the read doesn't need to be retried.
        Similarly to b), if we were pessimistic about the refresh's chances, we'd be better off not doing speculation and cutting the read's execution short, returning an error, and opting to retry the read if the refresh is successful.
  3. When we know that refreshing is impossible, restart eagerly. Meaning, restart eagerly in scenarios B and C2.

At the SQL level:
  1. While we can auto-retry, let's prefer restarting the transaction to risking losing the ability to do automatic restarts in favor of acquiring more locks. Meaning, at the end of a statement, if we haven't streamed results back to the client, do eager refreshes if the txn's write timestamp has diverged from the txn's read timestamp, and, if the refresh failed, restart eagerly (this latter part is equivalent to point 3) above).
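Here is the rough Go sketch of the KV-side decision promised above - a toy dispatch over the scenarios, not actual CRDB code; the classification is the part that matters:

package main

import "fmt"

type action int

const (
	continueTxn    action = iota // scenario A: blind write, keep going, refresh later if needed
	refreshNow                   // scenario C1: evaluate at the bumped ts, force an immediate refresh
	restartEagerly               // scenarios B and C2: the refresh is doomed, restart right away
)

// decide classifies a WriteTooOld condition on key k using only information
// the client already has: whether the request also reads k (CPut-style), and
// whether k shows up in the tracked read spans.
func decide(requestAlsoReads, keyPreviouslyRead bool) action {
	switch {
	case keyPreviouslyRead:
		return restartEagerly // B / C2: we've read k, so a refresh over it must fail
	case requestAlsoReads:
		return refreshNow // C1: evaluate speculatively at the new ts, refresh the rest
	default:
		return continueTxn // A: a true blind write
	}
}

func main() {
	fmt.Println(decide(false, false), decide(true, false), decide(true, true)) // 0 1 2
}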
- a_m