Resolving Write Intents when Txn Coordinator fails

Unmesh Joshi

unread,

Sep 7, 2021, 11:59:46 AM9/7/21

to cockroach-team

Hi,

I was going through transaction implementation in CockroachDb. I think that the transaction implementation here is interesting with usage of Write Intents and Transaction record liveness managed by heartbeats. One of the missing links for me is to figure out what happens when a transaction coordinator fails. (The TxnCoordinatorSender responsible for sending heartbeats). The Transaction record will timeout. But who triggers the resolution of the write intent? More importantly, what if the coordinator fails before write intents are solved?

Thanks,

Unmesh

Andrei Matei

unread,

Sep 7, 2021, 2:42:54 PM9/7/21

to Unmesh Joshi, cockroach-team

Hey,

Cleaning up abandoned intents is done by conflicting readers and writers. Whenever someone tries to access a key on which there is an intent, the intent needs to be "resolved" before the respective access can proceed. As a last resort, the garbage collection queue will come around every few hours and clean up all intents that nobody else ran into already (this queue will also delete the abandoned transaction records).

The process of resolving an intent goes in two stages:
1) The status of the intent's transaction needs to be determined. This is done through the PushTxnRequest. This request can observe that a transaction record is expired because it hasn't been heartbeated recently (*) and move the transaction record to the ABORTED state.

2) The ResolveIntentRequest is the one that either turns an intent into either a committed value, or deletes it depending on the txn's status.

(*) The heartbeat is not the only signal taken into consideration. If the intent triggering the resolution process is recent, that is also proof that the coordinator is still alive (or at least was alive recently).

Hope this helps,

- Andrei

--
You received this message because you are subscribed to the Google Groups "CockroachDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cockroach-db...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cockroach-db/CAOk%2BzfcLGOwYqcTpGHpdT%3DkHdmJ94tTD2SD_FiEr%2BvBUNe3sDQ%40mail.gmail.com.

Unmesh Joshi

unread,

Sep 8, 2021, 9:39:59 AM9/8/21

to Andrei Matei, cockroach-team

Thanks, this is useful. So the TxnCoordSender which detects an incomplete write intent. it will first complete that previous transaction and then restart the ongoing transaction right?

Consider the following scenario.

TxnCoord1

ID txn1

write 'name' 'Alice'

write 'title' 'Nitroservices'

while committing, if txn1 fails after storing the Transaction record with state 'committed', but before resolving the write intent.

TxnCoord2

ID txn2

write 'color' 'blue'

write 'title' 'Microservices'

In this case the write to 'title' detects an unresolved write intent. So it needs to do two things.

1. Detect the status of the transaction txn1 by inspecting the Transaction record. (In this case it was committed). So send PushTxnRequest to the key range owning the Transaction record.
2. Resolve the write intent.

After this is done, restart txn2.

Looking at the code, it looks like this is how it will work. Right?

So the failed coordinator case is handled by completing the incomplete transactions by the coordinator responsible for subsequent transactions.

Thanks,

Unmesh

Andrei Matei

unread,

Sep 8, 2021, 10:37:56 AM9/8/21

to Unmesh Joshi, cockroach-team

On Wed, Sep 8, 2021 at 7:50 AM Unmesh Joshi <unmes...@gmail.com> wrote:

Thanks, this is useful. So the TxnCoordSender which detects an incomplete write intent. it will first complete that previous transaction and then restart the ongoing transaction right?

Consider the following scenario.

TxnCoord1
ID txn1
write 'name' 'Alice'
write 'title' 'Nitroservices'

while committing, if txn1 fails after storing the Transaction record with state 'committed', but before resolving the write intent.

TxnCoord2
ID txn2
write 'color' 'blue'
write 'title' 'Microservices'

In this case the write to 'title' detects an unresolved write intent. So it needs to do two things.

1. Detect the status of the transaction txn1 by inspecting the Transaction record. (In this case it was committed). So send PushTxnRequest to the key range owning the Transaction record.
2. Resolve the write intent.

After this is done, restart txn2.

Looking at the code, it looks like this is how it will work. Right?

That's right. A few details:

"Detecting the status of the transaction" needs to handle a few cases:

1) There is a COMMITTED txn record. This is a trivial case -> the txn is committed and its intents can be committed.

2) There is an ABORTED txn record. This is a trivial case -> the txn is aborted and its intents can be deleted.

3) There is a PENDING txn record. The pusher decides whether the txn should be considered abandoned and, if it is, it moves the txn record to ABORTED.

4) There is no txn record because it hasn't been written yet (a PENDING txn record is only written 1s into the life of a txn, as an optimization). This is morally equivalent to the case above - the pusher has a choice of whether the txn should be considered abandoned. If it decides that it is abandoned, then the pusher needs to ensure that the txn will not become committed later. For technical reasons it does not do this by creating an aborted txn record; instead, it populates a marker in the in-memory Timestamp Cache structure.

5) There is a STAGING txn record. This case is the trickiest: in this case the txn is already either committed or aborted, but the pusher doesn't know which one it is. In order to discover that, it engages in the "transaction recovery protocol", which involves reading some of the transaction's intents to check if they successfully replicated. Once it figures it out, it moves the txn record to the resulting state.

In your example, you've noted that txn2 restarts after resolving the conflicting intent. That's not necessarily the case (in fact, it commonly isn't). Whether txn2 needs to restart or not depends on how its provisional timestamp is ordered versus the timestamp of txn1. If txn2 is writing at a higher timestamp than txn1, there's no reason for the restart; the txn2 simply overwrites the "title" key and continues.

- Andrei

Unmesh Joshi

unread,

Sep 13, 2021, 10:04:15 AM9/13/21

to Andrei Matei, cockroach-team

Hi Andrei,

Parallel commit and the Transaction record in STAGING state is interesting. I see that before the parallel commit feature was implemented, there was no need for storing keys for all the writes in the Transaction record and the write intents could be resolved for each key independently by the future transactions touching a specific key. Not sure if there were some subtle issues with this? Particularly with interleaving read-write and read-only transactions? (Possibly not, depending on when the push transaction requests are considered successful. If they always succeed for the Transaction record which is marked either Aborted or Committed by earlier PushTxnRequest, the write intents will always be resolved)

I see that parallel commit was done specifically to reduce the latency caused by waiting for two round trips (one for writing the transaction record and one for the actual key value (which happens in parallel for multiple keys)).

The same could be achieved if the Transaction record keeps track of all the keys, and adding the state PREPARING_TO_COMMIT / ABORT before COMMIT, and then delegate the responsibility of resolving the intents to the leader of the range holding the Transaction record? The leader of that range can always keep trying sending ResolveWriteIntent requests. It can also keep track of transactions which are preparing to commit or preparing to abort when the leader failover happens, and keep sending the ResolveWriteIntent requests. The use of the STAGING state does something similar I guess, But I was just thinking, what if the range holding the Transaction record acts as a transaction coordinator?

Thanks,

Unmesh

Andrei Matei

unread,

Sep 13, 2021, 11:22:27 AM9/13/21

to Unmesh Joshi, cockroach-team

On Mon, Sep 13, 2021 at 9:19 AM Unmesh Joshi <unmes...@gmail.com> wrote:

Hi Andrei,

Parallel commit and the Transaction record in STAGING state is interesting. I see that before the parallel commit feature was implemented, there was no need for storing keys for all the writes in the Transaction record and the write intents could be resolved for each key independently by the future transactions touching a specific key.

That doesn't sound right. I think we always had the lock_spans field (used to be called intent_spans) in the Transaction proto. I think the transaction record for a finalized transaction always contained enough information to cleanup all intents.

Not sure if there were some subtle issues with this? Particularly with interleaving read-write and read-only transactions? (Possibly not, depending on when the push transaction requests are considered successful. If they always succeed for the Transaction record which is marked either Aborted or Committed by earlier PushTxnRequest, the write intents will always be resolved)

I'm sorry but I'm not sure what you're saying / asking here.

I see that parallel commit was done specifically to reduce the latency caused by waiting for two round trips (one for writing the transaction record and one for the actual key value (which happens in parallel for multiple keys)).

The same could be achieved if the Transaction record keeps track of all the keys, and adding the state PREPARING_TO_COMMIT / ABORT before COMMIT, and then delegate the responsibility of resolving the intents to the leader of the range holding the Transaction record? The leader of that range can always keep trying sending ResolveWriteIntent requests. It can also keep track of transactions which are preparing to commit or preparing to abort when the leader failover happens, and keep sending the ResolveWriteIntent requests. The use of the STAGING state does something similar I guess, But I was just thinking, what if the range holding the Transaction record acts as a transaction coordinator?

"What if" questions are hard to answer without more description of what the goal is. One thing to keep in mind that perhaps helps is that, on the happy path, the transaction coordinator (i.e. the TxnCoordSender, living on the gateway node processing a SQL transaction) is the first actor to become aware about whether a STAGING transaction is actually committed or not, because it needs to figure out that in order the return a commit acknowledgement to the client application. So, since it needs to figure this out anyway (which it does through QueryIntent queries), it's in a good position to deal with cleaning up intents and performing the STAGING->COMMITTED transition for the transaction record.

Cheers,

- Andrei

Unmesh Joshi

unread,

Sep 14, 2021, 8:12:06 AM9/14/21

to Andrei Matei, cockroach-team

On Mon, Sep 13, 2021 at 8:31 PM Andrei Matei <and...@cockroachlabs.com> wrote:

That doesn't sound right. I think we always had the lock_spans field (used to be called intent_spans) in the Transaction proto. I think the transaction record for a finalized transaction always contained enough information to cleanup all intents.

The intent spans will be populated only on the EndTransaction request. And I think they are important to make sure committed transaction records are not garbage collected before all the affecting intents are resolved? So if there is no GC, it is ok to just have a transaction record with transaction status, and all the pending write intents can be independently resolved by the subsequent (read or write) transactions which involve the keys of the pending intents.

Apologies, for some confusing questions/comments. But some of this is coming from my attempt to write code from scratch for a very thin thread functionality of two phase commit to demonstrate how it works in CockroachDB. I am using CockroachDB, Kafka, DeltaLake and Pivotal Gemfire (as an example of JTA) to build example code for two phase commit-like implementations to explain similarities and differences for the patterns I am working on.

Andrei Matei

unread,

Sep 14, 2021, 10:14:22 AM9/14/21

to Unmesh Joshi, cockroach-team

On Mon, Sep 13, 2021 at 10:53 PM Unmesh Joshi <unmes...@gmail.com> wrote:

On Mon, Sep 13, 2021 at 8:31 PM Andrei Matei <and...@cockroachlabs.com> wrote:

That doesn't sound right. I think we always had the lock_spans field (used to be called intent_spans) in the Transaction proto. I think the transaction record for a finalized transaction always contained enough information to cleanup all intents.

The intent spans will be populated only on the EndTransaction request. And I think they are important to make sure committed transaction records are not garbage collected before all the affecting intents are resolved? So if there is no GC, it is ok to just have a transaction record with transaction status, and all the pending write intents can be independently resolved by the subsequent (read or write) transactions which involve the keys of the pending intents.

I think this is right. So, yeah, you can see the STAGING state as requiring that the txn record store more info than before. Note that it doesn't need to store "all the writes", only the writes that were in-flight at the time when the STAGING record was written.

Reply all

Reply to author

Forward