timestamp cache availability


Unmesh Joshi

Sep 30, 2021, 12:10:36 PM
to cockroach-team
Hi,

The timestamp cache is critical in providing the guarantee that no write happens below the latest read timestamp for a given key, so I imagined it would need to be highly available.
But I see that the timestamp cache is updated and checked before the client request is replicated through Raft. How is the availability of the timestamp cache maintained, then? What if the node holding the latest timestamp information crashes?
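Concretely, the invariant I have in mind is something like this minimal sketch (hypothetical types, not CockroachDB's actual structures):

package main

import "fmt"

type Timestamp int64 // stand-in for an HLC timestamp

type TimestampCache struct {
	lowWater Timestamp            // conservative floor for unknown/evicted keys
	maxRead  map[string]Timestamp // latest read timestamp per key
}

// CheckWrite rejects a write at or below the latest read timestamp for key.
func (c *TimestampCache) CheckWrite(key string, writeTS Timestamp) error {
	floor := c.lowWater
	if ts, ok := c.maxRead[key]; ok && ts > floor {
		floor = ts
	}
	if writeTS <= floor {
		return fmt.Errorf("write at %d at or below read timestamp %d; must be pushed", writeTS, floor)
	}
	return nil
}

func main() {
	c := &TimestampCache{lowWater: 10, maxRead: map[string]Timestamp{"a": 42}}
	fmt.Println(c.CheckWrite("a", 40)) // rejected: "a" was read at 42
	fmt.Println(c.CheckWrite("b", 40)) // <nil>: above the low-water mark
}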
I am sure I'm missing something obvious here.

Thanks,
Unmesh

Andrew Werner

Sep 30, 2021, 12:16:04 PM
to CockroachDB
You are correct that the timestamp cache does not get persisted. In fact, all range movement and lease transfers lead to the timestamp cache information being lost. Fortunately, the timestamp cache can be viewed as an optimization. If, say, we just assumed that the entire keyspace had just been read, we would avoid any risk of history rewriting. This would be a painful assumption because it would mean that all transactions which write more than once would probably be pushed.

That's the tradeoff we make in the face of lease transfers (which are relatively rare). It's also a tradeoff we make in the face of memory pressure. Memory is not infinite, so we periodically throw away timestamp cache data and assume a high-water mark for readers which we know to be safe. For lease transfers, we know that the lease start time is a safe timestamp.
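In sketch form (reusing the hypothetical TimestampCache type from the earlier sketch; illustrative, not our actual API):

// On a lease transfer, the new leaseholder starts with an empty cache whose
// low-water mark is the lease start time, so any read served under the old
// lease (whose cache entries were lost) is conservatively covered.
func NewCacheAfterLeaseTransfer(leaseStart Timestamp) *TimestampCache {
	return &TimestampCache{
		lowWater: leaseStart, // every new write must exceed the lease start
		maxRead:  make(map[string]Timestamp),
	}
}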

Unmesh Joshi

Oct 1, 2021, 8:20:03 AM
to Andrew Werner, CockroachDB
So maintaining the timestamp cache is more of an optimization, and it is OK to reinitialize it as long as the low-water mark is set to the server start/initialization time?
Could the alternative be to do all writes at the latest timestamp? Because each server's HLC is updated and adjusted on every request, both Get and Put, the timestamp assigned to a Put request is guaranteed to be later than that of any Get which happened before it.
I will recheck, but I believe Spanner writes (and even Percolator writes) happen this way, with writes done at the latest server timestamp and the coordinator picking the latest timestamp across the participating servers at commit time.
So I see two requirements (see the sketch after this list):
1. Each request (Get or Put) should adjust the server's HLC to the most up-to-date timestamp it has seen.
2. Each Put should advance the HLC and use the result as the timestamp of the Put, so the txn timestamp is advanced on every Put.

This way, could the write-follows-read and snapshot isolation guarantees be provided even without a timestamp cache?
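A minimal HLC sketch of those two requirements, assuming a simplified (wall, logical) pair rather than CockroachDB's actual hlc package:

package main

import (
	"fmt"
	"time"
)

// HLC is a simplified hybrid logical clock value.
type HLC struct {
	wall    int64 // physical component (nanoseconds)
	logical int32 // logical component to break ties
}

type Clock struct{ latest HLC }

// Update absorbs a timestamp observed on any request (Get or Put), so the
// local clock never falls behind what clients have seen (requirement 1).
func (c *Clock) Update(remote HLC) {
	if remote.wall > c.latest.wall ||
		(remote.wall == c.latest.wall && remote.logical > c.latest.logical) {
		c.latest = remote
	}
}

// Now advances the clock and returns a timestamp strictly greater than
// anything previously seen, to be used as a Put's timestamp (requirement 2).
func (c *Clock) Now() HLC {
	if w := time.Now().UnixNano(); w > c.latest.wall {
		c.latest = HLC{wall: w}
	} else {
		c.latest.logical++
	}
	return c.latest
}

func main() {
	var c Clock
	c.Update(HLC{wall: time.Now().UnixNano() + int64(time.Second)}) // fast peer
	fmt.Println(c.Now()) // strictly greater than the absorbed timestamp
}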

Unmesh Joshi

Oct 3, 2021, 7:09:37 PM
to Andrew Werner, CockroachDB
The essential question I am trying to figure out is: what if there were no timestamp cache and we advanced the write timestamp on every write?

Andrew Werner

Oct 3, 2021, 7:09:41 PM
to Unmesh Joshi, CockroachDB
Then it may be expensive to maintain serializability. If you only want snapshot isolation, then it's no problem; hence Percolator. An alternative choice is pessimism. If all writing transactions take read locks on their reads, then they can commit at whatever timestamp they want, because they know their reads won't have been invalidated. That's the approach taken by Spanner. Cockroach is more optimistic on reads: it doesn't take locks or maintain state outside the timestamp cache. The timestamp cache allows most transactions to commit at their original read timestamp. Transactions which commit at their original timestamp don't need to perform a "refresh" upon commit which ensures that reads have not been invalidated.

In Spanner, reads on behalf of read-write transactions take in-memory locks. When the transaction goes to commit, it broadcasts its buffered writes and concurrently checks that all of its locks still exist. This is almost a refresh, but it's cheaper because it doesn't need to go to disk. It almost makes up for the potentially higher latency due to the 2PC, which can't have been pipelined during execution.
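A hedged sketch of that pessimistic scheme (illustrative names, not Spanner's actual implementation):

package main

import "fmt"

type LockTable struct {
	readLocks map[string]string // key -> txn currently holding the read lock
}

// AcquireRead takes an in-memory read lock; a contended lock would make
// the caller block or abort in a real system.
func (lt *LockTable) AcquireRead(key, txn string) bool {
	if holder, ok := lt.readLocks[key]; ok && holder != txn {
		return false
	}
	lt.readLocks[key] = txn
	return true
}

// ValidateAtCommit confirms every read lock survived, so the reads cannot
// have been invalidated; the txn may then commit at whatever timestamp it wants.
func (lt *LockTable) ValidateAtCommit(txn string, keys []string) bool {
	for _, k := range keys {
		if lt.readLocks[k] != txn {
			return false // lock was broken; abort and retry
		}
	}
	return true
}

func main() {
	lt := &LockTable{readLocks: make(map[string]string)}
	lt.AcquireRead("x", "txn1")
	fmt.Println(lt.ValidateAtCommit("txn1", []string{"x"})) // true: safe to commit
}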

In practice, there’s tradeoffs. Pessimism is better if your workload is more contended. We contemplate adding pessimism in various ways. We probably wouldn’t get rid of the timestamp cache. I think the thing we’d consider doing less of is eager replication of uncommitted data. 

Unmesh Joshi

Oct 3, 2021, 7:09:43 PM
to Andrew Werner, CockroachDB
>> The timestamp cache allows most transactions to commit at their original read timestamp. Transactions which commit at their original timestamp don't need to perform a "refresh" upon commit which ensures that reads have not been invalidated.
I think the refresh is an optimization to avoid a full restart of the transaction? Otherwise, a write or commit could always check whether newer writes have happened between the original transaction timestamp and the timestamp of that particular write?
That way, the requirement to always do every write at an incremented HLC timestamp and to commit at the highest HLC value of all the writes would be enough to provide serializable isolation, because only one of the overlapping transactions will succeed and the others will always restart from the beginning?
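Something like this commit-time check (hypothetical MVCC interface, not CockroachDB's actual refresh code):

package main

import "fmt"

type MVCCStore interface {
	// LatestWrite returns the newest write timestamp for key, or 0 if none.
	LatestWrite(key string) int64
}

// CanRefresh reports whether the read set is still valid at commitTS, i.e.
// no key in it was written in (origTS, commitTS], so the transaction can
// commit there without a full restart.
func CanRefresh(store MVCCStore, readKeys []string, origTS, commitTS int64) bool {
	for _, k := range readKeys {
		if w := store.LatestWrite(k); w > origTS && w <= commitTS {
			return false // a conflicting write landed; restart needed
		}
	}
	return true
}

type fakeStore map[string]int64

func (s fakeStore) LatestWrite(k string) int64 { return s[k] }

func main() {
	s := fakeStore{"a": 15}
	fmt.Println(CanRefresh(s, []string{"a"}, 10, 20)) // false: write at 15
	fmt.Println(CanRefresh(s, []string{"a"}, 20, 30)) // true
}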
Thank you for all the answers on this forum; transactions and time in distributed systems are difficult to think about alone.
