Strongly consistent local reads question

Seyed Hossein Mortazavi

Jul 5, 2019, 11:51:41 PM

to CockroachDB

Hey everyone,

Reading the awesome blog post here:

https://www.cockroachlabs.com/blog/consensus-made-thrive/ and https://www.cs.cmu.edu/~dga/papers/leases-socc2014.pdf

I wonder if we could have local strongly consistent reads if a (modified) Raft leader would write to all nodes rather than a quorum of nodes and just block in case of failures. Am I missing something here with Raft?

Tobias Grieger

Jul 6, 2019, 2:20:20 PM

to Seyed Hossein Mortazavi, CockroachDB

Hi Seyed,

No, that sounds roughly right (you need writes to be applied, not just acked, so you're looking at extra round trips to communicate that), but it's no longer a highly available protocol, since any outage wedges the system. Also, Raft is sort of "too batteries included" to make this work. If you imagine a more modular consensus algorithm, you could have:

- a failure detection layer

- a (very highly available) replication configuration layer

- the actual replication layer

And then you could configure the replication layer to commit writes only after they are applied (i.e. visible to reads, which implies acked) on all replicas. If a replica fails, you use the failure detection layer to ensure that the failed replica can no longer serve reads (for example, via a time-based mechanism; note that the replica may be unreachable but still running, which can happen via network partitions, among other causes). Once that's ensured, you use the configuration layer to replace the failed replica and allow writes again.
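To make the write-all idea concrete, here is a minimal toy sketch in Python. It is not Raft and not anything CockroachDB does; `Replica`, `WriteAllGroup`, and the `reachable` flag (standing in for the failure detection layer) are all hypothetical names invented for illustration. It also ignores concurrency entirely: a real protocol would have to deal with the window in which a write has been applied on some replicas but not yet on others.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    # Stand-in for what a (hypothetical) failure detection layer reports.
    reachable: bool = True
    store: dict = field(default_factory=dict)

    def apply(self, key, value):
        self.store[key] = value

    def local_read(self, key):
        # Served locally, with no coordination with other replicas.
        return self.store.get(key)


class WriteAllGroup:
    """Toy group: a write commits only once every replica has applied it."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def write(self, key, value):
        # "Block" on failure: here we simply refuse the write.
        down = [r.name for r in self.replicas if not r.reachable]
        if down:
            raise RuntimeError(f"writes blocked, unreachable replicas: {down}")
        for r in self.replicas:
            r.apply(key, value)

    def replace(self, failed_name, new_replica):
        # Stand-in for the configuration layer. Assumes the failed replica
        # has already been fenced (e.g. its read lease expired), so it can
        # no longer serve stale local reads.
        survivor = next(r for r in self.replicas if r.reachable)
        new_replica.store = dict(survivor.store)
        self.replicas = [
            new_replica if r.name == failed_name else r for r in self.replicas
        ]
```

With three replicas, a single unreachable member makes `write` raise until `replace` swaps it out, which is exactly the availability cost mentioned above.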

There are other reasons why this won't work as easily in CockroachDB. One reason is the timestamp cache, which is a data structure that makes sure that writes cannot invalidate earlier reads. This is populated by incoming reads, so if more than one replica serves reads, you have to consistently replicate this data structure, which means you're back to a worse problem than you had initially. But it could work if you only want snapshot isolation instead of serializability (and CockroachDB's guarantees are actually a lot stronger than just serializability).
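A rough illustration of why the timestamp cache gets in the way, as a toy Python sketch. `TimestampCache`, `record_read`, and `min_write_ts` are hypothetical names, timestamps are plain integers, and this is not CockroachDB's actual implementation; it only models the idea that a write must be placed above the latest read of the same key.

```python
class TimestampCache:
    """Toy per-key high-water mark of read timestamps."""

    def __init__(self):
        self._max_read_ts = {}

    def record_read(self, key, ts):
        # Remember the highest timestamp at which this key was read.
        self._max_read_ts[key] = max(ts, self._max_read_ts.get(key, 0))

    def min_write_ts(self, key, proposed_ts):
        # Push the write above the latest recorded read of the key, so the
        # write cannot invalidate a read that was already served.
        last_read = self._max_read_ts.get(key, 0)
        return proposed_ts if proposed_ts > last_read else last_read + 1


# Two replicas, each with its own, unreplicated cache.
leader_cache = TimestampCache()
follower_cache = TimestampCache()

# A local read at timestamp 10 is served by the follower only.
follower_cache.record_read("k", 10)

# The leader never saw that read, so a write at timestamp 5 is not pushed
# and lands "underneath" the read.
unsafe_ts = leader_cache.min_write_ts("k", 5)

# Had the read been recorded where the write is evaluated, the write would
# have been pushed above it.
leader_cache.record_read("k", 10)
safe_ts = leader_cache.min_write_ts("k", 5)
```

Here `unsafe_ts` stays at 5 while `safe_ts` becomes 11, which is the gap that replicating the cache would have to close.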

You can probably tell at this point that this is a much bigger headache than "just" replicating to all replicas. Otherwise, chances are we'd be doing it already. :-)

Tobias

--
You received this message because you are subscribed to the Google Groups "CockroachDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cockroach-db...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cockroach-db/92db67aa-cf94-4373-b10a-6bd58e7f06d1%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Seyed Hossein Mortazavi

Jul 7, 2019, 12:28:35 AM

to CockroachDB

Thank you for your detailed response. What I'm thinking of is a research problem, so I'm more concerned about correctness issues. To confirm that I understood correctly: for writes, changing the quorum to all nodes isn't quite enough; I should also make sure that things are committed, and after that a local read would suffice.

I'm aware of the availability issue, which I assume is a limitation that follows from the CAP theorem. However, I think applications that have many more reads than writes can benefit from it.
