Hibernating Rhinos Ltd
Oren Eini l CEO l Mobile: + 972-52-548-6969
Office: +972-4-622-7811 l Fax: +972-153-4-622-7811
--
You received this message because you are subscribed to the Google Groups "raft-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Hibernating Rhinos Ltd
Oren Eini l CEO l Mobile: + 972-52-548-6969
Office: +972-4-622-7811 l Fax: +972-153-4-622-7811
Hibernating Rhinos Ltd
Oren Eini l CEO l Mobile: + 972-52-548-6969
Office: +972-4-622-7811 l Fax: +972-153-4-622-7811
--
I still think that you ignore the fact that you have a cost _per_ message.In the case of RocksDB, flushing to the log is the expensive part.
Henrik, I think you are not quite right.Raft is consensus protocol, but it does not handle byzantine failure.
Imagine, you have three servers A,B and C.
B were a leader, and it had connectivity with A.
B wrote record R1-1 to itself and A, so B assumes R1-1 is commited. Perhaps B even informs A that R1-1 is commited.
But A didn't flush this knowledge to disk, and it "reboots".
Then A gain connectivity with C. C didn't know about R1-1, and A forgot about R1-1,so C can become leader for term 2, and commit record R2-1 (together with A).
And now you have split brain.
So, while theoretically there is no need in durable storage, practically it ishard to build durable system without durable storage.
On Tue, Jun 14, 2016 at 9:44 PM, Юрий Соколов <funny....@gmail.com> wrote:Henrik, I think you are not quite right.Raft is consensus protocol, but it does not handle byzantine failure.
Just to clarify, I haven't been talking about byzantine failure, nor does it seem that you are.
(https://en.wikipedia.org/wiki/Byzantine_fault_tolerance) Nor do I think trying to survive byzantine failures is a useful goal to set.
Imagine, you have three servers A,B and C.
B were a leader, and it had connectivity with A.
B wrote record R1-1 to itself and A, so B assumes R1-1 is commited. Perhaps B even informs A that R1-1 is commited.
But A didn't flush this knowledge to disk, and it "reboots".
Then A gain connectivity with C. C didn't know about R1-1, and A forgot about R1-1,so C can become leader for term 2, and commit record R2-1 (together with A).Yes, I'm well aware of this discussion, and you have simply affirmed my point.
- If A "reboots", then you would expect node A to shutdown cleanly and flush its state to disk before completely shutting down.
- What you intended to say is that A crashes (and then reboots). In other words you're describing a situation where a majority of nodes are unavailable. This is precisely what I said as well: If the requirement, and this is commonly the requirement, is to survive failure of a minority of nodes, then surviving the kind of failure you describe, is outside of requirements.
- The old school thinking in your argument is to think that data is somehow magically safe because it was written to a disk. If you replace "A reboots" with "A has a disk failure after which disk is replaced with a new disk" in your sequence above, then you still cannot survive the failure. Hence, the argument for relying only on the distributed consensus part for durability, is that disk failures needn't be treated as a special kind of failure. Disks are just like any other components that can fail.
And now you have split brain.Not if you implemented Raft correctly. Because B is not connected to a majority of the cluster, it must reject incoming requests. Hence it doesn't matter what state B is in.
So, while theoretically there is no need in durable storage, practically it ishard to build durable system without durable storage.Unfortunately, the "practical" world does not agree with this statement. In reality, pretty much all mainstream distributed databases have implementations that do not flush to disk on commit, either as the only option or as default option:
Only option: MySQL NDB Cluster
Default option: MySQL Galera Cluster, Riak, Cassandra, CouchbaseMongoDB by default is neither, but recent versions provide configurability to choose either option.I'm unaware of what Redis and Amazon DynamoDB do. And by this I believe I have covered all relevant DB products that have any kind of mainstream adoption. In reality then, aiming to survive minority failures is more than sufficient for achieving good HA, and the performance cost of flushing to disk is too big that people would actually be using it.
- The old school thinking in your argument is to think that data is somehow magically safe because it was written to a disk. If you replace "A reboots" with "A has a disk failure after which disk is replaced with a new disk" in your sequence above, then you still cannot survive the failure. Hence, the argument for relying only on the distributed consensus part for durability, is that disk failures needn't be treated as a special kind of failure. Disks are just like any other components that can fail.You are wrong. First: disk failure is such failure that system administrator should handle. And system administrator should not allow A to connect to cluster with wrong log.
But server may crash and quickly start without administrator inference (for example, if power supply fails for a part of a second), and if confirmed log records were synced to disk, it may safely connect to cluster.
And now you have split brain.Not if you implemented Raft correctly. Because B is not connected to a majority of the cluster, it must reject incoming requests. Hence it doesn't matter what state B is in.And you wrong again: B were connected to majority of the cluster - itself and A. It is majority! So Raft protocol allows B to assume it had committed R1-1.
Perhaps you mean C ? But C also connected to majority - itself and A, so it may commit R2-1.
This case could be fixed by not allowing A to connect to cluster after "failing", but it may lead to quick cluster degradation.
I've said "durable system". Not "practically used system".
Many users use "not durable" systems for performance.
...
So, you just confirmed my point: many "practical" systems are not "durable" by default, but theyallows to switch them in "durable" state.Lets just distinguish "durable" and "some users thinks it is durable enough", ok?
On Wed, Jun 15, 2016 at 3:18 PM, Юрий Соколов <funny....@gmail.com> wrote:- The old school thinking in your argument is to think that data is somehow magically safe because it was written to a disk. If you replace "A reboots" with "A has a disk failure after which disk is replaced with a new disk" in your sequence above, then you still cannot survive the failure. Hence, the argument for relying only on the distributed consensus part for durability, is that disk failures needn't be treated as a special kind of failure. Disks are just like any other components that can fail.You are wrong. First: disk failure is such failure that system administrator should handle. And system administrator should not allow A to connect to cluster with wrong log.
But server may crash and quickly start without administrator inference (for example, if power supply fails for a part of a second), and if confirmed log records were synced to disk, it may safely connect to cluster.As I'm explaining below, allowing A to connect with C does not cause split brain.It does of course cause R1-1 to be lost, I'm not arguing against that. I'm just saying that even when flushing the log to disk all the time, the world does still have failures where R1-1 can be lost. Hence, all that we can offer users is to be clear about what level of HA and durability a system offers. Neither your nor my opinion provides a 100% guarantee.
And now you have split brain.Not if you implemented Raft correctly. Because B is not connected to a majority of the cluster, it must reject incoming requests. Hence it doesn't matter what state B is in.And you wrong again: B were connected to majority of the cluster - itself and A. It is majority! So Raft protocol allows B to assume it had committed R1-1.
Perhaps you mean C ? But C also connected to majority - itself and A, so it may commit R2-1.Let me try to be clearer: B is no longer a primary, as it is not connected to a majority (neither A nor C, in fact).Your worrying is justified to the extent that there is a short lived anomaly in this situation: For a short period B will remain a primary, until it reaches its election timeout, and during this period clients can still connect to it and read R1-1. And after the election timeout is reached, R1-1 is no longer readable from anywhere.However, this loss of R1-1 is due to temporarily losing a majority of the cluster. Again, it is outside the scope of the given requirements.
This case could be fixed by not allowing A to connect to cluster after "failing", but it may lead to quick cluster degradation.Not necessary.
I've said "durable system". Not "practically used system".
Many users use "not durable" systems for performance....So, you just confirmed my point: many "practical" systems are not "durable" by default, but theyallows to switch them in "durable" state.Lets just distinguish "durable" and "some users thinks it is durable enough", ok?My point is twofold:- There is no argument that flushing the log is "more durable". However, even that is just an incremental improvement, it is naive (and IMO legacy) thinking to say it is perfectly durable, or even some kind of optimal state that an implementation must achieve.
- From a practical point of view, a system only implementing the more durable option, or even offering it as the default, is likely to see very little adoption. Offering both alternatives is of course ok. A minority of users will go for the more durable setting for sure.
PostgreSQL. It offers durable store by default. And rare user switches it to less durable option.Oracle, Microsoft SQL Server, IBM DB2.
Wish you never realize you are wrong.
Good day to you, Henrik.
--
You received this message because you are subscribed to a topic in the Google Groups "raft-dev" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/raft-dev/L3pPb1kH9hU/unsubscribe.
To unsubscribe from this group and all its topics, send an email to raft-dev+u...@googlegroups.com.