Brewer's CAP conjecture is that a networked system cannot be consistent,
available, and partition tolerant. A long description can be found at
http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
and a purported "proof" at
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf
The CAP conjecture, I am convinced, is false and can be proven false.
The key to the counterexample, and the flaw in the "proof", is that
each assumes that a system needs to be serializable to be consistent.
Serializability is a sufficient condition for consistency, but is not a
necessary condition. A counterexample, a system demonstrating that CAP
is false, has the following characteristics:
1. The system is transactional, i.e. all updates within a transaction
are tentative until committed atomically.
2. Inconsistency can be detected during execution of a transaction so
that a transaction that would otherwise introduce inconsistency is
prevented from committing.
3. Commits are executed in the context of a partition tolerant commit
algorithm (a transaction is not committed until it has been
reported as committable to all potential surviving partitions).
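Point 3 can be pictured as a simple predicate. This is a toy sketch under invented names (`can_commit`, the partition list, and the report map are all illustrative, not the actual commit protocol): a transaction commits only once every potential surviving partition has reported it committable.

```python
# Sketch of a partition-tolerant commit rule (illustrative only):
# a transaction may commit only when every potential surviving
# partition has reported it as committable.

def can_commit(txn_id, partitions, committable_reports):
    """committable_reports maps partition id -> set of transaction ids
    that partition has reported as committable."""
    return all(txn_id in committable_reports.get(p, set())
               for p in partitions)

partitions = ["east", "west"]
reports = {"east": {42}, "west": {42}}
assert can_commit(42, partitions, reports)      # all partitions agree
reports["west"].discard(42)
assert not can_commit(42, partitions, reports)  # commit must wait
```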
The only serious question is whether a system can be both
non-serializable and consistent, and it isn't much of a question given a
formal definition of consistency (concurrent transactions can't
overwrite each other, unique constraints must appear unique, etc.). It
is easier, cheaper, and bulletproof to serialize updates to specific
elements within a transaction than to try to serialize transactions
themselves. For example, every database record might have a
deterministic resolution agent that actively prevents concurrent updates
in a multi-version concurrency control (MVCC) database.
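One way to picture such a per-record agent is a version check that rejects any update based on a version the writer never saw. This is a toy sketch with invented names, not the actual NimbusDB design:

```python
# Illustrative per-record "resolution agent": each record carries a
# version number, and an update is rejected unless the writer read
# the current version, so concurrent updates cannot overwrite each other.

class ConflictError(Exception):
    pass

class Record:
    def __init__(self, value):
        self.value = value
        self.version = 0

    def update(self, new_value, seen_version):
        # Reject a write based on a version the writer never saw.
        if seen_version != self.version:
            raise ConflictError("record changed since it was read")
        self.value = new_value
        self.version += 1

r = Record("a")
v = r.version            # both writers read version 0
r.update("b", v)         # first concurrent update succeeds
try:
    r.update("c", v)     # second still holds version 0: actively prevented
except ConflictError:
    print("conflict detected")
```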
The quintessential problem is that academic computer science has been
conflating the concepts of consistent and serializable for so long that
it has forgotten that they are not the same thing. If you substitute
"serializable" for "consistent", the CAP conjecture becomes the SAP
conjecture, and is probably true. So the trick is to be CAP without
being a SAP.
The CAP conjecture has been a theoretical millstone around the neck of
all ACID systems. Good riddance.
This is the first wooden stake for the heart of the noSQL movement.
There are more coming.
--
Jim Starkey
Founder, NimbusDB, Inc.
978 526-1376
I had thought the whole point of the CAP theorem was to describe
fundamental tradeoffs. No one who truly understands the implications
of that theorem would argue that relaxing C would enable improved
levels of both A and P. This is what I tried to explain in my
"Designing for the Cloud" slides: http://www.slideshare.net/StuC/oopsla-cloud-workshop-designing-for-the-cloud-elastra
In other words, it's not that the conjecture is false, it's that
people need to check their assumptions in applying it. It's not a
black and white scenario where consistency kills you, full stop.
Where I agree with you is the NoSQL crowd's tendency, like with all
fads, to overgeneralize and to ignore evidence in front of their
noses. "Eventual consistency" a la Dynamo is only one way to skin the
scalability cat.
Stu
Sent from my iPhone
> --
> ~~~~~
> Register Today for Cloud Slam 2010 at official website - http://cloudslam10.com
> Posting guidelines: http://groups.google.ca/group/cloud-computing/web/frequently-asked-questions
> Follow us on Twitter http://twitter.com/cloudcomp_group or
> @cloudcomp_group
> Post Job/Resume at http://cloudjobs.net
> Buy 88 conference sessions and panels on cloud computing on DVD at http://www.amazon.com/gp/product/B002H07SEC
> , http://www.amazon.com/gp/product/B002H0IW1U or get instant access
> to downloadable versions at http://cloudslam09.com/content/registration-5.html
>
> ~~~~~
> You received this message because you are subscribed to the Google
> Groups "Cloud Computing" group.
> To post to this group, send email to cloud-c...@googlegroups.com
> To unsubscribe from this group, send email to cloud-computi...@googlegroups.com
Interesting and useful, Jim. I've always been leery of CAP, myself.
One question/comment:
Isn't it the case that your counterexample actually is serializable,
just not locally (within a transaction) serializable? Instead, it's
serializable at the transaction level.
Greg Pfister
http://perilsofparallel.blogspot.com/
On Feb 25, 12:38 pm, Jim Starkey <jstar...@nimbusdb.com> wrote:
> Brewer's CAP conjecture is that a networked system cannot be consistent,
> available, and partition tolerant. A long description can be found at
>
> http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
>
> and a purported "proof" at
>
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=...
a) NOSQL != !(SQL)
b) and so, !(NOSQL) != SQL
c) The unfortunate NOSQL term doesn't mean NoSQL (notice the capitalization). IMHO, what it says is that there are scenarios and characteristics which require less strict ACID, in various flavors (key-value store, document store, data structure store et al.)
d) Hence, saying the CAP theorem is false does not mean the end of the NOSQL debate
e) Granted, there is lots of rhetoric going on, but that is hardly a basis for sound theory or, for that matter, practice. For example, a blog [http://blogs.forrester.com/appdev/2010/02/nosql.html] talks about NoSQL. I have no clue why folks write about things without researching them thoroughly, and that is a problem. In fact the blogger is going to write a report - am not sure if it is a threat or a promise ;o)
f) From my perspective NOSQL has value in terms of scalability, internet-scale distributability, and resilience (the requirement of which is a byproduct of the internet infrastructure) ... I am not sure one would run Hadoop over Oracle (in its current form; I am aware that database vendors are adding MapReduce capability)
g) There is the NOSQL crowd and the NoSQL crowd, which of course, are not the same ...
Cheers
<k/>
My slightly deeper point is that "database theory" is based as much on
folklore as on actual provable results. The consequence of this is
that progress in computing -- particularly in cloud computing, which has
these issues in spades -- has been unnecessarily retarded by shallow
thinking. The case in point is database scalability (insert shameless
plug for talk here). If you conflate consistency with serializability
you can prove that a single database can't be run on multiple computers
with those computers executing in lock step. If you are willing to
knock over a sacred cow, it is apparent that a database can scale
elastically to meet an arbitrary demand and can be geographically
disperse, each of which the CAP conjecture ('tain't a theorem no more)
says can't be done.
This is the point in history where database systems leave the rails and
start to fly.
>
> In other words, it's not that the conjecture is false, it's that
> people need to check their assumptions in applying it. It's not a
> black and white scenario where consistency kills you, full stop.
It is black and white. The CAP conjecture is, in fact, false. The SAP
conjecture is a different story, but who should care about
serializability if consistency can be achieved other ways? If you don't
have transactions that can be rolled back, I'll concede some validity
for the CAP conjecture, but then in the absence of transactions it's
next to impossible to achieve consistency anyway. But on its face,
the CAP conjecture is just wrong.
I don't see serialization as a required implementation of
transactional consistency.
If P(artitioning) means loss of communication between transaction
participants, the transaction can only be rolled back and a new one
cannot be started. If you can run transactions without requiring the
partitions to be consistent (i.e., without others knowing you are
running a transaction), you aren't partitioned.
I think MVCC or not, all parts of the system need to agree
(=Consistency) on the values of the attributes being updated, before
commit ends.
So starting new transactions can work. But how do you commit without
propagating the knowledge about this among interested but partitioned
parties?
Sassa
On Feb 25, 7:38 pm, Jim Starkey <jstar...@nimbusdb.com> wrote:
> Brewer's CAP conjecture is that a networked system cannot be consistent,
> available, and partition tolerant. A long description can be found at
>
> http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
>
> and a purported "proof" at
>
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=...
I guess it depends on how you define consistency.
In my opinion, if C in CAP stands for Strong Consistency,
then the CAP theorem is correct. However, for relaxed
forms of consistency, e.g., eventual consistency,
http://en.wikipedia.org/wiki/Eventual_consistency
all three properties could be true at the same time.
Gabriel
Nice to see you're finally reading Lynch, Jim, only a month or two after
you clearly had no idea who she was or what the formal underpinnings of
CAP were. Unfortunately, you're a little premature in declaring victory
here. Refuting or limiting a proof is not the same as proving its
conclusion is false. If I say that 2+2=4 on Mondays, and you point out
that my proof is inapplicable six days of the week, you still haven't
proven that 2+2 equals anything but four on Tuesday through Sunday.
Similarly, even if Gilbert and Lynch's proof applies only to
serializable consistency, that does not preclude a separate and more
general proof of CAP. Most branches of mathematics are full of examples
where a theorem was proven for successively more general cases before it
was proven in its entirety.
Turning now to the substance of your supposed refutation, it seems not
to address the issue of when a transaction terminates - and it must
terminate in all cases to satisfy the Gilbert/Lynch definition of
availability (section 2.2). As soon as you need to ensure that
transactions terminate, you need to have a mechanism to coordinate
rollback of one or more transactions that might span multiple of your
"deterministic resolution agents" even in the presence of arbitrary
partitions. I'm sure many in the computer science community eagerly
await your explanation of how to solve that particular problem.
That's only the first problem that I noticed. I'll be glad to point out
more if you'd like. Someone might still manage to prove that there is
some useful set of definitions for which CAP is provably false, but it's
going to take an awful lot more than an "MVCC is magic" hand-wave. Be
careful you don't replace the C in CAP with Completely Ridiculous. ;)
My point is that there does not need to be a tradeoff at all. It is perfectly possible to construct a system that is consistent, available, and partition tolerant. It just can't be serializable.

Stuart Charlton wrote:
> Jim,
> I had thought the whole point of the CAP theorem was to describe fundamental tradeoffs. No one who truly understands the implications of that theorem would argue that relaxing C would enable improved levels of both A and P. This is what I tried to explain in my "Designing for the Cloud" slides: http://www.slideshare.net/StuC/oopsla-cloud-workshop-designing-for-the-cloud-elastra
Interesting! Can you define "consistency"?
In my opinion, the whole objective of "consistency" is to make reasoning about the application logic easy within a concurrent environment. Imagine our DB is a state machine and each user action triggers a state transition. Without such isolation, the possible combinations of user actions (from different users) are a combinatorial explosion which makes it almost impossible to reason about the application's behavior.
To simplify our analysis, we "flatten" the combinatorial explosion by disallowing user actions (within a logical unit of work) from interleaving with each other. In other words, the logical unit of work for each user runs as if it had exclusive access to all state. This is the foundation of serializability.
So I agree with you. From a pure theoretical perspective, "serializability" is a means to achieve the goal of "consistency". But practically, I don't see any better alternative.
Think about the difference between programming in a multi-threaded environment using locks and programming with Software Transactional Memory; you come to the same picture.
Rgds,
Ricky
----- Original Message ----
From: Jim Starkey <jsta...@nimbusdb.com>
To: cloud-c...@googlegroups.com
Sent: Thu, February 25, 2010 11:38:13 AM
Subject: [ Cloud Computing ] Brewer's CAP Conjecture is False
Brewer's CAP conjecture is that a networked system cannot be consistent,
available, and partition tolerant. A long description can be found at
http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
and a purported "proof" at
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf
The CAP conjecture, I am convinced, is false and can be proven false.
The counterexample to the conjecture and the flaw in the "proof" is that
each assumes that a system needs to be serializable to be consistent.
In other words, you give up availability. Brewer made exactly this
point on slide 16 of his 2000 PODC keynote
(http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf) where
he introduced the CAP Theorem. Congratulations, you've invented CP.
I don't believe there is -- or can be -- a universal definition of
consistency. Consistency must be what you declare it to be. In a
database context, consistency is defined by the set of constraint
declarations: A set of unique constraints, a set of referential
integrity constraints, constraints on domains, etc. Other than simple
rules like "a record can't update a version of a record it couldn't
see", consistency should be explicit.
>
> In my opinion, the whole objective of "consistency" is to make reasoning the application logic easy within a concurrent environment. Imagine our DB is a state machine and each user action is a trigger of a state transition. Without such isolation, the possible combination of user action (from different users) is a combinatorial explosion which make it almost impossible to reason the application's behavior.
>
It is easy to demonstrate that a serializable database has at every
point a single definitive state. A non-serializable database, even a
consistent one, may never have a single definitive state due to message
skew of commit messages. On each node, a transaction sees the set of
transactions that have been reported committed, but since commit
messages are delivered asynchronously, concurrent transactions on
different nodes will see different sets of committed transactions, hence
no definitive database state.
But given that there isn't a single definitive state, the model of state
machine doesn't quite apply.
> To simplify our analysis, we "flatten" the combinatorial explosion by disallowing user actions (within a logical unit of work) interlacing with each other. In other words, the logical unit of work for each user is as if it has exclusive access to all states. This is the fundamental of serializability.
>
> So I agree with you. From a pure theoretical perspective, "serializability" is a means to achieve the goal of "consistency". But practically, I don't see any other better alternatives.
>
MVCC allows a transaction a stable view of a volatile database,
subject to the rule that a transaction can't modify a version of a
record it didn't see. MVCC isn't serializable, but can be made to
enforce any arbitrary definition of consistency.
> f) From my perspective the NOSQL has value in terms of scalability, internet
> scale distributability, resilience (the requirement of which is a by product
> of the internet infrastructure) ... I am not sure one would run Hadoop over
> Oracle (in it's current form. I am aware that database vendors are adding the
> mapReduce capability)
Aren't we missing the other big motivations for NOSQL: extreme low latency
and extreme low cost (both update/insert cost and the cost of performing
aggregations/ queries) at massive scale - which I think is orthogonal to
CAP.
If Jim can give me those things as well as solving the CAP problem - then
sign me up!
--
Murray
Can you explain why MVCC is not serializable?
In MVCC, every transaction maintains a private copy of all data it has read and updated. At commit time, the transaction submits the versions of this read/write set and compares them against the versions stored in the DB. The commit succeeds if the versions match; otherwise the transaction fails.
Let's say there are two concurrent transactions T1 and T2 that overlap in time, and let's say T1 commits first (of course it succeeds). When T2 then commits, it will only succeed if its (Read union Write) set does not intersect with T1's Write set.
If T2 succeeds, then it is equivalent to T2 starting AFTER T1 commits, which is equivalent to some serial transaction order, either T1, T2 or T2, T1.
If T2 fails, then T2 needs to restart the transaction and read the latest updates from T1, which is equivalent to T1, T2.
So it is equivalent to some serial order of transaction execution; why is MVCC not serializable?
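The validation rule described above can be sketched as a toy predicate (names are mine, not any particular engine's): a committing transaction passes only if nothing it read or wrote intersects the write set of a transaction that committed during its lifetime.

```python
# Sketch of the optimistic validation rule described in the post:
# T2 may commit only if (reads of T2 | writes of T2) is disjoint from
# the write set of every transaction that committed while T2 ran.

def validate(t2_reads, t2_writes, committed_write_sets):
    touched = t2_reads | t2_writes
    return all(not (touched & ws) for ws in committed_write_sets)

# T1 committed a write to "x" while T2 was running.
assert not validate({"x"}, {"y"}, [{"x"}])  # T2 read x: must restart
assert validate({"a"}, {"b"}, [{"x"}])      # disjoint: equivalent to T1, T2
```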
There's a (somewhat) insightful description of the serialization anomalies that can arise with MVCC here:
http://en.wikipedia.org/wiki/Snapshot_isolation
It contains a description similar to one of Jim's prior posts, and hints that there may be new developments on this front.
The classic paper on the topic is here, which offers more detail:
http://www.cs.umb.edu/~poneil/iso.pdf
Stu
> On 2/25/10 1:59 PM, "Krishna Sankar" <ksan...@gmail.com> wrote:
>
>> f) From my perspective the NOSQL has value in terms of scalability, internet
>> scale distributability, resilience (the requirement of which is a by product
>> of the internet infrastructure) ... I am not sure one would run Hadoop over
>> Oracle (in it's current form. I am aware that database vendors are adding the
>> mapReduce capability)
>
> Aren't we missing the other big motivations for NOSQL?: extreme low latency
> and extreme low cost (both update/insert cost and the cost of performing
> aggregations/ queries) at massive scale - which I think is orthogonal to
> CAP.
No, CAP is all about the tradeoffs involved with managing data at massive scale.
The use of SQL, however, is completely orthogonal to massive scalability. As is the use or avoidance of joins.
There's plenty of evidence, for example, that for aggregations & queries, MPP data warehouses are scalable and fast, and Hadoop really doesn't carry a performance edge. On the other hand, that's ignoring the relative work & time to do ETL to the warehouse vs. just writing Map & Reduce functions on flat data sets in HDFS.
For extreme low latency, I could see an argument against SQL or joins, yes. But you'd have to define "extreme". Oracle might take 100-400ms to do a write transaction, 50-250ms for a read, with full consistency, joins, etc. Cassandra might do 25ms for a read, maybe < 1ms for a write, but with relaxed consistency and key/value lookup. Which is better depends on your requirements.
That brings us to "Low Cost", where I would paraphrase Jamie Zawinski: "open source is only free if your time has no value". For a startup, it's a no-brainer to use lower cost databases. For an enterprise, I think it's a more complicated question.
It's unfortunate that the punditry has conflated all of this. It's completely possible to build an RDBMS for massive scale; it would just look different, and the SQL would feel somewhat different. It wouldn't be as "ACID" in the sense we think of today - there would need to be more settings on the isolation knob than we have today.
Cheers
Stu
> For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response.

In my counter-example, a non-failing node in the surviving partition can respond with success or failure, as appropriate, and a node in a non-surviving partition returns failure. A key element of the partition resolution algorithm is that a node can determine whether it is a member of a surviving partition or not, and respond accordingly.
--
Jim Starkey
NimbusDB, Inc.
978 526-1376
Thanks for the elaboration.
The MVCC mechanism you describe is not serializable and will also have a problem, since it only checks the overlap of the WRITE set but not the READ set.
If a transaction reads X and sets Y = X + 1, and then commits, it should fail if X has been modified and committed by another transaction in between. But the MVCC in your description doesn't catch this problem, because no transaction modifies Y in between.
The key idea is that the transaction should mark the "version" of not just what it has written, but also "what it depends on". (in my above example, this is X)
In your example, the "record count" is what the transaction depends on, but unfortunately this is not a shared record in the DB for detecting conflicts. To make this "dependent data" explicit, I would create a special record, "global_count". The transaction will update this global_count before adding the record and committing. So your MVCC transaction will look like this:
start transaction
    count = 0
    for each record in DB:
        count++
    insert new record <id, count>
    update global_count = global_count + 1    <--- Look at this line
end
--
Jim Starkey
NimbusDB, Inc.
978 526-1376
Jim,
I agree with you that "consistency" can be defined in a very app-specific way and that serializability is a means to achieve it; they are not equivalent from a pure theoretical viewpoint. In practice, though, "serializability" is the only way to achieve "consistency" in a generic way. Handling consistency on a case-by-case basis can be high-risk and very error-prone.
But I do agree with you in theory.
Had you read further than that, you would have seen that "response" here
refers to a response that is essentially positive in nature - e.g. data
for a read. This is made quite explicit in the proof of Theorem
1 in section 3.1 (emphasis mine).
"we know that the read *returns a value* by the availability requirement"
This is the common, default interpretation of "providing a response" in
the distributed-system literature. Failure indications such as you
suggest are functionally equivalent to non-responses, and are generally
treated as such.
> I'm afraid you may have confused the formal definition of
> availability with the concept of "no node left behind."
No, I'm not the one who's confused here.
If you really feel you've refuted CAP, why not start up a dialog with
Brewer, Gilbert, or Lynch? They're not hard to contact, they're the
authorities on the definitions and distinctions involved, and they'd
surely be interested in any such result.
Jeff Darcy wrote (replying to Jim Starkey, 02/26/2010 10:54 AM):
> In other words, you give up availability. Brewer made exactly this point on slide 16 of his 2000 PODC keynote (http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf) where he introduced the CAP Theorem. Congratulations, you've invented CP.

Happily, not. The Gilbert and Lynch paper defines availability:

> For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response.

In my counter-example, a non-failing node in the surviving partition can respond with success or failure, as appropriate, and a node in a non-surviving partition returns failure. A key element of the partition resolution algorithm is that a node can determine whether it is a member of a surviving partition or not, and respond accordingly.
That's only one part (the A+P => !C part) of what CAP means. If you
were to look at Dynamo papers, for example, then you'd see that it is
indeed useful in some circumstances to leave nodes running, serving
possibly stale data, in preference to shutting them (and every other
local service that depends on them) down because of quorum loss. The
expectation is generally that communication will eventually be restored,
at which time any inconsistencies from the partition period will be
detected and resolved - unambiguously, though not necessarily in any
order that strong consistency would require. Search for "eventual
consistency" to learn more.
The same reasoning also applies, by the way, to some systems that
predate Brewer's statement of CAP. I'm thinking here of distributed
filesystems, especially those such as Coda that support disconnected
operation, in which the preference for serving possibly inconsistent
data over shutting down is explicitly extended all the way down to a
single isolated node - based on the same expectation of a subsequent
reconciliation.
> But do note that since no database systems guarantee that any
> transaction will successfully complete, no database system, even one
> running on a single system, can ever meet your definition of
> availability.
Availability requires that all non-failing nodes respond to requests.
They don't have to be current responses, so they don't have to reflect
that pending transaction. Since this is what the I in ACID is all about
anyway, database systems can and very often do meet this definition of
availability.
> In a more serious vein, it is certainly possible to design systems
> that:
>
> 1. Remain consistent in the face of any possible network failure.
> 2. Provide availability to the degree allowed by mathematics.
> 3. Allow intelligent design to enable the survivability of the most
>    beneficial of possible partitions.
>
> This is the applicability of consistency, availability, and partition
> tolerance: rejecting the trivial and building useful systems.
I'm sure the people at Amazon, LinkedIn, Facebook, etc. who've built
systems based on explicit recognition of CAP tradeoffs would be very
surprised to hear that those systems aren't useful. Just because you
personally might not find such systems useful doesn't mean others don't.
To quote Schopenhauer:
"Every man takes the limits of his own vision for the limits of the world."
> The idea that we should build systems that can tolerate a loss of
> half the network because an orphan node couldn't operate is pretty
> dumb.
Yeah, it sure would be, but nobody but you has suggested any such thing.
There's certainly nothing in Brewer's work, or Gilbert and Lynch's, to
suggest anything like it. If you think otherwise, you're confused.
This scheme relies on the list of versions to be a complete list of
all transactions that occurred in the distributed system.
Does it work when some members don't report they've started/committed
a transaction? (=partitioned) If it does, then your distributed MVCC
DB should work with all nodes operating independently for an
indefinite amount of time. Which basically means that your distributed
DB doesn't need to communicate with any of its members. Ever. And
still be consistent. I find this really hard to prove. Can you
elaborate?
Sassa
> What you describe is not MVCC but optimistic concurrency control. They
> are not the same.
>
> > Let's say there are two concurrent transactions T1 and T2 that overlap in time, and let's say T1 commits first (of course it succeeds). When T2 then commits, it will only succeed if its (Read union Write) set does not intersect with T1's Write set.
>
> > If T2 succeeds, then it is equivalent to T2 starting AFTER T1 commits,
> > which is equivalent to some serial transaction order, either T1, T2 or
> > T2, T1.
>
> > If T2 fails, then T2 needs to restart the transaction and read the
> > latest update from T1, which is equivalent to T1, T2.
>
> > So it is equivalent to some serial order of transaction execution, so
> > why is MVCC not serializable?
>
> I described the example that showed MVCC wasn't serializable in a
> previous post, but here's the short version:
>
> 1. Single table with one numeric column.
> 2. Single transaction type that inserts a row reflecting the number
> of rows it could see.
> 3. A serializable system will have rows with all values from one to <n>
> 4. Starting with an empty table, two concurrent MVCC transactions
> will each see zero rows and both insert the same value - a result
> no serial execution can produce.
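The counterexample can be simulated in a few lines. This sketch is my own (names and structure are illustrative): each transaction reads a snapshot of the table and inserts one more than the row count it sees, modeling snapshot-isolation MVCC against serial execution.

```python
# Toy model of the anomaly: under snapshot isolation (MVCC), two
# concurrent transactions each snapshot the empty table, so both
# insert the value 1 -- an outcome no serial schedule can produce.

def run_concurrently(table):
    snap1 = list(table)   # both snapshots taken before either commits
    snap2 = list(table)
    table.append(len(snap1) + 1)  # neither read a row the other wrote,
    table.append(len(snap2) + 1)  # so first-committer-wins admits both
    return table

def run_serially(table):
    for _ in range(2):
        table.append(len(table) + 1)  # each txn sees the prior commit
    return table

print(run_concurrently([]))  # [1, 1]
print(run_serially([]))      # [1, 2]
```

The concurrent run commits under MVCC's usual conflict rule (disjoint write sets) yet matches no serial order, which is the non-serializable-but-arguably-consistent behavior being debated.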
> >>http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=...
> >> Posting guidelines: http://groups.google.ca/group/cloud-computing/web/frequently-asked-qu...
> >> Follow us on Twitter http://twitter.com/cloudcomp_group or @cloudcomp_group
> >> Post Job/Resume at http://cloudjobs.net
> >> Buy 88 conference sessions and panels on cloud computing on DVD at http://www.amazon.com/gp/product/B002H07SEC, http://www.amazon.com/gp/product/B002H0IW1U or get instant access to downloadable versions at http://cloudslam09.com/content/registration-5.html
In the transaction processing sense, consistency is related to
maintaining data integrity.
For example, assume the transaction (a logical unit of work) is a
transfer of $1000 from your savings account to your checking account.
You (the account holder) want assurance that the system's behavior
will be consistent - for example, that identical queries run at the
same time will return the same answer and that the answer will be
correct. Neither the consumer nor the system architect wants a system
to execute a partial update that produces inconsistent data, such as
debiting the savings account for $1000 without also crediting the
checking account for the same amount.
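The transfer above is the standard atomicity example. A minimal sqlite3 sketch (table and column names are my own, not from the discussion) shows how a failure between the debit and the credit leaves the data consistent:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("savings", 5000), ("checking", 200)])
conn.commit()

def transfer(amount, fail_midway=False):
    # 'with conn' commits both updates on success, rolls back on error.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - ? "
                     "WHERE name = 'savings'", (amount,))
        if fail_midway:
            raise RuntimeError("crash between debit and credit")
        conn.execute("UPDATE accounts SET balance = balance + ? "
                     "WHERE name = 'checking'", (amount,))

try:
    transfer(1000, fail_midway=True)
except RuntimeError:
    pass  # the partial debit was rolled back

print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'savings': 5000, 'checking': 200} -- no inconsistent partial update

transfer(1000)
print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'savings': 4000, 'checking': 1200}
```

Either both updates commit or neither does; a reader can never observe the $1000 debited from savings without the matching credit to checking.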
The result of a credit card transaction could vary based on whether a
system is rigidly consistent, eventually consistent or not consistent.
If a card is reported stolen and the system is consistent, the update
of the card's status will prevent further purchases. If the system
does not enforce consistency, or is eventually consistent, other
transactions could be processed against the card for a time before the
status update ripples through the system.
>> There's plenty of evidence, for example, that for aggregations &
>> queries, MPP data warehouses are scalable and fast, and Hadoop
>> really doesn't carry a performance edge.
The eBay data warehouse running on Greenplum is 6+ petabytes (PB). It's
an SQL solution that runs on 96 nodes.
(discussed here:
http://dobbscodetalk.com/index.php?option=com_myblog&show=Terabytes-to-Petabytes-Reflections-on-1999-2009.html&Itemid=29)
Owen O'Malley and Arun Murthy of Yahoo blogged about a project that
used 3800 nodes for Hadoop Map/Reduce to sort a petabyte in 16.25 hours
and a terabyte in 62 seconds.
http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html
On Feb 26, 8:43 pm, Jim Starkey <jstar...@nimbusdb.com> wrote:
> Ricky Ho wrote:
> > Jim, can you explain why MVCC is not serializable? In MVCC, every
> > transaction maintains a private copy of all data it has read and
> > updated. At commit time, the transaction submits the versions of this
> > read/write set and compares them with the versions stored in the DB.
> > The commit succeeds if the versions match; otherwise the transaction
> > fails.
>
> That's not the usual way for MVCC to work. There are alternatives
> (mostly used in record locking schemes), but the original design was to
> keep a list of record versions, each tagged with the id of the
> transaction that created that version. When a transaction starts, it
> takes note of what transactions have been committed. When a transaction
> visits a record, it traverses the version list until it finds a version
> that either it created or was committed when it started. When a
> transaction wants to update or delete a record, it checks that the most
> recent version was either previously updated by the transaction or
> committed when the transaction began. Otherwise, it either reports an
> update conflict or, more reasonably, waits for an intervening
> transaction to end, and reports an error only if the other transaction
> committed.
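The version-list scheme Jim describes can be sketched directly. Everything here (class and function names, the newest-first list layout) is my own illustration of his rules, not code from any product:

```python
# Sketch of the version-list visibility rule: a transaction sees the
# newest version it created itself or that was committed when it began.

class Transaction:
    def __init__(self, txn_id, committed_at_start):
        self.id = txn_id
        self.snapshot = set(committed_at_start)

def visible_version(versions, txn):
    """versions: newest-first list of (creator_txn_id, value)."""
    for creator, value in versions:
        if creator == txn.id or creator in txn.snapshot:
            return value
    return None  # record did not exist for this transaction

def can_update(versions, txn):
    # The most recent version must be ours or committed when we began;
    # otherwise it's an update conflict (or we wait, per the post).
    creator, _ = versions[0]
    return creator == txn.id or creator in txn.snapshot

# Txns 1 and 2 committed; txn 3 wrote a newer version and is still open.
versions = [(3, "c"), (2, "b"), (1, "a")]
t4 = Transaction(4, committed_at_start={1, 2})
print(visible_version(versions, t4))  # "b" -- txn 3's write is invisible
print(can_update(versions, t4))       # False -- conflict with open txn 3
```

The visibility walk gives each transaction a stable snapshot, while the update check is what detects the conflicts discussed above.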
That means that availability is guaranteed only in the partitions that
have a consistency agent. So scalability and partitioning resilience
of this solution depends on scalability and independence of
consistency agents.
> > Does it work when some members don't report they've started/committed
> > a transaction? (=partitioned) If it does, then your distributed MVCC
> > DB should work with all nodes operating independently for an
> > indefinite amount of time. Which basically means that your distributed
> > DB doesn't need to communicate with any of its members. Ever. And
> > still be consistent. I find this really hard to prove. Can you
> > elaborate?
>
> Resolution agents are necessary to avoid concurrent updates to the same
> record, unique indexes, and other declared consistency rules. Following
> a partition event, there are three interesting cases:
>
> 1. A transaction is in a failing partition, in which it case it can't
> commit anyway.
> 2. A transaction and resolution agent are in the surviving partition,
> in which case there is no issue.
> 3. A transaction is in the surviving partition but the resolution
> agent isn't. In this case, there is sufficient information for a
> subsequently anointed resolution agent to reconstruct its
> consistency management state and go forward.
This is as obvious as the CAP theorem. ;-)
The size of a partition cannot be controlled. The consequences are:
* partitions need to know which one is a surviving partition and which
one is a failing partition. I think this is not trivial, if the
members are peers; and hard to scale, if the members are
hierarchically related
* availability in failing partitions is sacrificed (just as CAP or any
other expert says)
* scalability of the surviving partitions depends on the size of
partitions - we aren't talking about clouds = "pool of resources
larger than anyone will ever need" anymore
Sassa
Am I missing something? Isn't this ye olde quorum determination?
Hasn't somebody somewhere laid out a scalable "count connected node
set" algorithm? Seems like it should be possible in log(N) time.
> * availability in failing partitions is sacrificed (just as CAP or any
> other expert says)
Yeah, but see note below.
> * scalability of the surviving partitions depends on the size of
> partitions - we aren't talking about clouds = "pool of resources
> larger than anyone will ever need" anymore
This is what always has confused me about CAP. What I recall of what
I've read seems to insist that all parts of a partitioned collection
still be able to satisfy all requests.
From my point of view, this seems a priori impossible for the usual
issues relating to split-brain clusters. All RO requests are OK, if
you had universal replication of all data. Expensive. And while
reconciliation of some updates may be possible after reconnection,
some will conflict.
Which thoughts tell me why this discussion has veered into MVCC. Duh.
And also reinforce my reaction to CAP as saying "I want the impossible
and it's big news that I can't get it."
Greg Pfister
http://perilsofparallel.blogspot.com/
On 2010-03-03, at 2:36 PM, Greg Pfister wrote:
> * scalability of the surviving partitions depends on the size of
> partitions - we aren't talking about clouds = "pool of resources
> larger than anyone will ever need" anymore
>
> This is what always has confused me about CAP. What I recall of what
> I've read seems to insist that all parts of a partitioned collection
> still be able to satisfy all requests.
That's if you choose Availability & Partition-tolerance.
For example, the Internet DNS is often used as an example of this kind of design. DNS still works for my LAN if I update it while it's partitioned from the bigger network. The bigger network can be updated while my LAN's DNS is partitioned. There isn't strong global consistency, but all updates eventually will propagate.
> From my point of view, this seems a priori impossible for the usual
> issues relating to split brain clusters. All RO requests are OK, if
> you had universal replication of all data. Expensive. But while
> reconciliation of some of updates may be possible after reconnection,
> but some will conflict.
Yup, that's the price of A+P.
I would consider databases with asynchronous log shipping, for example, another case of preferring A+P (with synchronous log shipping being a choice of C+P), though in practice there is stronger consistency with either approach than in some of the extreme A+P examples out there, arguably because the scale is smaller.
One way I've tried to explain how to think about the tradeoff is in terms of how big your "domain of control" is. When your system is small enough that you have control over the network, and can limit MTTR to very small bounds, you can relax partition tolerance and choose C+A, which is how most commercial clustered databases work (and why they often prefer to have redundant NICs, switches, etc.). But as you scale up, across racks, across switches, and eventually across geography & organizations, you have less control over the network failure rates and MTTR, so the software itself has to begin tolerating partitions. During failures, either you tell some portion of clients "sorry, can't help you right now", sacrificing availability, or you keep processing requests and have a way to prevent or handle consistency conflicts. A practical system would hopefully provide both options.
Well, not exactly. Most updates eventually will propagate. Those that got clobbered will be lost. Which of several conflicting updates wins is not deterministic. Also, it can be argued that a DNS server on an isolated partition is giving out wrong information.
That slop isn't good enough for a database system. If somebody in Tokyo and somebody in New York can each withdraw the last $1M from an account, there's going to be an unhappy banker who's going to think that getting the right answer trumps a funny definition of availability.
It's wrong to confuse parts of a system with the system as a whole. The fact that a number of parts of an available and functional system can't be reached is irrelevant. What's important is that the system remains available. If you can't reach it, well, tough noggies.
Partition tolerant should mean that the system retains its integrity, consistency, and availability until it's chopped into such little pieces that none have the critical mass to continue. This is completely doable as long as you aren't going to insist that the laws of logic fail.
The serious and interesting questions are the requirements for a geographically dispersed database system. You need the bandwidth to handle the replication, certainly, and low enough latency on the long-haul parts that commit latency is acceptable. Also redundant storage at every site. Practically speaking, you probably also need the bulk of the updates to be effectively local to a geographical region to minimize message traffic. But it's doable as long as you don't get hung up on the CAP conjecture.
Isn't the day+ delay behind what's called "pending" transactions? I'd
argue that's a different case entirely, one where the underlying
semantics are defined to be *consistent* with real-world physical
delays.
(Or with banking float-for-a-profit practices, but either way.)
Greg Pfister
http://perilsofparallel.blogspot.com/
Well, that will work if your cluster gets split into two partitions. I
don't think you can get quorum if it can be partitioned into more than
two partitions at once (=before the connectivity algorithm ends).
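A toy majority check makes the multi-way problem concrete (the vote counts are my own illustration, not from the thread):

```python
def has_quorum(votes_held, total_votes):
    # Strict majority: more than half of all expected votes.
    return 2 * votes_held > total_votes

total = 9

# Two-way split: at most one side can hold a strict majority.
assert has_quorum(5, total) and not has_quorum(4, total)

# Three-way split 3/3/3: no partition reaches a majority, so a plain
# majority-quorum rule leaves no surviving partition at all.
print(any(has_quorum(s, total) for s in (3, 3, 3)))  # False
```

Strict majority guarantees at most one survivor for any number of partitions, but as the three-way split shows, it can just as easily produce zero survivors.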
> > * availability in failing partitions is sacrificed (just as CAP or any
> > other expert says)
>
> Yeah, but see note below.
>
> > * scalability of the surviving partitions depends on the size of
> > partitions - we aren't talking about clouds = "pool of resources
> > larger than anyone will ever need" anymore
>
> This is what always has confused me about CAP. What I recall of what
> I've read seems to insist that all parts of a partitioned collection
> still be able to satisfy all requests.
Yes, ABLE TO :-)
> From my point of view, this seems a priori impossible for the usual
> issues relating to split brain clusters. All RO requests are OK, if
> you had universal replication of all data. Expensive. But while
> reconciliation of some of updates may be possible after reconnection,
> but some will conflict.
>
> Which thoughts tell me why this discussion has veered into MVCC. Duh.
>
> And also reinforce my reaction to CAP as saying "I want the impossible
> and it's big news that I can't get it."
I perceive CAP _is_ the latter - i.e. to settle it once and for all:
don't waste your time, this IS impossible; (and insisting that it is
possible means there is a flaw in your design, we won't spend our time
looking for the bug in your PhD thesis, etc).
Sassa
> Greg Pfister
> http://perilsofparallel.blogspot.com/
I won't argue about the rest of the points right now, but I have a bad
feeling about "Any elastic service can be rapidly spun up to full
power by plugging in more machines."
A cloud provider doesn't really have an infinite set of resources.
That's what was bugging me from day one on this list. I found
reconciliation in the definition of infinity as "more resources than
anyone will ever need". Starting from this point, it may be really
hard to prove that any partition will have "more resources than anyone
will ever need", hence elasticity promises can be hard to achieve.
Say, a cloud provider had 1000 machines. You don't know how many, but
more than you need. Now after partitioning you end up with a survivor
on a 100-node DC. Your service cannot be rapidly spun up, because
there are no more machines to plug in.
This sort of problem doesn't concern stateless services. But
transactional computations are stateful.
Sassa
Sassa
But it pervaded subject areas that led to poor data quality - for
example, customer records that became inconsistent across departments
and divisions. Through the prism of CAP, this is a crude form of A+P.
This led to massive data warehouses and "master data management"
products to clean the data and present a "single view of the
customer"...
Stu
Sent from my iPhone
I have three problems with Molina's method of assigning votes in a
distributed system:
* Not sure how this is going to work for an infinite set of nodes
** Hence probably can't model the cloud as an infinite set of nodes
* Not sure how it works when the size of the cluster is dynamic, like
in elastic computing
** You should be concerned not with the current size of the
partitions, but with the future size of the partitions - i.e. it makes
sense to kill the partition that can't scale in preference to the
partition that can scale to a larger size
* Not sure how it works if partitioning ends up with more than two
partitions
** I.e. the meaning of "majority" is a strict inequality: only one
partition can have more than half of the expected votes
** Cannot work even if you know the number of partitions, when it is
more than 2
Sassa
On Mar 13, 7:32 pm, Jim Starkey <jstar...@nimbusdb.com> wrote:
> Sassa wrote:
> > I have three problems with Molina's method of assigning votes in a
> > distributed system:
>
> > * Not sure how this is going to work for infinite set of nodes
> > ** Hence probably can't model the cloud as an infinite set of nodes
>
> The concept of infinite nodes violates the laws of both economics and
> physics. Putting that aside, assume a finite set of "canary nodes" that
On a few occasions you mentioned cloud as an infinite pool of
resources, hence the remark.
A finite set of canaries means finite scale. Even if the complete set
of canaries survive, you can't say that you can scale out any time to
reconstitute any lost non-canary nodes.
I am referring here to your comment that a surviving partition can
scale to the necessary size.
Molina's work ends where clusters start to change their size over
time.
> participate in coteries. A non-canary node is part of the surviving
> partition if it can communicate with at least one canary node in the
> surviving partition.
>
> > * Not sure how it works when the size of the cluster is dynamic, like
> > in elastic computing
>
> The coterie set can be changed by mutual consent. I'll leave the
> protocol as an exercise for the reader.
>
> > ** You should be concerned not with the current size of the
> > partitions, but with the future size of the partitions - i.e. it makes
> > sense to kill the partition that can't scale in preference to the
> > partition that can scale to a larger size
>
> By definition, the partitions can't communicate, so it isn't possible to
> evaluate which is larger. That said, you can get the effect you want by
> the selection of coteries.

Yes, that's not the problem. The problem is that all partitions (the
dead ones and the surviving one) scale to different extents. So keeping
a partition that is locked in a 100-node room vs a partition that
retained communication to 10 DCs has different implications.

> > * Not sure how it works if partitioning ends up with more than two
> > partitions
>
> Not an issue. If any partition is large enough to contain a coterie,
> it is the only surviving partition.
>
> > ** I.e. the meaning of "majority" is a strict inequality that one
> > partition can have more than a half of expected votes
>
> Depending on how the coteries are selected, a majority isn't required.
> It's possible to have a hundred machines with only three coteries
> { a, b }, { b, c }, { a, c } such that any two of { a, b, c } would
> constitute a surviving partition. The flexibility of coteries lets you
> plan for extreme sets of failures.
>
> > ** Cannot work even if you know the number of partitions, when it is
> > more than 2
>
> Assume there are n surviving partitions for n > 1. By the definition
> of surviving partition, each must contain a coterie. By the definition
> of partition, no two partitions can have common nodes, therefore
> the coteries must be disjoint, which violates the definition of coteries.
OK, I understand that. This can work in a fixed-size cluster with no
proof of the surviving partition scaling out well. (And there are
partitionings where all partitions will be deemed dead - i.e. {a}, {b},
{c}; but OK, that's a partitioning not planned for, as you put it; and
yes, it does work for the partitionings planned for.)
Sassa
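The coterie argument can be checked mechanically. This sketch uses the { a, b }, { b, c }, { a, c } example from the thread; the helper names are my own:

```python
from itertools import combinations

# The three example coteries over canary nodes a, b, c.
coteries = [{"a", "b"}, {"b", "c"}, {"a", "c"}]

def survives(partition):
    # A partition survives iff it fully contains at least one coterie.
    return any(c <= partition for c in coteries)

# Coteries pairwise intersect, so two disjoint partitions can never
# both contain one -- at most one side of any split survives.
assert all(c1 & c2 for c1, c2 in combinations(coteries, 2))

canaries = {"a", "b", "c"}
for k in range(len(canaries) + 1):
    for side in map(set, combinations(canaries, k)):
        other = canaries - side
        assert not (survives(side) and survives(other))

print(survives({"a", "c"}))                  # True: any two canaries suffice
print(any(survives({n}) for n in canaries))  # False: {a},{b},{c} kills all
```

The exhaustive loop confirms both Jim's uniqueness argument and Sassa's caveat: some partitionings (the three singletons) leave no survivor at all.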
Sassa wrote:
> Assume you have those {a,b,c} canaries. Assume {c} is not reachable
> anymore. {a,b} is the surviving partition. But scalability of that
> partition is not proven. Can't scale - availability suffers.
> Elasticity of the cloud can't add anything here.

Sure it can -- add more nodes. It's not going to help the guys who can't reach them, but life is hard.

If Brewer's conjecture is reduced to "no communication == no consistency", well, then, duh.
"add more nodes" as in "get the lorry on the road"? I am referring to
the limit imposed by the physical size of the partition. If the DC has
only 1000 nodes, then a surviving partition locked in that DC won't
scale beyond 1000 nodes
unless you ship a truck-load of more nodes, which isn't the same as
"scale out on demand to 'infinite' size".
> If Brewer's conjecture is reduced to "no communication == no
> consistency", well, then, duh.
yes, it is reduced to exactly that - "no communication == no
consistency OR no availability". that's what it says on the label
anyway!
some say "told you!", yet others say "I've got a system that proves it
wrong!"
I take it the CAP theorem is a formal way to put a stop to the argument
over whether "my system can have all 3!"
Sassa
My reading is that he seems to think it's not so much wrong as
irrelevant -- not useful: situations where it might apply either
don't occur in reality, or else occur very infrequently.
A likely contentious position presented is that in his experience LAN
partitions are "exceedingly rare" and WAN partitions "quite rare."
(And he references NimbusDB!)
Greg Pfister
http://perilsofparallel.blogspot.com/
Sassa
On Apr 8, 2:38 am, Greg Pfister <greg.pfis...@gmail.com> wrote:
> Some new fodder for this mill: Stonebraker weighs in on CAP in CACM: http://bit.ly/cCDWDS
>
> My reading is that he seems to think it's not so much wrong as
> irrelevant -- not useful: Situations where it might apply it either
> doesn't in reality, or else they occur very infrequently.
>
> A likely contentious position presented is that in his experience LAN
> partitions are "exceedingly rare" and WAN partitions "quite rare."
>
> (And he references NimbusDB!)
>
> Greg Pfister
> http://perilsofparallel.blogspot.com/