Brewer's CAP Conjecture is False


Jim Starkey

Feb 25, 2010, 2:38:13 PM
to cloud-c...@googlegroups.com
Brewer's CAP conjecture is that a networked system cannot be consistent,
available, and partition tolerant. A long description can be found at

http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

and a purported "proof" at

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf

The CAP conjecture, I am convinced, is false and can be proven false.

The key to the counterexample, and the flaw in the "proof", is that
each assumes that a system must be serializable to be consistent.
Serializability is a sufficient condition for consistency, but it is not a
necessary one. A counterexample system that demonstrates CAP to be
false has the following characteristics:

1. The system is transactional, i.e. all updates within a transaction
are tentative until committed atomically.
2. Inconsistency can be detected during execution of a transaction so
that a transaction that would otherwise introduce inconsistency is
prevented from committing.
3. Commits are executed in the context of a partition tolerant commit
algorithm (a transaction is not committed until it has been
reported as committable to all potential surviving partitions).

The only serious question is whether a system can be both
non-serializable and consistent, and it isn't much of a question given a
formal definition of consistency (concurrent transactions can't
overwrite each other, unique constraints must appear unique, etc.). It
is easier, cheaper, and bulletproof to serialize updates to specific
elements within a transaction than to try to serialize transactions
themselves. For example, every database record might have a
deterministic resolution agent that actively prevents concurrent update
in a multi-version concurrency control (MVCC) database.
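
Here is a minimal sketch of what such an agent might look like, in
Python. The names (RecordAgent, VersionConflict) and the version-counter
mechanics are illustrative assumptions, not NimbusDB's design; the point
is only that updates to one element are serialized while transactions as
a whole are not.

import threading

class VersionConflict(Exception):
    pass

class RecordAgent:
    """Hypothetical per-record resolution agent: updates to this one
    element are serialized, while transactions as a whole are not."""
    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        self._value = None

    def read(self):
        with self._lock:
            return self._value, self._version

    def update(self, new_value, expected_version):
        # Deterministically reject an update if a concurrent transaction
        # has installed a newer version since this transaction read it.
        with self._lock:
            if self._version != expected_version:
                raise VersionConflict("concurrent update")
            self._value = new_value
            self._version += 1

agent = RecordAgent()
_, v = agent.read()
agent.update("first", v)           # installs version 1
try:
    agent.update("second", v)      # same expected version: conflict
except VersionConflict:
    print("second writer must roll back")

Two concurrent transactions that both read version 0 cannot both update;
the loser learns of the conflict before commit, which is one way
characteristic 2 above could be enforced without serializing whole
transactions.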

The quintessential problem is that academic computer science has been
conflating the concepts of consistency and serializability for so long
that it has forgotten that they are not the same thing. If you substitute
"serializable" for "consistent", the CAP conjecture becomes the SAP
conjecture, and is probably true. So the trick is to be CAP without
being a SAP.

The CAP conjecture has been a theoretical millstone around the neck of
all ACID systems. Good riddance.

This is the first wooden stake for the heart of the noSQL movement.
There are more coming.


--
Jim Starkey
Founder, NimbusDB, Inc.
978 526-1376


Stuart Charlton

Feb 25, 2010, 3:58:19 PM
to cloud-c...@googlegroups.com
Jim,

I had thought the whole point of the CAP theorem was to describe
fundamental tradeoffs. No one who truly understands the implications
of that theorem would argue that relaxing C would enable improved
levels of both A and P. This is what I tried to explain in my
"Designing for the Cloud" slides: http://www.slideshare.net/StuC/oopsla-cloud-workshop-designing-for-the-cloud-elastra

In other words, it's not that the conjecture is false, it's that
people need to check their assumptions in applying it. It's not a
black and white scenario where consistency kills you, full stop.

Where I agree with you is the NoSQL crowd's tendency, like with all
fads, to overgeneralize, and to ignore evidence in front of their
noses. "Eventual consistency" a la Dynamo is only one way to skin the
scalability cat.

Stu


Sent from my iPhone


Greg Pfister

Feb 25, 2010, 4:34:26 PM
to Cloud Computing
Interesting and useful, Jim. I've always been leery of CAP, myself.

One question/comment:

Isn't it the case that your counterexample actually is serializable,
just not locally (within a transaction) serializable? Instead, it's
serializable at the transaction level.

Greg Pfister
http://perilsofparallel.blogspot.com/


Krishna Sankar

Feb 25, 2010, 4:59:02 PM
to cloud-c...@googlegroups.com
Jim,
Interesting ... Some points:

a) NOSQL != !(SQL)
b) and so, !(NOSQL) != SQL
c) The unfortunate NOSQL term doesn't mean NoSQL (notice the capitalization). IMHO, what it says is that there are scenarios and characteristics which require less strict ACID and various flavors (key-value store, document store, data structure store, et al.)
d) Hence, saying the CAP theorem is false does not mean the end of the NOSQL debate
e) Granted, there is lots of rhetoric going on, but that is hardly a basis for sound theory or, for that matter, practice. For example, a blog [http://blogs.forrester.com/appdev/2010/02/nosql.html] talks about NoSQL. I have no clue why folks write about things without researching them through, and that is a problem. In fact the blogger is going to write a report - am not sure if it is a threat or a promise ;o)
f) From my perspective, NOSQL has value in terms of scalability, internet-scale distributability, and resilience (the requirement for which is a by-product of the internet infrastructure) ... I am not sure one would run Hadoop over Oracle (in its current form; I am aware that database vendors are adding MapReduce capability)
g) There is the NOSQL crowd and the NoSQL crowd, which of course, are not the same ...

Cheers
<k/>


Jim Starkey

Feb 25, 2010, 5:21:54 PM
to cloud-c...@googlegroups.com
Greg Pfister wrote:
Interesting and useful, Jim. I've always been leery of CAP, myself.

One question/comment:

Isn't it the case that your counterexample actually is serializable,
just not locally (within a transaction) serializable? Instead, it's
serializable at the transaction level.
  
Here's a proof that it isn't serializable:
  1. Assume a single table with a single numeric column.
  2. Assume a single transaction type T that counts the number of rows and stores a new row with that count.
  3. Two concurrent transactions start against an empty table.  Each stores the number 0.
  4. If the system were serializable, there could not be duplicate values. QED
If you really wanted to enforce unique values for the column, you could put a unique index on it, which would cause one of the concurrent transactions to fail.
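
A toy rendering of that proof in Python, assuming only that each
transaction reads from a private snapshot (my illustration, not any
particular engine's mechanics):

table = []                  # committed rows

def snapshot():
    return list(table)      # a stable, private view of the table

snap1 = snapshot()          # transaction T1 starts
snap2 = snapshot()          # transaction T2 starts concurrently
table.append(len(snap1))    # T1 commits: inserts 0
table.append(len(snap2))    # T2 commits: also inserts 0

print(table)                # [0, 0]; either serial order gives [0, 1]

No serial order of the two transactions could produce [0, 0], so the
execution is consistent (absent a declared unique constraint) yet not
serializable.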

MVCC (multi-version concurrency control) has always been consistent but non-serializable (Bill Noyce, once of DEC, gets full credit for the proof).  Apply MVCC to the cloud, add agents to serialize updates of discrete elements, and magic happens.

Jim Starkey

Feb 25, 2010, 5:41:21 PM
to cloud-c...@googlegroups.com
Krishna Sankar wrote:
Jim,
	Interesting ... Some points:

	a)	NOSQL != !(SQL)
	b)	and so, !(NOSQL) != SQL
	c)	The unfortunate NOSQL term doesn't mean NoSQL (notice the capitalization). IMHO, what it says is that there are scenarios and characteristics which require less strict ACID and various flavors (key-value store, document store, data structure store, et al.)
	d)	Hence, saying CAP theorem is false does not mean end of NOSQL debate
	e)	Granted, there is lots of rhetoric going on, but that is hardly a basis for sound theory or for that matter practice. For example a blog [http://blogs.forrester.com/appdev/2010/02/nosql.html] talks about NoSQL. Have no clue why folks write about things without researching through and that is a problem. In fact the blogger is going to write a report - am not sure if it is a threat or a promise ;o)
	f)	From my perspective, NOSQL has value in terms of scalability, internet-scale distributability, and resilience (the requirement for which is a by-product of the internet infrastructure) ... I am not sure one would run Hadoop over Oracle (in its current form; I am aware that database vendors are adding MapReduce capability)
	g)	There is the NOSQL crowd and the NoSQL crowd, which of course, are not the same ...

  
Can we compromise on "NoSequel"?  

The common denominator is not the absence of SQL but the absence of atomic transactions, which makes SQL untenable (MySQL and Monty have no problem with SQL without transactions, but that's another story).  Key/value stores compensate with arbitrarily complex "rows", which allows them to package everything associated with a shopping cart transaction in a single row.

I don't think there is any user demand for key/value stores or Map/Reduce data stores (exception: maybe business analytics).  There is user demand for data stores that scale and provide geographically dispersed redundancy.  If these requirements can be met with relational databases, NoSequel has almost nothing to offer.

I'm beginning to enjoy saying that NimbusDB is hardware fault tolerant, software fault tolerant, and geological fault tolerant.  Haven't decided whether human fault tolerant goes in V1 or V2.

Jim Starkey

Feb 25, 2010, 5:44:15 PM
to cloud-c...@googlegroups.com
Stuart Charlton wrote:
> Jim,
>
> I had thought the whole point of the CAP theorem was to describe
> fundamental tradeoffs. No one who truly understands the implications
> of that theorem would argue that relaxing C would enable improved
> levels of both A and P. This is what I tried to explain in my
> "Designing for the Cloud" slides:
> http://www.slideshare.net/StuC/oopsla-cloud-workshop-designing-for-the-cloud-elastra
>
My point is that there does not need to be a tradeoff at all. It is
perfectly possible to construct a system that is consistent, available,
and partition tolerant. It just can't be serializable.

My slightly deeper point is that "database theory" is based as much on
folklore as on actual provable results. The consequence of this is
that progress in computing -- particularly in cloud computing, which has
these issues in spades -- has been unnecessarily retarded by shallow
thinking. The case in point is database scalability (insert shameless
plug for talk here). If you conflate consistency with serializability,
you can prove that a single database can't be run on multiple computers
without those computers executing in lock step. If you are willing to
knock over a sacred cow, it is apparent that a database can scale
elastically to meet an arbitrary demand and can be geographically
dispersed, each of which the CAP conjecture ('tain't a theorem no more)
says can't be done.

This is the point in history where database systems leave the rails and
start to fly.

>
> In other words, it's not that the conjecture is false, it's that
> people need to check their assumptions in applying it. It's not a
> black and white scenario where consistency kills you, full stop.

It is black and white. The CAP conjecture is, in fact, false. The SAP
conjecture is a different story, but who should care about
serializability if consistency can be achieved in other ways? If you
don't have transactions that can be rolled back, I'll concede some
validity to the CAP conjecture, but then in the absence of transactions
it's next to impossible to achieve consistency anyway. But on its face,
the CAP conjecture is just wrong.

Sassa

Feb 25, 2010, 5:56:14 PM
to Cloud Computing
It may depend on a difference in what C, A and P mean to you.

I don't see serialization as a required implementation of
transactional consistency.

If P(artitioning) means loss of communication between transaction
participants, then a transaction can only be rolled back and a new one
cannot be started. If you can run transactions without requiring the
partitions to be consistent (i.e., without others knowing you are having
a transaction), you aren't really partitioned.

I think MVCC or not, all parts of the system need to agree
(=Consistency) on the values of the attributes being updated, before
commit ends.

So starting new transactions can work. But how do you commit without
propagating the knowledge about this among interested but partitioned
parties?


Sassa



Gabriel Mateescu

Feb 25, 2010, 7:12:22 PM
to cloud-c...@googlegroups.com
On Thu, Feb 25, 2010 at 2:38 PM, Jim Starkey <jsta...@nimbusdb.com> wrote:
> Brewer's CAP conjecture is that a networked system cannot be consistent,
> available, and partition tolerant.  [...]

> The CAP conjecture, I am convinced, is false and can be proven false.
>
> The key to the counterexample, and the flaw in the "proof", is that
> each assumes that a system must be serializable to be consistent.

I guess it depends on how you define consistency.

In my opinion, if C in CAP stands for Strong Consistency,
then the CAP theorem is correct. However, for relaxed
forms of consistency, e.g., eventual consistency,

http://en.wikipedia.org/wiki/Eventual_consistency

all three properties could be true at the same time.


Gabriel

Miha Ahronovitz

Feb 25, 2010, 11:08:47 PM
to cloud-c...@googlegroups.com
Gabriel,

This is the clearest contribution to this thread, which is probably intelligible to about 300 people in this group. That means 9,000 people read these emails "ca mitza in calendar" (Romanian: like the cat staring at the calendar), which means they are completely outside the topic.
Engineering is about compromises. So (reading from the Wikipedia link you provided), if we accept BASE (Basically Available, Soft state, Eventual consistency), as opposed to the database concept of ACID (atomicity, consistency, isolation, durability), we compromise on Consistency too.

Now what has this to do with the clouds? I think what Jim wants to say is that his idea of a NimbusDB database is one that can be uploaded to the cloud, automatically and seamlessly joins all the other NimbusDB databases uploaded by thousands of others, distributes itself across available servers (Scalable), and has Partition Tolerance (if one or more servers fail); it can also be Consistent if we accept "eventual consistency" a la BASE instead of ACID consistency.

That means there is a solution. Sure, banks want ACID, but there must be a huge market outside finance that will tolerate BASE. That means Nimbus has a theoretical future.

How big is this BASE market?


Miha



Jeff Darcy

Feb 26, 2010, 12:12:51 AM
to cloud-c...@googlegroups.com
Jim Starkey wrote:
> Brewer's CAP conjecture is that a networked system cannot be consistent,
> available, and partition tolerant. A long description can be found at
>
> http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
>
> and a purported "proof" at
>
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf
>
> The CAP conjecture, I am convinced, is false and can be proven false.

Nice to see you're finally reading Lynch, Jim, only a month or two after
you clearly had no idea who she was or what the formal underpinnings of
CAP were. Unfortunately, you're a little premature in declaring victory
here. Refuting or limiting a proof is not the same as proving its
conclusion is false. If I say that 2+2=4 on Mondays, and you point out
that my proof is inapplicable six days of the week, you still haven't
proven that 2+2 equals anything but four on Tuesday through Sunday.
Similarly, even if Gilbert and Lynch's proof applies only to
serializable consistency, that does not preclude a separate and more
general proof of CAP. Most branches of mathematics are full of examples
where a theorem was proven for successively more general cases before it
was proven in its entirety.

Turning now to the substance of your supposed refutation, it seems not
to address the issue of when a transaction terminates - and it must
terminate in all cases to satisfy the Gilbert/Lynch definition of
availability (section 2.2). As soon as you need to ensure that
transactions terminate, you need to have a mechanism to coordinate
rollback of one or more transactions that might span several of your
"deterministic resolution agents", even in the presence of arbitrary
partitions. I'm sure many in the computer science community eagerly
await your explanation of how to solve that particular problem.

That's only the first problem that I noticed. I'll be glad to point out
more if you'd like. Someone might still manage to prove that there is
some useful set of definitions for which CAP is provably false, but it's
going to take an awful lot more than an "MVCC is magic" hand-wave. Be
careful you don't replace the C in CAP with Completely Ridiculous. ;)

Stuart Charlton

Feb 26, 2010, 1:17:11 AM
to cloud-c...@googlegroups.com
On Thu, Feb 25, 2010 at 2:44 PM, Jim Starkey <jsta...@nimbusdb.com> wrote:
Stuart Charlton wrote:
Jim,

I had thought the whole point of the CAP theorem was to describe fundamental tradeoffs.  No one who truly understands the implications of that theorem would argue that relaxing C would enable improved levels of both A and P.  This is what I tried to explain in my "Designing for the Cloud" slides: http://www.slideshare.net/StuC/oopsla-cloud-workshop-designing-for-the-cloud-elastra
My point is that there does not need to be a tradeoff at all.  It is perfectly possible to construct a system that is consistent, available, and partition tolerant.  It just can't be serializable.

OK, but that's what I thought section 4.4 of Gilbert & Lynch's proof was talking about, when they bring up Weaker Consistency in a Partially Synchronous case.  You're absolutely right that you can have all three properties, if you're willing to tolerate consistent, but stale data.  

I think the counter-example you originally gave doesn't actually prove your point, if I'm reading it correctly.  Your point #2 hurts the "A" in CAP, since you're preventing a transaction from committing because it might introduce inconsistency.   Now, there are normal database integrity reasons for preventing an inconsistent update (optimistic concurrency checks, for example), which are fine.   But if my database chooses to prevent inconsistent updates because of a partition, I'm reducing availability for the sake of consistency -- I've chosen C & P at the expense of A.

On the other hand, what I think WOULD fit your broader point would be to consider an MVCC case with some kind of snapshot (not serializable) isolation.   In that case, the client might only be able to access stale, but consistent data, relative to some previous snapshot epoch.    

This case is what I think you're going after -- you relax C to snapshot-level consistency, to enable both Availability (my transactions will commit because they were based on a known snapshot), and Partition tolerance (which just means the data is going to be more stale than it would be if my network was golden).    The devil would be in the details, of course.    This is more or less how Vertica works today, for example, and I gather one of the things Stonebraker's trying to do with VoltDB based on their 2-page teaser.



My slightly deeper point is that "database theory" is based as much on folklore as on actual provable results.  The consequence of this is that progress in computing -- particularly in cloud computing, which has these issues in spades -- has been unnecessarily retarded by shallow thinking.  The case in point is database scalability (insert shameless plug for talk here).  If you conflate consistency with serializability, you can prove that a single database can't be run on multiple computers without those computers executing in lock step.  If you are willing

Agree with you generally here.
 
to knock over a sacred cow, it is apparent that a database can scale elastically to meet an arbitrary demand and can be geographically dispersed, each of which the CAP conjecture ('tain't a theorem no more) says can't be done.


My objection is that I don't think the CAP conjecture actually says it can't be done.  It's the "popular & flawed interpretation of CAP by a new generation of young developers" that might be claiming it can't be done.   

There's a serious bandwagon effect on multiple fronts every time we have major technology shifts like this, and it takes years to sort out the folklore from the reality.  I applaud your efforts :)

Stu


Ricky Ho

Feb 26, 2010, 10:02:42 AM
to cloud-c...@googlegroups.com
Jim,


Interesting! Can you define "consistency"?


In my opinion, the whole objective of "consistency" is to make reasoning about the application logic easy within a concurrent environment. Imagine our DB is a state machine and each user action is a trigger of a state transition. Without such isolation, the possible combinations of user actions (from different users) are a combinatorial explosion which makes it almost impossible to reason about the application's behavior.

To simplify our analysis, we "flatten" the combinatorial explosion by disallowing user actions (within a logical unit of work) from interlacing with each other. In other words, the logical unit of work for each user behaves as if it had exclusive access to all states. This is the foundation of serializability.

So I agree with you. From a pure theoretical perspective, "serializability" is a means to achieve the goal of "consistency". But practically, I don't see any better alternatives.

Think about the difference between programming in a multi-threaded environment using LOCKs and programming with Software Transactional Memory, and you come to the same picture.

Rgds,
Ricky



Ricky Ho

Feb 26, 2010, 10:03:06 AM
to cloud-c...@googlegroups.com
In your proof example, if you turn on the "serializable" isolation level, one of the transactions will fail.

I think MVCC is just another way of implementing serialization: instead of using a pessimistic LOCK strategy, it uses an optimistic version mechanism.  If the version has been updated, the transaction will fail, which is equivalent to the transaction having executed successfully earlier but having been overwritten by a later transaction.

Rgds,
Ricky

Jim Starkey

Feb 26, 2010, 10:54:35 AM
to cloud-c...@googlegroups.com
No need to wait.  Hector Garcia-Molina solved it 25 years ago.  A set of machines agrees on a set of non-disjoint "coteries", where each coterie is a subset of the machines in the system.  After a partition event, a partition needs to contain at least one complete coterie to survive.  Since there are no disjoint coteries and no machine can be in more than one partition, there is at most one surviving coterie.  If no partition contains a complete coterie, everyone must shut down.

If a precommit is delivered to at least one machine in every coterie before it is considered committed, then it must have been reported in the surviving coterie.   If the precommit reaches some but not all of the systems, it will be committed if it reached a machine in the surviving partition; otherwise it will be rolled back.
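
A sketch of both rules in Python (the machines and coterie sets are made
up; the only requirement is that coteries be pairwise non-disjoint):

from itertools import combinations

coteries = [frozenset("abc"), frozenset("cde"), frozenset("ace")]

# Coteries must be pairwise non-disjoint, so at most one partition
# can ever contain a complete coterie.
assert all(c1 & c2 for c1, c2 in combinations(coteries, 2))

def survives(partition):
    # A partition survives only if it contains a complete coterie.
    return any(c <= partition for c in coteries)

def committable(acks):
    # A precommit counts as committed only once acknowledged by at
    # least one machine in every coterie, so the surviving partition
    # (whichever it is) has necessarily heard of it.
    return all(acks & c for c in coteries)

print(survives(set("abc")))    # True:  contains coterie "abc"
print(survives(set("de")))     # False: must shut down
print(committable(set("c")))   # True:  "c" sits in every coterie
print(committable(set("bd")))  # False: no ack reached coterie "ace"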



That's only the first problem that I noticed.  I'll be glad to point out
more if you'd like.  Someone might still manage to prove that there is
some useful set of definitions for which CAP is provably false, but it's
going to take an awful lot more than an "MVCC is magic" hand-wave.  Be
careful you don't replace the C in CAP with Completely Ridiculous.  ;)

  
Please do.  I eagerly await your reply.

Gilad Parann-Nissany

Feb 26, 2010, 11:13:05 AM
to cloud-c...@googlegroups.com
Hi Jim

In your approach, if I get it, a transaction is broken down into multiple individually serializable sub-steps (each within a specific partition). Is that what you meant?

If so, some Qs:

1. how do you achieve consistency across these multiple sub-steps? If one fails, how do you roll back all the steps that were already committed? Are we talking here about sub-transactions?

2. is this new DB holding multiple copies of each partition (good for robustness & durability)? If so, does each sub-transaction "finish" only after all copies are updated?

Thanks.
Regards
Gilad
__________________
Gilad Parann-Nissany
CEO, Founder
Porticor Cloud Security
http://www.porticor.com/


Jeff Darcy

Feb 26, 2010, 11:21:19 AM
to cloud-c...@googlegroups.com
On 02/26/2010 10:54 AM, Jim Starkey wrote:
> No need to wait. Hector Garcia-Molina solved it 25 years ago. A set of
> machines agrees on a set of non-disjoint "coteries", where each coterie
> is a subset of the machines in the system. After a partition event, a
> partition needs to contain at least one complete coterie to survive.

In other words, you give up availability. Brewer made exactly this
point on slide 16 of his 2000 PODC keynote
(http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf) where
he introduced the CAP Theorem. Congratulations, you've invented CP.

Pranta Das

Feb 26, 2010, 11:32:27 AM
to cloud-c...@googlegroups.com
I completely agree. We live in a world where "one size does not fit all" and where consistency requirements will always be hybrid - some stringent (or serializable), others more relaxed (or loose).

If I remember my high-school chemistry well,

       ACID + BASE = SALT (Somewhat Acceptable Loosely-consistent Transactions) + H2O (Isn't that what Clouds are made of?)

Cheers,
Pranta



Jim Starkey

Feb 26, 2010, 11:44:46 AM
to cloud-c...@googlegroups.com
Ricky Ho wrote:
> Jim,
>
>
> Interesting! Can you define "consistency"?
>

I don't believe there is -- or can be -- a universal definition of
consistency. Consistency must be what you declare it to be. In a
database context, consistency is defined by the set of constraint
declarations: a set of unique constraints, a set of referential
integrity constraints, constraints on domains, etc. Other than simple
rules like "a record can't update a version of a record it couldn't
see", consistency should be explicit.


>
> In my opinion, the whole objective of "consistency" is to make reasoning about the application logic easy within a concurrent environment. Imagine our DB is a state machine and each user action is a trigger of a state transition. Without such isolation, the possible combinations of user actions (from different users) are a combinatorial explosion which makes it almost impossible to reason about the application's behavior.
>

It is easy to demonstrate that a serializable database has at every
point a single definitive state. A non-serializable database, even a
consistent one, may never have a single definitive state due to message
skew of commit messages. On each node, a transaction sees the set of
transactions that have been reported committed, but since commit
messages are delivered asynchronously, concurrent transactions on
different nodes will see different sets of committed transactions, hence
no definitive database state.

But given that there isn't a single definitive state, the model of a
state machine doesn't quite apply.


> To simplify our analysis, we "flatten" the combinatorial explosion by disallowing user actions (within a logical unit of work) from interlacing with each other. In other words, the logical unit of work for each user behaves as if it had exclusive access to all states. This is the foundation of serializability.
>
> So I agree with you. From a pure theoretical perspective, "serializability" is a means to achieve the goal of "consistency". But practically, I don't see any better alternatives.
>

MVCC allows a transaction a stable view of a volatile database,
subject to the rule that a transaction can't modify a version of a
record it didn't see. MVCC isn't serializable, but it can be made to
enforce any arbitrary definition of consistency.

Murray Spork

Feb 26, 2010, 12:18:55 PM
to cloud-c...@googlegroups.com
On 2/25/10 1:59 PM, "Krishna Sankar" <ksan...@gmail.com> wrote:

> f) From my perspective, NOSQL has value in terms of scalability, internet-
> scale distributability, and resilience (the requirement for which is a by-product
> of the internet infrastructure) ... I am not sure one would run Hadoop over
> Oracle (in its current form; I am aware that database vendors are adding
> MapReduce capability)

Aren't we missing the other big motivations for NOSQL? Extreme low latency
and extreme low cost (both update/insert cost and the cost of performing
aggregations/queries) at massive scale - which I think is orthogonal to
CAP.

If Jim can give me those things as well as solving the CAP problem - then
sign me up!

--
Murray

Ricky Ho

Feb 26, 2010, 12:45:00 PM
to cloud-c...@googlegroups.com
Jim,

Can you explain why MVCC is not serializable?

In MVCC, every transaction maintains a private copy of all data it has read and updated.  At commit time, the transaction submits the versions of this read/write set and compares them with the versions stored in the DB.  The commit succeeds if the versions match; otherwise the transaction fails.

Let's say two concurrent transactions T1 and T2 overlap in time, and let's say T1 commits first (of course it succeeds).  After that, when T2 commits, it will only succeed if its (Read union Write) set does not intersect T1's Write set.

If T2 succeeds, then it is equivalent to T2 starting AFTER T1 commits, which is equivalent to some serial transaction order, either T1, T2 or T2, T1.

If T2 fails, then T2 needs to restart the transaction and read the latest update from T1, which is equivalent to T1, T2.

So it is equivalent to some serial order of transaction execution; why is MVCC not serializable?
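
Here is the rule I have in mind, sketched in Python (read/write sets as
sets of keys; just an illustration of commit-time backward validation,
not any particular product's mechanism):

def can_commit(read_set, write_set, overlapping_commit_writes):
    # T may commit only if nothing it read or wrote intersects the
    # write set of a transaction that committed while T was running.
    footprint = read_set | write_set
    return all(not (footprint & w) for w in overlapping_commit_writes)

# T1 wrote {"x"} and committed while T2 ran:
print(can_commit({"x"}, {"y"}, [{"x"}]))  # False: T2 restarts, giving T1;T2
print(can_commit({"z"}, {"y"}, [{"x"}]))  # True:  equivalent to a serial order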

Stuart Charlton

Feb 26, 2010, 2:43:27 PM
to cloud-c...@googlegroups.com
Ricky,

There's a (somewhat) insightful description of the serialization anomalies that can arise with MVCC here:
http://en.wikipedia.org/wiki/Snapshot_isolation

It contains a description similar to one of Jim's prior posts, and hints that there may be new developments on this front.

The classic paper on the topic, which offers more detail, is here:
http://www.cs.umb.edu/~poneil/iso.pdf


Stu

Stuart Charlton

Feb 26, 2010, 3:08:40 PM
to cloud-c...@googlegroups.com

On 2010-02-26, at 9:18 AM, Murray Spork wrote:

> On 2/25/10 1:59 PM, "Krishna Sankar" <ksan...@gmail.com> wrote:
>
>> f) From my perspective, NOSQL has value in terms of scalability, internet-
>> scale distributability, and resilience (the requirement for which is a by-product
>> of the internet infrastructure) ... I am not sure one would run Hadoop over
>> Oracle (in its current form; I am aware that database vendors are adding
>> MapReduce capability)
>
> Aren't we missing the other big motivations for NOSQL? Extreme low latency
> and extreme low cost (both update/insert cost and the cost of performing
> aggregations/queries) at massive scale - which I think is orthogonal to
> CAP.


No, CAP is all about the tradeoffs involved with managing data at massive scale.

The use of SQL, however, is completely orthogonal to massive scalability. As is the use or avoidance of joins.

There's plenty of evidence, for example, that for aggregations & queries, MPP data warehouses are scalable and fast, and Hadoop really doesn't carry a performance edge. On the other hand, that's ignoring the relative work & time to do ETL to the warehouse vs. just writing Map & Reduce functions on flat data sets in HDFS.

For extreme low latency, I could see an argument against SQL or joins, yes. But you'd have to define "extreme". Oracle might take 100-400ms to do a write transaction, 50-250ms for a read, with full consistency, joins, etc. Cassandra might do 25ms for a read, maybe < 1ms for a write, but with relaxed consistency and key/value lookup. Which is better depends on your requirements.

That brings us to "low cost", where I would paraphrase Jamie Zawinski: "open source is only free if your time has no value". For a startup, it's a no-brainer to use lower-cost databases. For an enterprise, I think it's a more complicated question.

It's unfortunate that the punditry has conflated all of this. It's completely possible to build an RDBMS for massive scale; it would just look different, and the SQL would feel somewhat different. It wouldn't be as "ACID" in the sense we think of today - there would need to be more settings on the isolation knob than we have today.

Cheers
Stu

Jim Starkey

Feb 26, 2010, 3:43:37 PM
to cloud-c...@googlegroups.com
Ricky Ho wrote:
Jim,

Can you explain why MVCC is not serializable?

In MVCC, every transaction maintains a private copy of all data it has read and updated.  At commit time, the transaction submits the versions of this read/write set and compares them with the versions stored in the DB.  The commit succeeds if the versions match; otherwise the transaction fails.
  
That's not the usual way for MVCC to work.  There are alternatives (mostly used in record-locking schemes), but the original design was to keep a list of record versions, each tagged with the id of the transaction that created that version.  When a transaction starts, it takes note of which transactions have been committed.  When a transaction visits a record, it traverses the version list until it finds a version that it either created or that was committed when it started.  When a transaction wants to update or delete a record, it checks that the most recent version was either previously updated by the transaction or committed when the transaction began.  Otherwise, it either reports an update conflict or, more reasonably, waits for the intervening transaction to end, and reports an error only if the other transaction committed.

What you describe is not MVCC but optimistic concurrency control.  They are not the same.
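
Here is a minimal sketch of that version-list scheme in Python (the
class and method names are mine, for illustration; this is not NimbusDB
code):

class Version:
    def __init__(self, txn_id, value, prev=None):
        self.txn_id = txn_id      # transaction that created this version
        self.value = value
        self.prev = prev          # next older version, or None

class Txn:
    def __init__(self, txn_id, committed_at_start):
        self.id = txn_id
        self.snapshot = set(committed_at_start)

    def visible(self, v):
        # A version is visible if this transaction created it or it was
        # committed when this transaction started.
        return v.txn_id == self.id or v.txn_id in self.snapshot

    def read(self, newest):
        # Walk the version list until a visible version is found.
        v = newest
        while v is not None and not self.visible(v):
            v = v.prev
        return v.value if v else None

    def can_update(self, newest):
        # Updatable only if the most recent version is visible; otherwise
        # it's an update conflict (or: wait for the intervening
        # transaction, and fail only if it commits).
        return newest is None or self.visible(newest)

t1 = Txn(2, committed_at_start={1})
head = Version(txn_id=1, value="v1")             # committed before t1 began
head = Version(txn_id=3, value="v2", prev=head)  # concurrent with t1
print(t1.read(head))        # "v1": the newer version is invisible to t1
print(t1.can_update(head))  # False: update conflict (or wait for txn 3)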

Let's say two concurrent transactions T1 and T2 overlap in time, and let's say T1 commits first (of course it succeeds).  After that, when T2 commits, it will only succeed if its (Read union Write) set does not intersect T1's Write set.

If T2 succeeds, then it is equivalent to T2 starting AFTER T1 commits, which is equivalent to some serial transaction order, either T1, T2 or T2, T1.

If T2 fails, then T2 needs to restart the transaction and read the latest update from T1, which is equivalent to T1, T2.

So it is equivalent to some serial order of transaction execution; why is MVCC not serializable?
I described the example that showed MVCC wasn't serializable in a previous post, but here's the short version:
  1. Single table with one numeric column.
  2. Single transaction type that inserts a row reflecting the number of rows it could see.
  3. A serializable system will have rows with all values from one to <n>
  4. Starting with an empty table, two concurrent MVCC transactions will each store the value zero.
The optimistic concurrency control scheme that you describe will have the same result as MVCC and is also non-serializable.  But it's still a different animal.

Jim Starkey

Feb 27, 2010, 12:38:42 PM
to cloud-c...@googlegroups.com
Happily, not.  The Gilbert and Lynch paper defines availability:
For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response.
In my counterexample, a non-failing node in the surviving partition can respond with success or failure, as appropriate, and a node in a non-surviving partition returns failure.  A key element of the partition resolution algorithm is that a node can determine whether or not it is a member of a surviving partition, and respond accordingly.

I'm afraid you may have confused the formal definition of availability with the concept of "no node left behind."  It's OK to leave a node behind as long as it knows it's been left out in the cold.  Logically, this must be the case, because two non-communicating partitions cannot operate consistently no matter how consistency is defined.
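
In code terms, using the same coterie idea sketched earlier (purely
illustrative):

coteries = [frozenset("abc"), frozenset("cde"), frozenset("ace")]

def respond(partition, request):
    # Gilbert/Lynch availability: every request to a non-failing node
    # gets a response. A node on the losing side of a partition responds
    # too -- with a refusal, because it knows it has been left behind.
    if any(c <= partition for c in coteries):
        return ("ok", request)    # proceed; may still succeed or fail
    return ("refused", "not in a surviving partition")

print(respond(set("abc"), "commit T1"))  # ('ok', 'commit T1')
print(respond(set("de"), "commit T2"))   # ('refused', ...)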



-- 
Jim Starkey
NimbusDB, Inc.
978 526-1376

Ricky Ho

Feb 27, 2010, 10:53:19 AM
to cloud-c...@googlegroups.com
Thanks for the elaboration.

The MVCC mechanism you describe is not serializable, and it also has a problem: it only checks for overlap of the WRITE set, not the READ set.

If a transaction reads X and sets Y = X + 1, and then commits, it should fail if X has been modified and committed by another transaction in between.  But the MVCC in your description doesn't catch this problem, because no transaction modified Y in between.

The key idea is that the transaction should mark the "version" not just of what it has written, but also of "what it depends on" (in my example above, this is X).

In your example, the "record count" is what the transaction depends on, but unfortunately it is not a shared record in the DB that can be used for detecting conflicts.  To make this "dependent data" explicit, I would create a special record for this, "global_count".  The transaction will update this global_count as well as adding the new record before commit, so your MVCC transaction will look like this:

start transaction
   count = 0
   for each record in DB
      count ++
   insert new record <id, count>
   update global_count = global_count + 1   <--- Look at this line
end

Rgds,
Ricky
