Brewer's CAP Conjecture is False

Jim Starkey

unread,

Feb 25, 2010, 2:38:13 PM2/25/10

to cloud-c...@googlegroups.com

Brewer's CAP conjecture is that a networked system cannot be consistent,
available, and partition tolerant. A long description can be found at

http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

and a purported "proof" at

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf

The CAP conjecture, I am convinced, is false and can be proven false.

The counterexample to the conjecture and the flaw in the "proof" is that
each assumes that a system needs to be serializable to be consistent.
Serializability is a sufficient condition for consistency, but is not a
necessary condition. A counterexample of a system that demonstrates CAP
false has the following characteristics:

1. The system is transactional, i.e. all updates within a transaction
are tentative until committed atomically.
2. Inconsistency can be detected during execution of a transaction so
that a transaction that would otherwise introduce inconsistency is
prevented from committing.
3. Commits are executed in the context of a partition tolerant commit
algorithm (a transaction is not committed until it has been
reported as committable to all potential surviving partitions).

The only serious question is whether a system can be both
non-serializable and consistent, and it isn't much of question given a
formal definition of consistency (concurrent transactions can't
overwrite each other, unique constraints must appear unique, etc.). It
is easier, cheaper, and bullet proof to serialize updates to specific
elements within a transaction than to try to serialize transactions
themselves. For example, every database record might have a
deterministic resolution agent that actively prevents concurrent update
in an multi-version concurrency control (MVCC) database.

The quintessential problem is that academic computer science has been
conflating the concepts of consistent and serializable for so long that
it has forgotten that they are not the same thing. If you substitute
"serializable" for "consistent", the CAP conjecture becomes the SAP
conjecture, and is probably true. So the trick is to be CAP without
being a SAP.

The CAP conjecture has been a theoretical millstone around the neck of
all ACID systems. Good riddance.

This is the first wooden stake for the heart of the noSQL movement.
There are more coming.

--
Jim Starkey
Founder, NimbusDB, Inc.
978 526-1376

Stuart Charlton

unread,

Feb 25, 2010, 3:58:19 PM2/25/10

to cloud-c...@googlegroups.com

Jim,

I had thought the whole point of the CAP theorem was to describe
fundamental tradeoffs. No one that truly understands the implications
of that theorem would argue that relaxing C would enable improves
levels of both A and P. This is what I tried to explain in my
"Designing for the Cloud" slides: http://www.slideshare.net/StuC/oopsla-cloud-workshop-designing-for-the-cloud-elastra

In other words, it's not that the conjecture is false, it's that
people need to check their assumptions in applying it. It's not a
black and white scenario where consistency kills you, full stop.

Where I agree with you is the NoSql crowd's tendency, like with all
fads, to over generalize, and to ignore evidence in front of their
noses. "Eventual consistency" ala Dynamo is only one way to skin the
scalability cat.

Stu

Sent from my iPhone

> --
> ~~~~~
> Register Today for Cloud Slam 2010 at official website - http://cloudslam10.com
> Posting guidelines: http://groups.google.ca/group/cloud-computing/web/frequently-asked-questions
> Follow us on Twitter http://twitter.com/cloudcomp_group or
> @cloudcomp_group
> Post Job/Resume at http://cloudjobs.net
> Buy 88 conference sessions and panels on cloud computing on DVD at http://www.amazon.com/gp/product/B002H07SEC
> , http://www.amazon.com/gp/product/B002H0IW1U or get instant access
> to downloadable versions at http://cloudslam09.com/content/registration-5.html
>
> ~~~~~
> You received this message because you are subscribed to the Google
> Groups "Cloud Computing" group.
> To post to this group, send email to cloud-c...@googlegroups.com
> To unsubscribe from this group, send email to cloud-computi...@googlegroups.com

Greg Pfister

unread,

Feb 25, 2010, 4:34:26 PM2/25/10

to Cloud Computing

Interesting and useful, Jim. I've always been leery of CAP, myself.

One question/comment:

Isn't it the case that your counterexample actually is serializable,
just not locally (with transaction) serializable? Instead, it's
serializable at the transaction level.

Greg Pfister
http://perilsofparallel.blogspot.com/

On Feb 25, 12:38 pm, Jim Starkey <jstar...@nimbusdb.com> wrote:
> Brewer's CAP conjecture is that a networked system cannot be consistent,
> available, and partition tolerant. A long description can be found at
>
> http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
>
> and a purported "proof" at
>

> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=...

Krishna Sankar

unread,

Feb 25, 2010, 4:59:02 PM2/25/10

to cloud-c...@googlegroups.com

Jim,
Interesting ... Some points:

a) NOSQL != !(SQL)
b) and so, !(NOSQL) != SQL
c) The unfortunate NOSQL term doesn't mean NoSQL (notice the capitalization). IMHO, what it says is that there are scenarios and characteristics which require less strict ACID and various flavors (ket-value store, document store, data srtucture store et al)
d) Hence, saying CAP theorem is false does not mean end of NOSQL debate
e) Granted, there is lots of rhetoric going on, but that is hardly a basis for sound theory or for that matter practice. For example a blog [http://blogs.forrester.com/appdev/2010/02/nosql.html] talks about NoSQL. Have no clue why folks write about things without researching through and that is a problem. In fact the blogger is going to write a report - am not sure if it is a threat or a promise ;o)
f) From my perspective the NOSQL has value in terms of scalability, internet scale distributability, resilience (the requirement of which is a by product of the internet infrastructure) ... I am not sure one would run Hadoop over Oracle (in it's current form. I am aware that database vendors are adding the mapReduce capability)
g) There is the NOSQL crowd and the NoSQL crowd, which of course, are not the same ...

Cheers
<k/>

> --
> ~~~~~
> Register Today for Cloud Slam 2010 at official website - http://cloudslam10.com
> Posting guidelines: http://groups.google.ca/group/cloud-computing/web/frequently-asked-questions
> Follow us on Twitter http://twitter.com/cloudcomp_group or @cloudcomp_group
> Post Job/Resume at http://cloudjobs.net

> Buy 88 conference sessions and panels on cloud computing on DVD at http://www.amazon.com/gp/product/B002H07SEC, http://www.amazon.com/gp/product/B002H0IW1U or get instant access to downloadable versions at http://cloudslam09.com/content/registration-5.html

Jim Starkey

unread,

Feb 25, 2010, 5:21:54 PM2/25/10

to cloud-c...@googlegroups.com

Greg Pfister wrote:

Interesting and useful, Jim. I've always been leery of CAP, myself.

One question/comment:

Isn't it the case that your counterexample actually is serializable,
just not locally (with transaction) serializable? Instead, it's
serializable at the transaction level.

Here's a proof that it isn't serializable:

Assume a single table with a single numeric column.
Assume a single transaction type T that counts the number of rows and stores a new row with that count.
Two concurrent transactions start against an empty table. Each stores the number 0.
If the system were serializable, there could not be duplicate values. QED

If your really wanted to enforce unique values for the column, you could put a unique index on it which would cause one of the concurrent transactions to fail.

MVCC (multi-version concurrency control) has always been consistent but non-serializable (Bill Noyce, once of DEC, gets full credit for the proof). Apply MVCC to the cloud, add agents to serialize updates of discrete elements, and magic happens.

Jim Starkey

unread,

Feb 25, 2010, 5:41:21 PM2/25/10

to cloud-c...@googlegroups.com

Krishna Sankar wrote:

Jim,
	Interesting ... Some points:

	a)	NOSQL != !(SQL)
	b)	and so, !(NOSQL) != SQL
	c)	The unfortunate NOSQL term doesn't mean NoSQL (notice the capitalization). IMHO, what it says is that there are scenarios and characteristics which require less strict ACID and various flavors (ket-value store, document store, data srtucture store et al)
	d)	Hence, saying CAP theorem is false does not mean end of NOSQL debate
	e)	Granted, there is lots of rhetoric going on, but that is hardly a basis for sound theory or for that matter practice. For example a blog [http://blogs.forrester.com/appdev/2010/02/nosql.html] talks about NoSQL. Have no clue why folks write about things without researching through and that is a problem. In fact the blogger is going to write a report - am not sure if it is a threat or a promise ;o)
	f)	From my perspective the NOSQL has value in terms of scalability, internet scale distributability, resilience (the requirement of which is a by product of the internet infrastructure) ... I am not sure one would run Hadoop over Oracle (in it's current form. I am aware that database vendors are adding the mapReduce capability) 
	g)	There is the NOSQL crowd and the NoSQL crowd, which of course, are not the same ...

Can we compromise on "NoSequel"?

The common denominator is not the absence of SQL but the absence of atomic transactions, which makes SQL untenable (MySQL and Monty haveno problem with SQL without transactions, but that's another story). Key/value stores compensate with arbitrarily complex "rows", which allows them to package everything associated with a shopping cart transaction in a single row.

I don't think there is any user demand for key/value stores or Map/Reduce data stores (exception, maybe business analytics). There is user demand to data stores that scale and provide geographically disperse redundancy. If these requirements can be met with relational databases, NoSequel has almost nothing to offer.

I'm beginning to enjoy saying that NimbusDB is hardware fault tolerant, software fault tolerant, and geological fault tolerant. Haven't decided whether human fault tolerant goes in V1 or V2.

Jim Starkey

unread,

Feb 25, 2010, 5:44:15 PM2/25/10

to cloud-c...@googlegroups.com

Stuart Charlton wrote:
> Jim,
>
> I had thought the whole point of the CAP theorem was to describe
> fundamental tradeoffs. No one that truly understands the implications
> of that theorem would argue that relaxing C would enable improves
> levels of both A and P. This is what I tried to explain in my
> "Designing for the Cloud" slides:
> http://www.slideshare.net/StuC/oopsla-cloud-workshop-designing-for-the-cloud-elastra
>

My point is that there does not need to be a tradeoff at all. It is
perfectly possible to construct a system that is consistent, available,
and partition tolerant. It just can't be serializable.

My slightly deeper point is that "database theory" is based as much on
folk lore than actual provable results. The consequences of this is
that progress in computing -- particularly in cloud computing, which has
these issues in spades -- has been unnecessarily retarded by shallow
thinking. The case in point is database scalability (insert shameless
plug for talk here). If you conflate consistency with serializability
you can prove that a single database can't be run on multiple computers
with those computers executing in lock step. If you are willing to
knock over a sacred cow, it is apparent that a database can scale
elastically to meet an arbitrary demand and can be geographically
disperse, each of which the CAP conjecture ('tain't a theorem no more)
says can't be done.

This is the point in history where database systems leave the rails and
start to fly.

>
> In other words, it's not that the conjecture is false, it's that
> people need to check their assumptions in applying it. It's not a
> black and white scenario where consistency kills you, full stop.

It is black and white. The CAP conjecture is, in fact, false. The SAP
conjecture is a different story, but who should care about
serializability if consistency can be achieved other ways? If you don't
have transactions that can be rolled back, I'll concede some validity
for the CAP conjecture, but then in the absence of transactions it's
next to impossible to achieve consistency anyway. But on it's face,
the CAP conjecture is just wrong.

Sassa

unread,

Feb 25, 2010, 5:56:14 PM2/25/10

to Cloud Computing

It may depend on a difference in what C, A and P mean to you.

I don't see serialization as a required implementation of
transactional consistency.

If P(artitioning) means loss of communication between transaction
participants, the transaction can only be rolled back and a new one
cannot be started. If you can run transactions without requiring the
partitions to be consistent (i.e. others to know you are having a
transaction), you aren't partitioned.

I think MVCC or not, all parts of the system need to agree
(=Consistency) on the values of the attributes being updated, before
commit ends.

So starting new transactions can work. But how do you commit without
propagating the knowledge about this among interested but partitioned
parties?

Sassa

On Feb 25, 7:38 pm, Jim Starkey <jstar...@nimbusdb.com> wrote:
> Brewer's CAP conjecture is that a networked system cannot be consistent,
> available, and partition tolerant. A long description can be found at
>
> http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
>
> and a purported "proof" at
>

> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=...

Gabriel Mateescu

unread,

Feb 25, 2010, 7:12:22 PM2/25/10

to cloud-c...@googlegroups.com

On Thu, Feb 25, 2010 at 2:38 PM, Jim Starkey <jsta...@nimbusdb.com> wrote:
> Brewer's CAP conjecture is that a networked system cannot be consistent,

> available, and partition tolerant. [...]

> The CAP conjecture, I am convinced, is false and can be proven false.
>
> The counterexample to the conjecture and the flaw in the "proof" is that
> each assumes that a system needs to be serializable to be consistent.

I guess, it depends on how you define consistency.

In my opinion, if C in CAP stands for Strong Consistency,
then the CAP theorem is correct. However, for relaxed
forms of consistency, e.g., eventual consistency,

http://en.wikipedia.org/wiki/Eventual_consistency

all three properties could be true at the same time.

Gabriel

Miha Ahronovitz

unread,

Feb 25, 2010, 11:08:47 PM2/25/10

to cloud-c...@googlegroups.com

Gabriel,

This is clearest contribution to this thread, which probably is familiar to about 300 hundred people on this group. This means 9,000 people read this emails "ca mitza in calendar" (Romanian) which means they are completely outside the topic.
Engineering is about compromises. So (reading from the wikipedia link you provided),if we accept BASE (Basically Available, Soft state, Eventual consistency), as opposed to the database concept of ACID ((atomicity, consistency, isolation, durability) we Consistency too

Now has this to do with the clouds? I think what Jim wants to say that his idea of a Nimbus data base that one can upload to the cloud, and automatically joins seamlessly all other Nimbus DB uploaded by thousands others to the cloud, and distributes itself on available servers (Scalable) and has Partition Tolerance( if one or more servers fail, also can be Consistent is we accept the "eventual consistencies" like BASE instead of ACID consistency.)

That means there is a solution. Sure banks want ACID, bur there must be a market, huge, outside finance that will tolerate BASE. That means Nimbus has a theoretical future

How big is this market BASE market?

Miha

From: Gabriel Mateescu <gabriel....@gmail.com>
To: cloud-c...@googlegroups.com
Sent: Thu, February 25, 2010 4:12:22 PM
Subject: Re: [ Cloud Computing ] Brewer's CAP Conjecture is False

Jeff Darcy

unread,

Feb 26, 2010, 12:12:51 AM2/26/10

to cloud-c...@googlegroups.com

Jim Starkey wrote:
> Brewer's CAP conjecture is that a networked system cannot be consistent,
> available, and partition tolerant. A long description can be found at
>
> http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
>
> and a purported "proof" at
>
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf
>
> The CAP conjecture, I am convinced, is false and can be proven false.

Nice to see you're finally reading Lynch, Jim, only a month or two after
you clearly had no idea who she was or what the formal underpinnings of
CAP were. Unfortunately, you're a little premature in declaring victory
here. Refuting or limiting a proof is not the same as proving its
conclusion is false. If I say that 2+2=4 on Mondays, and you point out
that my proof is inapplicable six days of the week, you still haven't
proven that 2+2 equals anything but four on Tuesday through Sunday.
Similarly, even if Gilbert and Lynch's proof applies only to
serializable consistency, that does not preclude a separate and more
general proof of CAP. Most branches of mathematics are full of examples
where a theorem was proven for successively more general cases before it
was proven in its entirety.

Turning now to the substance of your supposed refutation, it seems not
to address the issue of when a transaction terminates - and it must
terminate in all cases to satisfy the Gilbert/Lynch definition of
availability (section 2.2). As soon as you need to ensure that
transactions terminate, you need to have a mechanism to coordinate
rollback of one or more transactions that might span multiple of your
"deterministic resolution agents" even in the presence of arbitrary
partitions. I'm sure many in the computer science community eagerly
await your explanation of how to solve that particular problem.

That's only the first problem that I noticed. I'll be glad to point out
more if you'd like. Someone might still manage to prove that there is
some useful set of definitions for which CAP is provably false, but it's
going to take an awful lot more than an "MVCC is magic" hand-wave. Be
careful you don't replace the C in CAP with Completely Ridiculous. ;)

Stuart Charlton

unread,

Feb 26, 2010, 1:17:11 AM2/26/10

to cloud-c...@googlegroups.com

On Thu, Feb 25, 2010 at 2:44 PM, Jim Starkey <jsta...@nimbusdb.com> wrote:

Stuart Charlton wrote:

Jim,

I had thought the whole point of the CAP theorem was to describe fundamental tradeoffs. No one that truly understands the implications of that theorem would argue that relaxing C would enable improves levels of both A and P. This is what I tried to explain in my "Designing for the Cloud" slides: http://www.slideshare.net/StuC/oopsla-cloud-workshop-designing-for-the-cloud-elastra

My point is that there does not need to be a tradeoff at all. It is perfectly possible to construct a system that is consistent, available, and partition tolerant. It just can't be serializable.

OK, but that's what I thought section 4.4 of Gilbert & Lynch's proof was talking about, when they bring up Weaker Consistency in a Partially Synchronous case. You're absolutely right that you can have all three properties, if you're willing to tolerate consistent, but stale data.

I think the counter-example you originally give doesn't actually prove your point, if I'm reading it correctly. Your point #2, hurts the "A" in CAP, since you're preventing a transaction from committing because it might introduce inconsistency. Now, there are normal database integrity reasons for preventing an inconsistent update (optimistic concurrency checks, for example), which are fine. But, if my database chooses to prevent inconsistent updates because of a partition, I'm reducing availability for the sake of consistency -- I've chosen C & P at the expense of A.

On the other hand, what I think WOULD fit your broader point would be to consider an MVCC case with some kind of snapshot (not serializable) isolation. In that case, the client might only be able to access stale, but consistent data, relative to some previous snapshot epoch.

This case is what I think you're going after -- you relax C to snapshot-level consistency, to enable both Availability (my transactions will commit because they were based on a known snapshot), and Partition tolerance (which just means the data is going to be more stale than it would be if my network was golden). The devil would be in the details, of course. This is more or less how Vertica works today, for example, and I gather one of the things Stonebraker's trying to do with VoltDB based on their 2-page teaser.

My slightly deeper point is that "database theory" is based as much on folk lore than actual provable results. The consequences of this is that progress in computing -- particularly in cloud computing, which has these issues in spades -- has been unnecessarily retarded by shallow thinking. The case in point is database scalability (insert shameless plug for talk here). If you conflate consistency with serializability you can prove that a single database can't be run on multiple computers with those computers executing in lock step. If you are willing

Agree with you generally here.

to knock over a sacred cow, it is apparent that a database can scale elastically to meet an arbitrary demand and can be geographically disperse, each of which the CAP conjecture ('tain't a theorem no more) says can't be done.

My objection is that I don't think the CAP conjecture actually says it can't be done. It's the "popular & flawed interpretation of CAP by a new generation of young developers" that might be claiming it can't be done.

There's a serious bandwagon effect on multiple fronts every time we have major technology shifts like this, and it takes years to sort out the folklore from the reality. I applaud your efforts :)

Stu

Ricky Ho

unread,

Feb 26, 2010, 10:02:42 AM2/26/10

to cloud-c...@googlegroups.com

Jim,

Interesting ! Can you define "consistency" ?

In my opinion, the whole objective of "consistency" is to make reasoning the application logic easy within a concurrent environment. Imagine our DB is a state machine and each user action is a trigger of a state transition. Without such isolation, the possible combination of user action (from different users) is a combinatorial explosion which make it almost impossible to reason the application's behavior.

To simplify our analysis, we "flatten" the combinatorial explosion by disallowing user actions (within a logical unit of work) interlacing with each other. In other words, the logical unit of work for each user is as if it has exclusive access to all states. This is the fundamental of serializability.

So I agree with you. From a pure theoretical perspective, "serializability" is a means to achieve the goal of "consistency". But practically, I don't see any other better alternatives.

Think about the difference between programming in a multi-thread environment using LOCKs, and compare that with Software Transaction Memory, you come to the same picture.

Rgds,
Ricky

----- Original Message ----
From: Jim Starkey <jsta...@nimbusdb.com>
To: cloud-c...@googlegroups.com

Sent: Thu, February 25, 2010 11:38:13 AM
Subject: [ Cloud Computing ] Brewer's CAP Conjecture is False

Brewer's CAP conjecture is that a networked system cannot be consistent,
available, and partition tolerant. A long description can be found at

http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

and a purported "proof" at

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf

The CAP conjecture, I am convinced, is false and can be proven false.

The counterexample to the conjecture and the flaw in the "proof" is that

each assumes that a system needs to be serializable to be consistent.

Ricky Ho

unread,

Feb 26, 2010, 10:03:06 AM2/26/10

to cloud-c...@googlegroups.com

In your proof example, if you turn on "serializable" in isolation level, one of the transaction will fail.

I think MVCC is just another way of implementing Serialization, instead of using a pessimistic LOCK strategy, it use an optimistic version mechanism. In case if the version has updated, the transaction will fail, which is equivalent to the transaction has execute successfully before but overwritten by a later transaction.

Rgds,
Ricky

From: Jim Starkey <jsta...@nimbusdb.com>
To: cloud-c...@googlegroups.com

Sent: Thu, February 25, 2010 2:21:54 PM
Subject: Re: [ Cloud Computing ] Re: Brewer's CAP Conjecture is False

Greg Pfister wrote:

Interesting and useful, Jim. I've always been leery of CAP, myself.

One question/comment:

Isn't it the case that your counterexample actually is serializable,
just not locally (with transaction) serializable? Instead, it's
serializable at the transaction level.

Here's a proof that it isn't serializable:

Assume a single table with a single numeric column.
Assume a single transaction type T that counts the number of rows and stores a new row with that count.
Two concurrent transactions start against an empty table. Each stores the number 0.
If the system were serializable, there could not be duplicate values. QED

If your really wanted to enforce unique values for the column, you could put a unique index on it which would cause one of the concurrent transactions to fail.

MVCC (multi-version concurrency control) has always been consistent but non-serializable (Bill Noyce, once of DEC, gets full credit for the proof). Apply MVCC to the cloud, add agents to serialize updates of discrete elements, and magic happens.

Greg Pfister
http://perilsofparallel.blogspot.com/



On Feb 25, 12:38 pm, Jim Starkey <jstar...@nimbusdb.com> wrote:

Brewer's CAP conjecture is that a networked system cannot be consistent,
available, and partition tolerant.  A long description can be found at

     http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

and a purported "proof" at

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=...

Jim Starkey

unread,

Feb 26, 2010, 10:54:35 AM2/26/10

to cloud-c...@googlegroups.com

No need to wait. Hector Garcia Molina solved it 25 years ago. A set of machines agree on a set on non-disjoint "coteries", where each coterie is a subset of the machines in the system. After a partition event, a partition needs to contain at least one complete coterie to survive. Since there are no disjoint coteries and no machine can be in more than one partition, there is at most one surviving coterie. If no partition contains a complete coterie, everyone must shut down.

If a precommit is delivered to at least one machine in every coterie before it is considered committed, then it must have been reported in the surviving coterie. If the precommit reaches some but not all of the systems, it will be committed if the precommit reached a machine in the surviving partition, otherwise it will be rolled back.

That's only the first problem that I noticed.  I'll be glad to point out
more if you'd like.  Someone might still manage to prove that there is
some useful set of definitions for which CAP is provably false, but it's
going to take an awful lot more than an "MVCC is magic" hand-wave.  Be
careful you don't replace the C in CAP with Completely Ridiculous.  ;)

Please do. I eagerly await your reply.

Gilad Parann-Nissany

unread,

Feb 26, 2010, 11:13:05 AM2/26/10

to cloud-c...@googlegroups.com

Hi Jim

In your approach, if I get it, a transaction is broken down into multiple individually serializable sub-steps (each within a specific partition). Is that what you meant?

If so, some Qs:

1. how do you achieve consistency across these multiple sub-steps? If one fails, how do you roll back all the steps that were already committed? Are we talking here about sub-transactions?

2. is this new DB holding multiple copies of each partition (good for robustness & durability)? If so, does each sub-transaction "finish" only after all copies are updated?

Thanks.
Regards
Gilad
__________________
Gilad Parann-Nissany
CEO, Founder
Porticor Cloud Security
http://www.porticor.com/

Jeff Darcy

unread,

Feb 26, 2010, 11:21:19 AM2/26/10

to cloud-c...@googlegroups.com

On 02/26/2010 10:54 AM, Jim Starkey wrote:
> No need to wait. Hector Garcia Molina solved it 25 years ago. A set of
> machines agree on a set on non-disjoint "coteries", where each coterie
> is a subset of the machines in the system. After a partition event, a
> partition needs to contain at least one complete coterie to survive.

In other words, you give up availability. Brewer made exactly this
point on slide 16 of his 2000 PODC keynote
(http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf) where
he introduced the CAP Theorem. Congratulations, you've invented CP.

Pranta Das

unread,

Feb 26, 2010, 11:32:27 AM2/26/10

to cloud-c...@googlegroups.com

I completely agree. We live in a world where "One size does not fit all" but where consistency requirements will always be hybrid - some stringent (or serializable) others more relaxed (or loose).

If I remember my high-school chemistry well,

ACID + BASE = SALT (Somewhat Acceptable Loosely-consistent Transactions) + H2O (Isn't that what Clouds are made of?)

Cheers,
Pranta

From: Miha Ahronovitz <mij...@sbcglobal.net>
To: cloud-c...@googlegroups.com
Sent: Thu, February 25, 2010 8:08:47 PM

Jim Starkey

unread,

Feb 26, 2010, 11:44:46 AM2/26/10

to cloud-c...@googlegroups.com

Ricky Ho wrote:
> Jim,
>
>
> Interesting ! Can you define "consistency" ?
>

I don't believe there is -- or can be -- a universal definition of
consistency. Consistency must be what you dclare it to be. In a
database context, consistency is defined by the set of constraint
declarations: A set of unique constraints, a set of referential
integrity constraints, constraints on domains, etc. Other than simple
rules like "a record can't update a version of a record it couldn't
see", consistency should be explicit.

>
> In my opinion, the whole objective of "consistency" is to make reasoning the application logic easy within a concurrent environment. Imagine our DB is a state machine and each user action is a trigger of a state transition. Without such isolation, the possible combination of user action (from different users) is a combinatorial explosion which make it almost impossible to reason the application's behavior.
>

It is easy to demonstrate that a serializable database has at every
point a single definitive state. A non-serializable database, even a
consistent one, may never have a single definitive state due to message
skew of commit messages. On each node, a transaction sees the set of
transactions that have been reported committed, but since commit
messages are delivered asynchronously, concurrent transactions on
different nodes will see different sets of committed transactions, hence
no definitive database state.

But given that there isn't a single definitive state, the model of state
machine doesn't quite apply.

> To simplify our analysis, we "flatten" the combinatorial explosion by disallowing user actions (within a logical unit of work) interlacing with each other. In other words, the logical unit of work for each user is as if it has exclusive access to all states. This is the fundamental of serializability.
>
> So I agree with you. From a pure theoretical perspective, "serializability" is a means to achieve the goal of "consistency". But practically, I don't see any other better alternatives.
>

MVCC allows a transaction to a stable view of a volatile database,
subject to the rule that a transaction can't modify a version of record
it didn't see. MVCC isn't serializable, but can be made to enforce any
arbitrary definition of consistency.

Murray Spork

unread,

Feb 26, 2010, 12:18:55 PM2/26/10

to cloud-c...@googlegroups.com

On 2/25/10 1:59 PM, "Krishna Sankar" <ksan...@gmail.com> wrote:

> f) From my perspective the NOSQL has value in terms of scalability, internet
> scale distributability, resilience (the requirement of which is a by product
> of the internet infrastructure) ... I am not sure one would run Hadoop over
> Oracle (in it's current form. I am aware that database vendors are adding the
> mapReduce capability)

Aren't we missing the other big motivations for NOSQL?: extreme low latency
and extreme low cost (both update/insert cost and the cost of performing
aggregations/ queries) at massive scale - which I think is orthogonal to
CAP.

If Jim can give me those things as well as solving the CAP problem - then
sign me up!

--
Murray

Ricky Ho

unread,

Feb 26, 2010, 12:45:00 PM2/26/10

to cloud-c...@googlegroups.com

Jim,

Can you explain why MVCC is not serializable ?

In MVCC, every transaction maintaining a private copy of all data it has read and update. At commit time, the transaction submit the versions of this read/write set and compare the version stored in the DB. Commit will be successful in case the version matches, otherwise the transaction fail.

Lets say there are two concurrent transactions T1 and T2 overlap in time, and lets say T1 commit first (of course it succeed). After that T2 commits, it will only succeed if its (Read Union Write Set) does not intersect with T1's Write Set.

If T2 succeed, then it is equivalent to T2 starts AFTER T1 commits, which is equivalent to some serial transaction order, either T1, T2 or T2, T1

If T2 fails, then T2 need to restart the transaction and read the latest update from T1, which is equivalent to T1, T2.

So it is equivalent to some serial order of transaction execution, why is MVCC not serializable ?

Stuart Charlton

unread,

Feb 26, 2010, 2:43:27 PM2/26/10

to cloud-c...@googlegroups.com

Ricky,

There's a (somewhat) insightful description of the serialization anomalies that can arise with MVCC here:
http://en.wikipedia.org/wiki/Snapshot_isolation

Which is contains a description similar to one of Jim's prior posts, and hints that there may be new developments on this front.

IThe classic paper on the topic is here, which offers more detail....
http://www.cs.umb.edu/~poneil/iso.pdf

Stu

Stuart Charlton

unread,

Feb 26, 2010, 3:08:40 PM2/26/10

to cloud-c...@googlegroups.com

On 2010-02-26, at 9:18 AM, Murray Spork wrote:

> On 2/25/10 1:59 PM, "Krishna Sankar" <ksan...@gmail.com> wrote:
>
>> f) From my perspective the NOSQL has value in terms of scalability, internet
>> scale distributability, resilience (the requirement of which is a by product
>> of the internet infrastructure) ... I am not sure one would run Hadoop over
>> Oracle (in it's current form. I am aware that database vendors are adding the
>> mapReduce capability)
>
> Aren't we missing the other big motivations for NOSQL?: extreme low latency
> and extreme low cost (both update/insert cost and the cost of performing
> aggregations/ queries) at massive scale - which I think is orthogonal to
> CAP.

No, CAP is all about the tradeoffs involved with managing data at massive scale.

The use of SQL, however, is completely orthogonal to massive scalability. As is the use or avoidance of joins.

There's plenty of evidence, for example, that aggregations & queries, MPP data warehouses are scalable and fast, and Hadoop really doesn't carry a performance edge. On the other hand, that's ignoring the relative work & time to do ETL to the warehouse vs. just writing Map & Reduce functions on flat data sets in HDFS.

For extreme low latency, I could see an argument against SQL or joins, yes. But you'd have to define "extreme". Oracle might take 100-400ms do a write transaction, 50-250ms for a read, with full consistency, joins, etc. Cassandra might do 25ms for a read, maybe < 1ms for a write, but with relaxed consistency and key/value lookup. Which is better depends on your requirements.

That brings us to "Low Cost", where I would paraphrase Jamie Zawinski: "open source is only free if your time has no value". For a startup, it's a no-brainer to use lower cost databases. For an enterprise, I think it's a more complicated question.

It's unfortunate that the punditry has conflated all of this. It's completely possible to build an RDBMS for massive scale, it just would look differently, and the SQL would feel somewhat differently. It wouldn't be as "ACID" as the sense we think of today - there would need to be more settings on the isolation knob than we have today.

Cheers
Stu

Jim Starkey

unread,

Feb 26, 2010, 3:43:37 PM2/26/10

to cloud-c...@googlegroups.com

Ricky Ho wrote:

Jim,

Can you explain why MVCC is not serializable ?

In MVCC, every transaction maintaining a private copy of all data it has read and update.  At commit time, the transaction submit the versions of this read/write set and compare the version stored in the DB.  Commit will be successful in case the version matches, otherwise the transaction fail.

That's not the usual way for MVCC to work. There are alternatives (mostly used in record locking schemes), but the original design was to keep a list of record versions each tagged with the id of the transaction that created that version. When a transaction starts, it takes note of what transactions have been committed. When a transaction visits a record, it traverses the version list until it finds a record that either it created or was committed when it started. When a transaction wants to update or delete a record, it checks that the most recent version was either previously updated by the transaction or committed when the transaction began. Otherwise, it either reports an update conflict or, more reasonably, waits for an intervening transaction to end, and reports an error only if the other transaction committed.

What you describe is not MVCC but optimistic concurrency control. They are not the same.

Lets say there are two concurrent transactions T1 and T2 overlap in time, and lets say T1 commit first (of course it succeed).  After that T2 commits, it will only succeed if its (Read Union Write Set) does not intersect with T1's Write Set.

If T2 succeed, then it is equivalent to T2 starts AFTER T1 commits, which is equivalent to some serial transaction order, either T1, T2 or T2, T1

If T2 fails, then T2 need to restart the transaction and read the latest update from T1, which is equivalent to T1, T2.

So it is equivalent to some serial order of transaction execution, why is MVCC not serializable ?

I described the example that showed MVCC wasn't serializable in a previous post, but here's the short version:

Single table with one numeric column.
Single transaction type that inserts a row reflecting the number of rows it could see.
A serializable system will have rows with all values from one to <n>
Starting with an empty table, two concurrent MVCC transactions will each store the value zero.

The optimistic concurrency control scheme that you describe will have the same result as MVCC and is also non-serializable. But it's still a different animal.

Jim Starkey

unread,

Feb 27, 2010, 12:38:42 PM2/27/10

to cloud-c...@googlegroups.com

Happily, not. The Gilbert and Lynch paper define availability:

For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response.

In my counter-example, a non-failing node in the surviving partition can respond with a success or failure, as appropriate, and a node in a non-surviving partition return failure. A key element of the partition resolution algorithm is that a node can determine whether it is a member of a surviving partition or not, and respond accordingly.

I'm afraid you may have confused the formal definition of availability with the concept of "no node left behind." It's ok to leave a node behind as long as it knows it's been left out in the cold. Logically, this must be the case because two non-communicating partitions cannot operate consistently no matter how consistent is defined.

-- 
Jim Starkey
NimbusDB, Inc.
978 526-1376

Ricky Ho

unread,

Feb 27, 2010, 10:53:19 AM2/27/10

to cloud-c...@googlegroups.com

Thanks for elaboration.

The MVCC mechanism your describe is not serializable and also will have problem since it only checks the overlapping of WRITE set but not the READ set.

If a transaction read X and set Y = X + 1, and then commit. It should fail if X has been modified and commit by other transaction in between. But the MVCC in your description doesn't catch this problem because there is no transaction modifies Y in between.

The key idea is that the transaction should mark the "version" of not just what it has written, but also "what it depends on". (in my above example, this is X)

In your example, the "record count" is what the transaction depends on, but unfortunately this is not a shared record in the DB for detecting conflict. To make this "dependent data" explicit, I would create a special record for this "global_count". So the transaction will update this global_count, and add the record before commit. So your MVCC transaction will like this ..

start transaction
   count = 0
   for each record in DB
      count ++
   insert new record <id, count>
   update global_count = global_count + 1   <--- Look at this line
end

Rgds,
Ricky

From: Jim Starkey <jsta...@nimbusdb.com>
To: cloud-c...@googlegroups.com

Sent: Fri, February 26, 2010 12:43:37 PM

Jim Starkey

unread,

Feb 27, 2010, 2:35:47 PM2/27/10

to cloud-c...@googlegroups.com

Ricky Ho wrote:

Thanks for elaboration.

The MVCC mechanism your describe is not serializable and also will have problem since it only checks the overlapping of WRITE set but not the READ set.

The entire concept of read set is problematic and most likely deterministic. There are two problems. The first is what is a read? Is it a record that didn't match the selection criterion in the read set or not? If so, what about a record that wasn't read because of an index optimization? The second problem is phantoms -- records that weren't read because they hadn't been inserted yet.

Optimistic concurrency control doesn't actually work very well. It can't, for example, detect ABA problems. I've seen it used by applications layers to give some semblance of consistency to a non-transactional data store, but I know of no database implementation that relies on it.

If a transaction read X and set Y = X + 1, and then commit. It should fail if X has been modified and commit by other transaction in between. But the MVCC in your description doesn't catch this problem because there is no transaction modifies Y in between.

Ricky, the point is, there is no problem. Serialization is not required for consistency. In the "Noyce" example, nothing in the database definition declared that values in that table had to be unique, so nothing was violated with duplicate values. If the definition of consistency required that values be unique, it could be easily enforced.

The key idea is that the transaction should mark the "version" of not just what it has written, but also "what it depends on". (in my above example, this is X)

In your example, the "record count" is what the transaction depends on, but unfortunately this is not a shared record in the DB for detecting conflict. To make this "dependent data" explicit, I would create a special record for this "global_count". So the transaction will update this global_count, and add the record before commit. So your MVCC transaction will like this ..

start transaction
   count = 0
   for each record in DB
      count ++
   insert new record <id, count>
   update global_count = global_count + 1   <--- Look at this line
end

As I said above, declaring a unique constraint for the column is a better solution. More specifically, a global count creates a hot spot that any database developer will tell you is an application level performance disaster. Unique keys are better, and, if application semantics allow, sequences are much better.

The fundamental point, however, is to separate the concepts of consistency and transaction serializability. Serializability may imply consistency, but consistency does not imply serializability. However interesting this may be theoretically, from a practical perspective it enables a cloud scale ACID database, and with it, the architectural basis for the larger vision of cloud computing.

-- 
Jim Starkey
NimbusDB, Inc.
978 526-1376

Ricky Ho

unread,

Feb 27, 2010, 5:38:35 PM2/27/10

to cloud-c...@googlegroups.com

Jim,

I agree with you that "consistency" can be defined in a very app-specific way and serializability is a means to achieve this. Also they are not equivalent from a pure theoretical viewpoint. In practice, "serializability" is the only way to achieve "consistency" in a generic way. Looking at consistency in a case by case basis can be high risk and very error-prone.

But I do agree with you in theory.

Rgds,
Ricky

From: Jim Starkey <jsta...@nimbusdb.com>
To: cloud-c...@googlegroups.com

Sent: Sat, February 27, 2010 11:35:47 AM

Jim Starkey

unread,

Feb 27, 2010, 6:26:06 PM2/27/10

to cloud-c...@googlegroups.com

Ricky Ho wrote:

Jim,

I agree with you that "consistency" can be defined in a very app-specific way and serializability is a means to achieve this. Also they are not equivalent from a pure theoretical viewpoint. In practice, "serializability" is the only way to achieve "consistency" in a generic way. Looking at consistency in a case by case basis can be high risk and very error-prone.

But I do agree with you in theory.

Let me explain why serializability is a bad thing.

Serializability means that any mix of transactions has the result of having been executed in some order. This means that at every giving instant, a database has a single definitive state. This absolutely precludes running the database across multiple nodes without forcing the nodes to execute in lock step (as in Oracle RAC), eliminating throughput benefit of parallel execution. (Cluster database systems require synchronization with a distributed lock manager.)

Basically, there is no possibility of building a serializable, cloud scale database system.

Get rid of serializability, and it's a new world.

Jeff Darcy

unread,

Feb 27, 2010, 8:34:48 PM2/27/10

to cloud-c...@googlegroups.com

Jim Starkey wrote:
>> Happily, not. The Gilbert and Lynch paper define availability:
>
> For a distributed system to be continuously available, every request
> received by a non-failing node in the system must result in a
> response.

Had you read further than that, you would have seen that "response" here
refers to a response that is essentially positive in nature - e.g. data
for a read. This is made quite explicit in the proof of Theorem
1 in section 3.1 (emphasis mine).

"we know that the read *returns a value* by the availability requirement"

This is the common, default interpretation of "providing a response" in
the distributed-system literature. Failure indications such as you
suggest are functionally equivalent to non-responses, and are generally
treated as such.

> I'm afraid you may have confused the formal definition of
> availability with the concept of "no node left behind."

No, I'm not the one who's confused here.

If you really feel you've refuted CAP, why not start up a dialog with
Brewer, Gilbert, or Lynch? They're not hard to contact, they're the
authorities on the the definitions and distinctions involved, and they'd
surely be interested in any such result.

Dan Kearns

unread,

Feb 27, 2010, 10:57:32 PM2/27/10

to cloud-c...@googlegroups.com

On Sat, Feb 27, 2010 at 9:38 AM, Jim Starkey <jsta...@nimbusdb.com> wrote:

Jeff Darcy wrote:
On 02/26/2010 10:54 AM, Jim Starkey wrote:  
In other words, you give up availability.  Brewer made exactly this
point on slide 16 of his 2000 PODC keynote
(http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf) where
he introduced the CAP Theorem.  Congratulations, you've invented CP.
Happily, not. The Gilbert and Lynch paper define availability:

For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response.

In my counter-example, a non-failing node in the surviving partition can respond with a success or failure, as appropriate, and a node in a non-surviving partition return failure. A key element of the partition resolution algorithm is that a node can determine whether it is a member of a surviving partition or not, and respond accordingly.

That's a fairly non-suitable interpretation of availability for a lot of scenarios (obviously including those which spawned ec).

If everything works out (I'm having a bit of trouble with the business plan), I'll soon release FailDB: all the scalability, availability and consistency you could possibly want, with an industry-first zero-footprint server install!

-d

Ricky Ho

unread,

Feb 27, 2010, 8:40:24 PM2/27/10

to cloud-c...@googlegroups.com

I think we are talking about serialization between transactions that have shared data. Transactions that are accessing different set of data can definitely proceed in parallel in different machines without giving up the "serializability".

In fact, this is what is happening today. In any distributed system, people try to avoid transaction spanning across multiple system. They do this by carefully partition data such that transaction only happens within the same machine. So they are using multiple system in parallel, scalable and not giving up serializability.

If we get rid of "serializability", then we need to look at what "consistent" means in a case-by-case basis. This will make the application design very error-prone, also very difficult to debug and reason in my opinion.

But are we talking about the same thing using different terms ? We know that the traditional transaction processing model gives up availability for consistency (or serializability) and this won't work in a distributed environment. So we know we need to do the opposite, which is to give up consistency (or serializability) for availability. So we use the term "serializability", but not "consistency" so we all feel better ?

Rgds,
Ricky

From: Jim Starkey <jsta...@nimbusdb.com>
To: cloud-c...@googlegroups.com

Sent: Sat, February 27, 2010 3:26:06 PM

Subject: Re: [ Cloud Computing ] Brewer's CAP Conjecture is False

--

Jim Starkey

unread,

Feb 28, 2010, 11:40:13 AM2/28/10

to cloud-c...@googlegroups.com

If availability, as you argue, requires that a single node that has lost all communication to network must continue to operate in isolation, then the CAP conjecture/theorem is so shallow and trivial to have any value or applicability whatsoever, and is reduced to "there is no consistency without communication" (which, incidentally, was my statement that your originally objected to). Well, Jeff, duh.

But do note that since no database systems guarantee that any transaction will successfully complete, no database system, even one running on a single system, can ever meet your definition of availability.

In a more serious vein, it is certainly possible to design systems that:

Remain consistent in the face of any possible network failure.
Provide availability to the degree allowed by mathematics
Allow intelligent design to enable the survivability of the most beneficial of possible partitions.

This is the applicability of consistency, availability, and partition tolerance: rejecting the trivial and building useful systems. The idea that we should build systems that can tolerate a loss of half the the network because a orphan node couldn't operate is pretty dumb.

Jeff Darcy

unread,

Feb 28, 2010, 8:31:15 PM2/28/10

to cloud-c...@googlegroups.com

Jim Starkey wrote:
> If availability, as you argue, requires that a single node that has
> lost all communication to network must continue to operate in
> isolation, then the CAP conjecture/theorem is so shallow and trivial
> to have any value or applicability whatsoever, and is reduced to
> "there is no consistency without communication"

That's only one part (the A+P => !C part) of what CAP means. If you
were to look at Dynamo papers, for example, then you'd see that it is
indeed useful in some circumstances to leave nodes running, serving
possibly stale data, in preference to shutting them (and every other
local service that depends on them) down because of quorum loss. The
expectation is generally that communication will eventually be restored,
at which time any inconsistencies from the partition period will be
detected and resolved - unambiguously, though not necessarily in any
order that strong consistency would require. Search for "eventual
consistency" to learn more.

The same reasoning also applies, by the way, to some systems that
predate Brewer's statement of CAP. I'm thinking here of distributed
filesystems, especially those such as Coda that support disconnected
operation, in which the preference for serving possibly inconsistent
data over shutting down is explicitly extended all the way down to a
single isolated node - based on the same expectation of a subsequent
reconciliation.

> But do note that since no database systems guarantee that any
> transaction will successfully complete, no database system, even one
> running on a single system, can ever meet your definition of
> availability.

Availability requires that all non-failing nodes respond to requests.
They don't have to be current responses, so they don't have to reflect
that pending transaction. Since this is what the I in ACID is all about
anyway, database systems can and very often do meet this definition of
availability.

> In a more serious vein, it is certainly possible to design systems
> that:
>

> 1. Remain consistent in the face of any possible network failure. 2.
> Provide availability to the degree allowed by mathematics 3. Allow

> intelligent design to enable the survivability of the most beneficial
> of possible partitions.
>
> This is the applicability of consistency, availability, and partition
> tolerance: rejecting the trivial and building useful systems.

I'm sure the people at Amazon, LinkedIn, Facebook, etc. who've built
systems based on explicit recognition of CAP tradeoffs would be very
surprised to hear that those systems aren't useful. Just because you
personally might not find such systems useful doesn't mean others don't.
To quote Schopenhauer:

"Every man takes the limits of his own vision for the limits of the world."

> The idea that we should build systems that can tolerate a loss of
> half the the network because a orphan node couldn't operate is pretty
> dumb.

Yeah, it sure would be, but nobody but you has suggested any such thing.
There's certainly nothing in Brewer's work, or Gilbert and Lynch's, to
suggest anything like it. If you think otherwise, you're confused.

Sassa

unread,

Mar 1, 2010, 8:45:30 AM3/1/10

to Cloud Computing

On Feb 26, 8:43 pm, Jim Starkey <jstar...@nimbusdb.com> wrote:
> Ricky Ho wrote:
> > Jim,
>
> > Can you explain why MVCC is not serializable ?
>
> > In MVCC, every transaction maintaining a private copy of all data it has read and update. At commit time, the transaction submit the versions of this read/write set and compare the version stored in the DB. Commit will be successful in case the version matches, otherwise the transaction fail.
>
> That's not the usual way for MVCC to work. There are alternatives
> (mostly used in record locking schemes), but the original design was to
> keep a list of record versions each tagged with the id of the
> transaction that created that version. When a transaction starts, it
> takes note of what transactions have been committed. When a transaction
> visits a record, it traverses the version list until it finds a record
> that either it created or was committed when it started. When a
> transaction wants to update or delete a record, it checks that the most
> recent version was either previously updated by the transaction or
> committed when the transaction began. Otherwise, it either reports an
> update conflict or, more reasonably, waits for an intervening
> transaction to end, and reports an error only if the other transaction
> committed.

This scheme relies on the list of versions to be a complete list of
all transactions that occurred in the distributed system.

Does it work when some members don't report they've started/committed
a transaction? (=partitioned) If it does, then your distributed MVCC
DB should work with all nodes operating independently for an
indefinite amount of time. Which basically means that your distributed
DB doesn't need to communicate with any of its members. Ever. And
still be consistent. I find this really hard to prove. Can you
elaborate?

Sassa

> What you describe is not MVCC but optimistic concurrency control. They
> are not the same.> Lets say there are two concurrent transactions T1 and T2 overlap in time, and lets say T1 commit first (of course it succeed). After that T2 commits, it will only succeed if its (Read Union Write Set) does not intersect with T1's Write Set.
>
> > If T2 succeed, then it is equivalent to T2 starts AFTER T1 commits, which is equivalent to some serial transaction order, either T1, T2 or T2, T1
>
> > If T2 fails, then T2 need to restart the transaction and read the latest update from T1, which is equivalent to T1, T2.
>
> > So it is equivalent to some serial order of transaction execution, why is MVCC not serializable ?
>
> I described the example that showed MVCC wasn't serializable in a
> previous post, but here's the short version:
>

> 1. Single table with one numeric column.
> 2. Single transaction type that inserts a row reflecting the number

> of rows it could see.

> 3. A serializable system will have rows with all values from one to <n>
> 4. Starting with an empty table, two concurrent MVCC transactions

> >>http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=...

> >> Posting guidelines:http://groups.google.ca/group/cloud-computing/web/frequently-asked-qu...
> >> Follow us on Twitterhttp://twitter.com/cloudcomp_groupor @cloudcomp_group
> >> Post Job/Resume athttp://cloudjobs.net
> >> Buy 88 conference sessions and panels on cloud computing on DVD athttp://www.amazon.com/gp/product/B002H07SEC,http://www.amazon.com/gp/product/B002H0IW1Uor get instant access to downloadable versions athttp://cloudslam09.com/content/registration-5.html

Ken North

unread,

Mar 1, 2010, 3:57:40 PM3/1/10

to cloud-c...@googlegroups.com

Ricky Ho wrote:
>> Can you define "consistency" ?

In the transaction processing sense, consistency is related to
maintaining data integrity.

For example, assume the transaction (a logical unit of work) is a
transfer of $1000 from your savings account to your checking account.
You (the account holder) wants assurance that the system's behavior
will be consistent - for example, that identical queries run at the
same time will return the same answer and that the answer will be
correct. Neither the consumer nor the system architect wants a system
to execute a partial update that produces inconsistent data, such as
debiting the savings account for $1000 without also crediting the
checking account for the same amount.

The result of a credit card transaction could vary based on whether a
system is rigidly consistent, eventually consistent or not consistent.
If a card is reported stolen and the system is consistent, the update
of the card's status will prevent further purchases. If the system
does not enforce consistency, or is eventually consistent, other
transactions could be processed against the card for a time before the
status update ripples through the system..

Ken North

unread,

Mar 1, 2010, 4:17:30 PM3/1/10

to cloud-c...@googlegroups.com

>> The use of SQL, however, is completely orthogonal to massive
scalability. As
is the use or avoidance of joins.

>> There's plenty of evidence, for example, that aggregations &
queries, MPP data
warehouses are scalable and fast, and Hadoop really doesn't carry a
performance
edge.

The eBay data warehouse running on Greenplum is 6+ petabytes (PB). It's
an SQL solution that runs on 96 nodes.
(discussed here:
http://dobbscodetalk.com/index.php?option=com_myblog&show=Terabytes-to-Petabytes-Reflections-on-1999-2009.html&Itemid=29)

Owen O'Malley and Arun Murthy of Yahoo blogged about a project that
used 3800 nodes for Hadoop Map/Reduce to sort a petabyte in 16 hours
and a terabyte in 62 seconds.

http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html

Jim Starkey

unread,

Mar 1, 2010, 5:20:36 PM3/1/10

to cloud-c...@googlegroups.com

Sassa wrote:

On Feb 26, 8:43 pm, Jim Starkey <jstar...@nimbusdb.com> wrote:

Ricky Ho wrote:

Jim,

Can you explain why MVCC is not serializable ?

In MVCC, every transaction maintaining a private copy of all data it has read and update.  At commit time, the transaction submit the versions of this read/write set and compare the version stored in the DB.  Commit will be successful in case the version matches, otherwise the transaction fail.

That's not the usual way for MVCC to work.  There are alternatives
(mostly used in record locking schemes), but the original design was to
keep a list of record versions each tagged with the id of the
transaction that created that version.  When a transaction starts, it
takes note of what transactions have been committed.  When a transaction
visits a record, it traverses the version list until it finds a record
that either it created or was committed when it started.  When a
transaction wants to update or delete a record, it checks that the most
recent version was either previously updated by the transaction or
committed when the transaction began.  Otherwise, it either reports an
update conflict or, more reasonably, waits for an intervening
transaction to end, and reports an error only if the other transaction
committed.

This scheme relies on the list of versions to be a complete list of
all transactions that occurred in the distributed system.

No, actually it doesn't. Consistency rules are enforced by resolution agents that approve or disapprove of specific updates, maintaining consistency. A resolution agent works on a "first come, first serve" basis. An update approved by a consistency agent will be received by a commit agent before the precommit, so there is no window on inconsistency.

Does it work when some members don't report they've started/committed
a transaction? (=partitioned) If it does, then your distributed MVCC
DB should work with all nodes operating independently for an
indefinite amount of time. Which basically means that your distributed
DB doesn't need to communicate with any of its members. Ever. And
still be consistent. I find this really hard to prove. Can you
elaborate?

Resolution agents are necessary to avoid concurrent updates to the same record, unique indexes, and other declared consistency rules. Following a partition event, there three interesting cases:

A transaction is in a failing partition, in which it case it can't commit anyway.
A transaction and resolution agent are in the surviving partition, in which case there is no issue.
A transaction is in the surviving partition but the resolution agent isn't. In this case, there is sufficient information for a subsequently anointed resolution agent to reconstruct its consistency management state and go forward.

Sassa

unread,

Mar 3, 2010, 8:22:36 AM3/3/10

to Cloud Computing

That means that availability is guaranteed only in the partitions that
have a consistency agent. So scalability and partitioning resilience
of this solution depends on scalability and independence of
consistency agents.

> > Does it work when some members don't report they've started/committed
> > a transaction? (=partitioned) If it does, then your distributed MVCC
> > DB should work with all nodes operating independently for an
> > indefinite amount of time. Which basically means that your distributed
> > DB doesn't need to communicate with any of its members. Ever. And
> > still be consistent. I find this really hard to prove. Can you
> > elaborate?
>
> Resolution agents are necessary to avoid concurrent updates to the same
> record, unique indexes, and other declared consistency rules. Following
> a partition event, there three interesting cases:
>

> 1. A transaction is in a failing partition, in which it case it can't
> commit anyway.
> 2. A transaction and resolution agent are in the surviving partition,

> in which case there is no issue.

> 3. A transaction is in the surviving partition but the resolution

> agent isn't. In this case, there is sufficient information for a
> subsequently anointed resolution agent to reconstruct its
> consistency management state and go forward.

This is as obvious as CAP theorem. ;-)

The size of partition cannot be controlled. The consequences are:

* partitions need to know which one is a surviving partition and which
one is a failing partition. I think this is not trivial, if the
members are peers; and hard to scale, if the members are
hierarchically related

* availability in failing partitions is sacrificed (just as CAP or any
other expert says)

* scalability of the surviving partitions depends on the size of
partitions - we aren't talking about clouds = "pool of resources
larger than anyone will ever need" anymore

Sassa

> >>>> Buy 88 conference sessions and panels on cloud computing on DVD athttp://www.amazon.com/gp/product/B002H07SEC,http://www.amazon.com/gp/...get instant access to downloadable versions athttp://cloudslam09.com/content/registration-5.html

Jim Starkey

unread,

Mar 3, 2010, 4:59:02 PM3/3/10

to cloud-c...@googlegroups.com

Each consistency agent will have a deterministic successor node in the surviving partition with sufficient information to continue. A nice side effect to message based replication / consistency control is that the last message received from a transaction is a pre-commit message. When you've got that, you know that you've also received everything prior to it. If you don't get a commit and the originator is voted off the island, the transaction gets dropped on the floor (figuratively).

I think I sketched the partition resolution algorithm earlier, but I'll do it again. The idea is come Hector Garcia Molina: A given set of nodes agree on a set of non-disjoint coteries, where each coterie is a subset of the nodes. Each coterie represents a survivable partition, and since no two coteries are disjoint, in the case of a partition there can be at most surviving coterie. All a node has to do to determine whether is a survivor or not is to determine which nodes it can communicate with and determine whether those nodes constitute a coterie.

Not all nodes, in fact, have to be part of the scheme; it is sufficient to have enough canary nodes to have a desired set of partition characteristics. Other nodes and survivors or not depending on whether then can communicate with a surviving canary.

The minimum coterie size is half the total nodes (and not all 50% solutions work).

Yes, of course, the nodes in a non-surviving partition are lost. But the service itself remains available (if you can reach it). The point is the availability of the service, not individual nodes. Any elastic service can be rapidly spun up to full power by plugging in more machines. You can argue that a service you can't communicate with isn't much use, but that's life in a fallible world. It isn't the service that failed, it's your communications. Again: Without communication, no consistency.

The algebra of coteries is interesting. If you're interested, much has been written about it.

Greg Pfister

unread,

Mar 3, 2010, 5:36:34 PM3/3/10

to Cloud Computing

On Mar 3, 6:22 am, Sassa <sassa...@gmail.com> wrote:
> On Mar 1, 10:20 pm, Jim Starkey wrote

>
> > Resolution agents are necessary to avoid concurrent updates to the same
> > record, unique indexes, and other declared consistency rules. Following
> > a partition event, there three interesting cases:
>
> > 1. A transaction is in a failing partition, in which it case it can't
> > commit anyway.
> > 2. A transaction and resolution agent are in the surviving partition,
> > in which case there is no issue.
> > 3. A transaction is in the surviving partition but the resolution
> > agent isn't. In this case, there is sufficient information for a
> > subsequently anointed resolution agent to reconstruct its
> > consistency management state and go forward.
>
> This is as obvious as CAP theorem. ;-)
>
> The size of partition cannot be controlled. The consequences are:
>
> * partitions need to know which one is a surviving partition and which
> one is a failing partition. I think this is not trivial, if the
> members are peers; and hard to scale, if the members are
> hierarchically related

Am I missing something? Isn't this ye olde quorum determination?
Hasn't somebody somewhere laid out a scalable "count connected node
set" algorithm? Seems like it should be possible in log(N) time.

> * availability in failing partitions is sacrificed (just as CAP or any
> other expert says)

Yeah, but see note below.

> * scalability of the surviving partitions depends on the size of
> partitions - we aren't talking about clouds = "pool of resources
> larger than anyone will ever need" anymore

This is what always has confused me about CAP. What I recall of what
I've read seems to insist that all parts of a partitioned collection
still be able to satisfy all requests.

From my point of view, this seems a priori impossible for the usual
issues relating to split brain clusters. All RO requests are OK, if
you had universal replication of all data. Expensive. But while
reconciliation of some of updates may be possible after reconnection,
but some will conflict.

Which thoughts tell me why this discussion has veered into MVCC. Duh.

And also reinforce my reaction to CAP as saying "I want the impossible
and it's big news that I can't get it."

Greg Pfister
http://perilsofparallel.blogspot.com/

> ...
>
> read more »

Stuart Charlton

unread,

Mar 4, 2010, 1:34:44 PM3/4/10

to cloud-c...@googlegroups.com

On 2010-03-03, at 2:36 PM, Greg Pfister wrote:

* scalability of the surviving partitions depends on the size of
partitions - we aren't talking about clouds = "pool of resources
larger than anyone will ever need" anymore

This is what always has confused me about CAP. What I recall of what
I've read seems to insist that all parts of a partitioned collection
still be able to satisfy all requests.

That's if you choose Availability & Partition-tolerance.

For example, the Internet DNS is often used as an example of this kind of design. DNS still works for my LAN if I update it while it's partitioned from the bigger network. The bigger network can be updated while my LAN's DNS is partitioned. There isn't strong global consistency, but all updates eventually will propagate.

From my point of view, this seems a priori impossible for the usual
issues relating to split brain clusters. All RO requests are OK, if
you had universal replication of all data. Expensive. But while
reconciliation of some of updates may be possible after reconnection,
but some will conflict.

Yup, that's the price of A+P.

Databases with asynchronous log shipping , for example, I would consider another example of preferring A+P (with synchronous log shipping being a choice of C+P), though in practice there is stronger consistency with either approach than some of the extreme examples out there of A+P , arguably because the scale is smaller.

One way I've tried to explain how to think about the tradeoff is in terms of how big your "domain of control" is. When your system is small enough that you have control over the network, and can limit MTTR to very small bounds, you can relax partition tolerance and choose C+A, which is how most commercial clustered databases work (and why they often prefer to have redundant NICs, switches, etc.)..... But as you scale up, across racks, across switches, and eventually across geography & organizations, you have less control over the network failure rates and MTTR, so the software itself has to begin tolerating partitions.... and during failures, either you tell some portion of clients "sorry, can't help you right now", sacrificing availability, or you keep processing requests and have a way to prevent or handle consistency conflicts. A practical system would hopefully provide both options ...

Cheers

Stu

Jim Starkey

unread,

Mar 4, 2010, 5:42:04 PM3/4/10

to cloud-c...@googlegroups.com

Stuart Charlton wrote:

On 2010-03-03, at 2:36 PM, Greg Pfister wrote:

* scalability of the surviving partitions depends on the size of

partitions - we aren't talking about clouds = "pool of resources

larger than anyone will ever need" anymore

This is what always has confused me about CAP. What I recall of what
I've read seems to insist that all parts of a partitioned collection
still be able to satisfy all requests.

That's if you choose Availability & Partition-tolerance.

For example, the Internet DNS is often used as an example of this kind of design. DNS still works for my LAN if I update it while it's partitioned from the bigger network. The bigger network can be updated while my LAN's DNS is partitioned. There isn't strong global consistency, but all updates eventually will propagate.

Well, not exactly. Most updates eventually will propagate. Those that got clobbered will be lost. Which of several conflicting updates win is not deterministic. Also, it can be argued that a DNS server on an isolated partition is giving out wrong information.

That slop isn't good enough for a database system. If somebody in Tokyo and somebody in New York can each withdraw the last $1M from an account, there's going to be an unhappy banker who's going to think that getting the right answer trumps a funny definition of availability.

It's wrong to confuse parts of a system with the system as a whole. The fact that a number of parts of an available and functional system is irrelevant. What's important is that the system remains available. If you can't reach it, well, tough noggies.

Partition tolerant should mean that the system retains its integrity, consistency, and availability until it's chopped into such little pieces that none have the critical mass to continue. This is completely doable as long as you aren't going to insist that the laws of logic fail.

From my point of view, this seems a priori impossible for the usual
issues relating to split brain clusters. All RO requests are OK, if
you had universal replication of all data. Expensive. But while
reconciliation of some of updates may be possible after reconnection,
but some will conflict.

Yup, that's the price of A+P.

Databases with asynchronous log shipping , for example, I would consider another example of preferring A+P (with synchronous log shipping being a choice of C+P), though in practice there is stronger consistency with either approach than some of the extreme examples out there of A+P , arguably because the scale is smaller.

One way I've tried to explain how to think about the tradeoff is in terms of how big your "domain of control" is. When your system is small enough that you have control over the network, and can limit MTTR to very small bounds, you can relax partition tolerance and choose C+A, which is how most commercial clustered databases work (and why they often prefer to have redundant NICs, switches, etc.)..... But as you scale up, across racks, across switches, and eventually across geography & organizations, you have less control over the network failure rates and MTTR, so the software itself has to begin tolerating partitions.... and during failures, either you tell some portion of clients "sorry, can't help you right now", sacrificing availability, or you keep processing requests and have a way to prevent or handle consistency conflicts. A practical system would hopefully provide both options ...

The serious and interesting questions are the requirements for a geographically disperse database system. You need the bandwidth to handle the replication, certainly. And an acceptable latency on the long haul parts so commit latency is acceptable. Also redundant storage at every site. Practically speaking, you probably also need to have the bulk of the updates be effectively local to a geographical region to minimize message traffic. But it's doable as long as you don't get hung up on the CAP conjecture.

Stuart Charlton

unread,

Mar 5, 2010, 1:41:54 AM3/5/10

to cloud-c...@googlegroups.com

On 2010-03-04, at 2:42 PM, Jim Starkey wrote:

For example, the Internet DNS is often used as an example of this kind of design. DNS still works for my LAN if I update it while it's partitioned from the bigger network. The bigger network can be updated while my LAN's DNS is partitioned. There isn't strong global consistency, but all updates eventually will propagate.

Well, not exactly. Most updates eventually will propagate. Those that got clobbered will be lost. Which of several conflicting updates win is not deterministic. Also, it can be argued that a DNS server on an isolated partition is giving out wrong information.

Completely agree. Stale information = relaxed consistency. Though in this case, it's better to serve up cached "old" IP addresses that might be wrong, than nothing at all.

That slop isn't good enough for a database system. If somebody in Tokyo and somebody in New York can each withdraw the last $1M from an account, there's going to be an unhappy banker who's going to think that getting the right answer trumps a funny definition of availability.

It isn't slop at all. There are plenty of database systems (in banking!) that are a day+ behind, fed by batch files. That's another example of relaxed consistency. This of course can lead to data quality problems if you're doing this in your OLTP systems without careful controls, but it's more acceptable for OLAP or Data Marts.

It's wrong to confuse parts of a system with the system as a whole. The fact that a number of parts of an available and functional system is irrelevant. What's important is that the system remains available. If you can't reach it, well, tough noggies.

Partition tolerant should mean that the system retains its integrity, consistency, and availability until it's chopped into such little pieces that none have the critical mass to continue. This is completely doable as long as you aren't going to insist that the laws of logic fail.

Well, as I've said before, it's possible to build a RDBMS that balances consistency, availability, and partition-tolerance, and plays with probabilities of what sort of failures occur & need to be tolerated. This likely would require some layering, e.g. local clusters that don't tolerate partitions , wide-area replication that does tolerate it such as in the case of asynchronous log shipping. But it doesn't refute the CAP tradeoffs, those tradeoffs remain a useful explanation of constraints that can help to justify certain design decisions.

And in the end, all that matters is that the customer gets the scalability & consistency they want in their database system, they won't give a whit about CAP.

One way I've tried to explain how to think about the tradeoff is in terms of how big your "domain of control" is. When your system is small enough that you have control over the network, and can limit MTTR to very small bounds, you can relax partition tolerance and choose C+A, which is how most commercial clustered databases work (and why they often prefer to have redundant NICs, switches, etc.)..... But as you scale up, across racks, across switches, and eventually across geography & organizations, you have less control over the network failure rates and MTTR, so the software itself has to begin tolerating partitions.... and during failures, either you tell some portion of clients "sorry, can't help you right now", sacrificing availability, or you keep processing requests and have a way to prevent or handle consistency conflicts. A practical system would hopefully provide both options ...

The serious and interesting questions are the requirements for a geographically disperse database system. You need the bandwidth to handle the replication, certainly. And an acceptable latency on the long haul parts so commit latency is acceptable. Also redundant storage at every site. Practically speaking, you probably also need to have the bulk of the updates be effectively local to a geographical region to minimize message traffic. But it's doable as long as you don't get hung up on the CAP conjecture.

Well, I agree it's doable, and I don't think people should get hung up on the conjecture (it can lead to analysis paralysis). But for many it's a useful way of thinking of tradeoffs, particularly in this age of challenging the status quo in the database world.

Stu

Greg Pfister

unread,

Mar 5, 2010, 9:40:33 PM3/5/10

to Cloud Computing

>There are plenty of database systems (in banking!) that are a day+ behind, fed by batch files. That's another example of relaxed consistency. This of course can lead to data quality problems if you're doing this in your OLTP systems without careful controls, but it's more acceptable for OLAP or Data Marts.

Isn't the day+ behind what's called as "pending" transactions? I'd
argue that's a different case entirely, one where the underlying
semantics are defined to be *consistent* with real-world physical
delays.

(Or with banking float-for-a-profit practices, but either way.)

Greg Pfister
http://perilsofparallel.blogspot.com/

Sassa

unread,

Mar 8, 2010, 1:01:29 PM3/8/10

to Cloud Computing

On Mar 3, 10:36 pm, Greg Pfister <greg.pfis...@gmail.com> wrote:
> On Mar 3, 6:22 am, Sassa <sassa...@gmail.com> wrote:
>
> > On Mar 1, 10:20 pm, Jim Starkey wrote
>
> > > Resolution agents are necessary to avoid concurrent updates to the same
> > > record, unique indexes, and other declared consistency rules. Following
> > > a partition event, there three interesting cases:
>
> > > 1. A transaction is in a failing partition, in which it case it can't
> > > commit anyway.
> > > 2. A transaction and resolution agent are in the surviving partition,
> > > in which case there is no issue.
> > > 3. A transaction is in the surviving partition but the resolution
> > > agent isn't. In this case, there is sufficient information for a
> > > subsequently anointed resolution agent to reconstruct its
> > > consistency management state and go forward.
>
> > This is as obvious as CAP theorem. ;-)
>
> > The size of partition cannot be controlled. The consequences are:
>
> > * partitions need to know which one is a surviving partition and which
> > one is a failing partition. I think this is not trivial, if the
> > members are peers; and hard to scale, if the members are
> > hierarchically related
>
> Am I missing something? Isn't this ye olde quorum determination?
> Hasn't somebody somewhere laid out a scalable "count connected node
> set" algorithm? Seems like it should be possible in log(N) time.

Well, that will work if your cluster gets split into two partitions. I
don't think you can get quorum if it can be partitioned into more than
two partitions at once (=before the connectivity algorithm ends)

> > * availability in failing partitions is sacrificed (just as CAP or any
> > other expert says)
>
> Yeah, but see note below.
>
> > * scalability of the surviving partitions depends on the size of
> > partitions - we aren't talking about clouds = "pool of resources
> > larger than anyone will ever need" anymore
>
> This is what always has confused me about CAP. What I recall of what
> I've read seems to insist that all parts of a partitioned collection
> still be able to satisfy all requests.

Yes, ABLE TO :-)

> From my point of view, this seems a priori impossible for the usual
> issues relating to split brain clusters. All RO requests are OK, if
> you had universal replication of all data. Expensive. But while
> reconciliation of some of updates may be possible after reconnection,
> but some will conflict.
>
> Which thoughts tell me why this discussion has veered into MVCC. Duh.
>
> And also reinforce my reaction to CAP as saying "I want the impossible
> and it's big news that I can't get it."

I perceive CAP _is_ the latter - i.e. to settle it once and for all:
don't waste your time, this IS impossible; (and insisting that it is
possible means there is a flaw in your design, we won't spend our time
looking for the bug in your PhD thesis, etc).

Sassa

> Greg Pfisterhttp://perilsofparallel.blogspot.com/

Sassa

unread,

Mar 8, 2010, 1:24:34 PM3/8/10

to Cloud Computing

Thank you, Jim, as always good technical leads, I'll go do some
homework on coteries now :-)

I won't argue about the rest of the points right now, but I have a bad
feeling about "Any elastic service can be rapidly spun up to full

power by plugging in more machines."

A cloud provider doesn't really have an infinite set of resources.
That's what was bugging me from day one on this list. I found
reconciliation in the definition of infinity as "more resources than
anyone will ever need". Starting from this point, it may be really
hard to prove that any partition will have "more resources than anyone
will ever need", hence elasticity promises can be hard to achieve.

Say, a cloud provider had 1000 machines. You don't know how many, but
more than you need. Now after partitioning you end up with a survivor
on a 100-node DC. Your service cannot be rapidly spun up, because
there are no more machines to plug in.

This sort of problem doesn't concern stateless services. But
transactional computations are stateful.

Sassa

> >>>>>> Buy 88 conference sessions and panels on cloud computing on DVD athttp://www.amazon.com/gp/product/B002H07SEC,http://www.amazon.com/gp/...instant access to downloadable versions athttp://cloudslam09.com/content/registration-5.html

Sassa

unread,

Mar 8, 2010, 1:38:33 PM3/8/10

to Cloud Computing

Well, if you put it THAT way, yes, it is doable for some partitions.
You can design a system to tolerate most probable partitions (i.e.
somewhat P) and remain somewhat C plus somewhat A. But I think in
CAP's terms it won't mean "absolutely partition-tolerant", if it is
"absolutely consistent" and "absolutely available" at the same time.

Sassa

Stuart Charlton

unread,

Mar 10, 2010, 11:07:31 AM3/10/10

to cloud-c...@googlegroups.com

Most day+ delays in batch data loads were driven by technical
limitations that the business worked around.

But it pervaded subject areas that led to poor data quality - for
example, customer records that became inconsistent across departments
and divisions. Through the prism of CAP, this is a crude form of A+P.

This lead to massive data warehouses and "master data management"
products to clean the data and present a "single view of the
customer"...

Stu

Sent from my iPhone

> --
> ~~~~~

> Register Today for Cloud Slam 2010 at official website - http://cloudslam10.com
> Posting guidelines: http://groups.google.ca/group/cloud-computing/web/frequently-asked-questions
> Follow us on Twitter http://twitter.com/cloudcomp_group or
> @cloudcomp_group
> Post Job/Resume at http://cloudjobs.net
> Buy 88 conference sessions and panels on cloud computing on DVD at
> http://www.amazon.com/gp/product/B002H07SEC, http://www.amazon.com/gp/product/B002H0IW1U

> or get instant access to downloadable versions at http://cloudslam09.com/content/registration-5.html

Sassa

unread,

Mar 13, 2010, 12:59:24 PM3/13/10

to Cloud Computing

I have three problems with Molina's method of assigning votes in a
distributed system:

* Not sure how this is going to work for infinite set of nodes
** Hence probably can't model the cloud as an infinite set of nodes

* Not sure how it works when the size of the cluster is dynamic, like
in elastic computing
** You should be concerned not with the current size of the
partitions, but with the future size of the partitions - i.e. it makes
sense to kill the partition that can't scale in preference to the
partition that can scale to a larger size

* Not sure how it works if partitioning ends up with more than two
partitions
** I.e. the meaning of "majority" is a strict inequality that one
partition can have more than a half of expected votes
** Cannot work even if you know the number of partitions, when it is
more than 2

Sassa

> >>>>>> Buy 88 conference sessions and panels on cloud computing on DVD athttp://www.amazon.com/gp/product/B002H07SEC,http://www.amazon.com/gp/...instant access to downloadable versions athttp://cloudslam09.com/content/registration-5.html

Jim Starkey

unread,

Mar 13, 2010, 2:32:49 PM3/13/10

to cloud-c...@googlegroups.com

Sassa wrote:

I have three problems with Molina's method of assigning votes in a
distributed system:

* Not sure how this is going to work for infinite set of nodes
** Hence probably can't model the cloud as an infinite set of nodes

The concept of infinite nodes violates the laws of both economics and physics. Putting that aside, assume a finite set of "canary node" that participate in coteries. A non-canary node is part of the the surviving partition if it can communicate with at least one canary node in the surviving partition.

* Not sure how it works when the size of the cluster is dynamic, like
in elastic computing

The coterie set can be changed by mutual consent. I'll leave the protocol as an exercise for the reader.

** You should be concerned not with the current size of the
partitions, but with the future size of the partitions - i.e. it makes
sense to kill the partition that can't scale in preference to the
partition that can scale to a larger size

By definition, the partitions can't communicate, so it isn't possible to evaluate which is larger. That said, you can get the effect you want by the selection of coteries.

* Not sure how it works if partitioning ends up with more than two
partitions

Not an issue. If any partition is large enough to contain a coteries, it is the only surviving partition.

** I.e. the meaning of "majority" is a strict inequality that one
partition can have more than a half of expected votes

Depending on how the coteries are selected, a majority isn't required. It's possible to have a hundred machines with only three coteries { a, b }, { b, c }, {a, c} such that only two of { a, b, c } would constitute a surviving partition. The flexibility of coteries let you plan for extreme sets of failures.

** Cannot work even if you know the number of partitions, when it is
more than 2

Assume there are n surviving partitions for n > 1. By the definition of surviving partition, each must contain a coterie. By the definition of partition, no two partitions can have communication nodes, therefore the coteries must be disjoint, which violates the definition of coteries.

Sassa

unread,

Mar 15, 2010, 7:39:01 AM3/15/10

to Cloud Computing

Assume you have those {a,b,c} canaries. Assume {c} is not reachable
anymore. {a,b} is the surviving partition. But scalability of that
partition is not proven. Can't scale - availability suffers.
Elasticity of the cloud can't add anything here.

On Mar 13, 7:32 pm, Jim Starkey <jstar...@nimbusdb.com> wrote:
> Sassa wrote:
> > I have three problems with Molina's method of assigning votes in a
> > distributed system:
>
> > * Not sure how this is going to work for infinite set of nodes
> > ** Hence probably can't model the cloud as an infinite set of nodes
>
> The concept of infinite nodes violates the laws of both economics and
> physics. Putting that aside, assume a finite set of "canary node" that

On a few occasions you mentioned cloud as an infinite pool of
resources, hence the remark.

A finite set of canaries means finite scale. Even if the complete set
of canaries survive, you can't say that you can scale out any time to
reconstitute any lost non-canary nodes.

I am referring here to your comment that a surviving partition can
scale to the necessary size.

Molina's work ends where clusters start to change their size over
time.

> participate in coteries. A non-canary node is part of the the surviving
> partition if it can communicate with at least one canary node in the
> surviving partition.> * Not sure how it works when the size of the cluster is dynamic, like
> > in elastic computing
>
> The coterie set can be changed by mutual consent. I'll leave the
> protocol as an exercise for the reader.> ** You should be concerned not with the current size of the

Yes, that's not the problem. The problem is that all partitions (the
dead ones and the surviving one) scale to different extent. So keeping
a partition that is locked in a 100-node room vs partition that
retained communication to 10 DCs has different implications.

> > partitions, but with the future size of the partitions - i.e. it makes
> > sense to kill the partition that can't scale in preference to the
> > partition that can scale to a larger size
>
> By definition, the partitions can't communicate, so it isn't possible to
> evaluate which is larger. That said, you can get the effect you want by
> the selection of coteries.> * Not sure how it works if partitioning ends up with more than two
> > partitions
>
> Not an issue. If any partition is large enough to contain a coteries,
> it is the only surviving partition.> ** I.e. the meaning of "majority" is a strict inequality that one
> > partition can have more than a half of expected votes
>
> Depending on how the coteries are selected, a majority isn't required.
> It's possible to have a hundred machines with only three coteries { a, b}, { b, c }, {a, c} such that only two of { a, b, c } would constitute a
>
> surviving partition. The flexibility of coteries let you plan for
> extreme sets of failures.> ** Cannot work even if you know the number of partitions, when it is
> > more than 2
>
> Assume there are n surviving partitions for n > 1. By the definition
> of surviving partition, each must contain a coterie. By the definition
> of partition, no two partitions can have communication nodes, therefore
> the coteries must be disjoint, which violates the definition of coteries.

OK, I understand that. This can work in a fixed-size cluster with no
proof of the surviving partition scaling out well. (and there are
partitions where all partitions will be deemed dead - i.e. {a},{b},
{c}; but ok, that's a partition not planned for, as you put it; and
yes, it does work for partitions planned for)

Sassa

> >>>>>>>> Buy 88 conference sessions and panels on cloud computing on DVD athttp://www.amazon.com/gp/product/B002H07SEC,http://www.amazon.com/gp/...access to downloadable versions athttp://cloudslam09.com/content/registration-5.html

Jim Starkey

unread,

Mar 15, 2010, 10:15:59 PM3/15/10

to cloud-c...@googlegroups.com

Sassa wrote:

Assume you have those {a,b,c} canaries. Assume {c} is not reachable
anymore. {a,b} is the surviving partition. But scalability of that
partition is not proven. Can't scale - availability suffers.
Elasticity of the cloud can't add anything here.

Sure it can -- add more nodes. It's not going to help the guys who can't reach them, but life is hard.

If Brewer's conjecture is reduced to "no communication == no consistency", well, then, duh.

Krishna Sankar

unread,

Mar 15, 2010, 10:45:55 PM3/15/10

to cloud-c...@googlegroups.com

http://blogs.wsj.com/digits/2010/03/15/berkeley-prof-helped-divvy-up-search-to-many-servers/?mod=rss_WSJBlog ... Duh ;o)

On 3/15/10 Mon Mar 15, 10, "Jim Starkey" <jsta...@nimbusdb.com> wrote:

Sassa wrote:

Assume you have those {a,b,c} canaries. Assume {c} is not reachable
anymore. {a,b} is the surviving partition. But scalability of that
partition is not proven. Can't scale - availability suffers.
Elasticity of the cloud can't add anything here.

Sure it can -- add more nodes. It's not going to help the guys who can't reach them, but life is hard.

If Brewer's conjecture is reduced to "no communication == no consistency", well, then, duh.

On Mar 13, 7:32 pm, Jim Starkey <jstar...@nimbusdb.com> <mailto:jstar...@nimbusdb.com> wrote:

Sassa


On Mar 3, 9:59 pm, Jim Starkey <jstar...@nimbusdb.com> <mailto:jstar...@nimbusdb.com> wrote:


Sassa wrote:


On Mar 1, 10:20 pm, Jim Starkey <jstar...@nimbusdb.com> <mailto:jstar...@nimbusdb.com> wrote:


Sassa wrote:

On Feb 26, 8:43 pm, Jim Starkey <jstar...@nimbusdb.com> <mailto:jstar...@nimbusdb.com> wrote:

Sassa

unread,

Mar 19, 2010, 4:30:37 PM3/19/10

to Cloud Computing

On Mar 16, 2:15 am, Jim Starkey <jstar...@nimbusdb.com> wrote:
> Sassa wrote:
> > Assume you have those {a,b,c} canaries. Assume {c} is not reachable
> > anymore. {a,b} is the surviving partition. But scalability of that
> > partition is not proven. Can't scale - availability suffers.
> > Elasticity of the cloud can't add anything here.
>
> Sure it can -- add more nodes. It's not going to help the guys who
> can't reach them, but life is hard.

"add more nodes" as in "get the lorry on the road"? I am referring to
the limit imposed by the physical size of the partition. If the DC has
only 1000 nodes, then a surviving partition locked in that DC won't
scale beyond 1000 nodes

unless you ship a truck-load of more nodes, which isn't the same as
"scale out on demand to 'infinite' size".

> If Brewer's conjecture is reduced to "no communication == no
> consistency", well, then, duh.

yes, it is reduced to exactly that - "no communication == no
consistency OR no availability". that's what it says on the label
anyway!

some say "told you!", yet others say "I've got a system that proves it
wrong!"

I take CAP theorem is a formal way to put a stop to the argument
whether "my system can have all 3!"

Sassa

Greg Pfister

unread,

Apr 7, 2010, 9:38:13 PM4/7/10

to Cloud Computing

Some new fodder for this mill: Stonebraker weighs in on CAP in CACM:
http://bit.ly/cCDWDS

My reading is that he seems to think it's not so much wrong as
irrelevant -- not useful: Situations where it might apply it either
doesn't in reality, or else they occur very infrequently.

A likely contentious position presented is that in his experience LAN
partitions are "exceedingly rare" and WAN partitions "quite rare."

(And he references NimbusDB!)

Greg Pfister
http://perilsofparallel.blogspot.com/

Sassa

unread,

Apr 9, 2010, 10:24:25 AM4/9/10

to Cloud Computing

Depends how you spin it. My reading is that "CAP" is always
applicable, except a few corner cases that any system would have
problem with. I think he insists that P(artitioning) is not as big a
deal; so even though CAP is right, it doesn't mean you want to
sacrifice C or A for P (hence a stab at NoSQL with eventual
consistency).

Sassa

On Apr 8, 2:38 am, Greg Pfister <greg.pfis...@gmail.com> wrote:
> Some new fodder for this mill: Stonebraker weighs in on CAP in CACM:http://bit.ly/cCDWDS
>
> My reading is that he seems to think it's not so much wrong as
> irrelevant -- not useful: Situations where it might apply it either
> doesn't in reality, or else they occur very infrequently.
>
> A likely contentious position presented is that in his experience LAN
> partitions are "exceedingly rare" and WAN partitions "quite rare."
>
> (And he references NimbusDB!)
>

> Greg Pfisterhttp://perilsofparallel.blogspot.com/

Reply all

Reply to author

Forward