Google Spanner and NoSQL in general


Gilad Parann-Nissany

Sep 29, 2012, 5:19:58 AM
to cloud-c...@googlegroups.com
Hi folks

Just read through http://research.google.com/archive/spanner.html  "Google's scalable, multi-version, globally-distributed, and synchronously-replicated database"

Would be interested in the group's view, especially regarding
  • does this provide true ACID?
  • does it scale to cloud scale?

Thanks, Regards
Gilad
__________________
Gilad Parann-Nissany
http://www.porticor.com/

Daniel Drozdzewski

Oct 1, 2012, 7:51:36 AM
to cloud-c...@googlegroups.com
On 29 September 2012 10:19, Gilad Parann-Nissany <gilad....@gmail.com> wrote:
> Hi folks
>
> Just read through http://research.google.com/archive/spanner.html "Google's
> scalable, multi-version, globally-distributed, and synchronously-replicated
> database"
>
> Would be interested in the group's view, especially regarding
>
> does this provide true ACID?
> does it scale to cloud scale?


In their blurb there is a sentence answering both your questions: "It
is the first system to distribute data at global scale and support
externally-consistent distributed transactions."
They achieve consistency through multiversioning instead of locking
and they establish global serialisability of transactions through the
TrueTime API (one accurate wall clock for all nodes).
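
As a rough illustration of the clock idea (a sketch, not Google's implementation): the paper describes a clock API that returns an interval rather than a single instant, and a writer waits until its chosen commit timestamp is definitely in the past before the write becomes visible. The Python below simulates that under stated assumptions. `FakeTrueTime`, `EPSILON`, `Interval` and `commit` are hypothetical names, and the clock is a manually advanced counter rather than real GPS or atomic time.

```python
from dataclasses import dataclass

EPSILON = 4  # assumed clock uncertainty, in arbitrary ticks


@dataclass(frozen=True)
class Interval:
    earliest: int
    latest: int


class FakeTrueTime:
    """Simulated clock: true time is known only to within +/- EPSILON."""

    def __init__(self):
        self.true_time = 100

    def now(self):
        return Interval(self.true_time - EPSILON, self.true_time + EPSILON)

    def after(self, t):
        # True only once t has definitely passed, whatever the clock error.
        return self.now().earliest > t

    def tick(self, n=1):
        self.true_time += n


def commit(tt):
    """Pick a commit timestamp, then wait out the uncertainty (commit wait)."""
    s = tt.now().latest          # no earlier than the latest possible "now"
    waited = 0
    while not tt.after(s):       # a real system would sleep here
        tt.tick()
        waited += 1
    return s, waited


tt = FakeTrueTime()
s1, waited1 = commit(tt)
s2, waited2 = commit(tt)         # starts only after the first commit finished
print(s1, waited1, s2, waited2)
```

The point of the wait is that a transaction which starts after an earlier one has finished always gets a larger timestamp, and the cost is a delay of roughly twice the clock uncertainty per commit.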

It would be good to see Spanner being open sourced or at least some
more from under the hood.

--
Daniel Drozdzewski

Greg Pfister

Oct 1, 2012, 7:39:55 PM
to cloud-c...@googlegroups.com
On Saturday, September 29, 2012 2:19:58 AM UTC-7, Gilad wrote:
> Hi folks
>
> Just read through http://research.google.com/archive/spanner.html "Google's scalable, multi-version, globally-distributed, and synchronously-replicated database"
>
> Would be interested in the group's view, especially regarding
>   • does this provide true ACID?
>   • does it scale to cloud scale?

"Yes" to cloud scale, as can be told from the first few paragraphs.

ACID - my reading is "not completely." (For reference: http://en.wikipedia.org/wiki/ACID)

Durable - yes, it writes to durable memory, designed to be replicated across wide areas (including cross-continent). That's about as durable as anything, anywhere.

Isolated - this is essentially the same as serializable, which they claim at the top left of page 2 and discuss extensively.

Atomic - not explicitly stated, but their discussion of supporting transactions, which can succeed or abort, implies that it is atomic when you do a transaction.

Consistent - not clear to me that this is present. It's unclear what "consistent" means when the application can stuff what it wants wherever it wants in the tablets / Colossus file systems. It's not like an RDBMS where you specify table content and relationships, and have triggers, etc. But maybe I'm misreading what "consistent" necessarily means in a general RDBMS.

So, is this the mythical cloud-scale ACID RDBMS? My take is "not exactly," but that's also not what they set out to implement.

Greg Pfister

Gilad Parann-Nissany

Oct 4, 2012, 6:28:33 AM
to cloud-c...@googlegroups.com
Greg, thanks for the insight and analysis. I agree consistency is the central issue.

My reading is that they accept "long latency" in place of "eventual consistency". However, like you, I do not really understand what consistency means in this context and was hoping for help from the group. I am particularly confused by what happens if there is a high rate of updates. Let's assume it is so high that the time between updates is shorter than the latency of updating the global system. What does consistency mean in that situation?

Regards
Gilad
__________________
Gilad Parann-Nissany
http://www.porticor.com/


--
-- ~~~~~
Posting guidelines: http://bit.ly/bL3u3v
Follow us on Twitter @cloudcomp_group @khazret_sapenov @cloudslam @up_con
 
Download hundreds of recorded cloud sessions at
- http://cloudslam.org/register
- http://up-con.com/register
- http://cloudslam09.com/content/registration-5.html
- http://cloudslam10.com/content/registration
 
or get it on DVD at
http://www.amazon.com/gp/product/B002H07SEC, http://www.amazon.com/gp/product/B004L1755W, http://www.amazon.com/gp/product/B002H0IW1U
 
~~~~~
You received this message because you are subscribed to the Google Groups "Cloud Computing" group.
---
You received this message because you are subscribed to the Google Groups "Cloud Computing" group.
To post to this group, send email to cloud-c...@googlegroups.com.
To unsubscribe from this group, send email to cloud-computi...@googlegroups.com.
Visit this group at http://groups.google.com/group/cloud-computing?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Jim Starkey

Oct 4, 2012, 4:49:37 PM
to cloud-c...@googlegroups.com
Sorry, but I have to quibble a bit.


On 10/1/2012 7:39 PM, Greg Pfister wrote:
> On Saturday, September 29, 2012 2:19:58 AM UTC-7, Gilad wrote:
>> Hi folks
>>
>> Just read through http://research.google.com/archive/spanner.html "Google's scalable, multi-version, globally-distributed, and synchronously-replicated database"
>>
>> Would be interested in the group's view, especially regarding
>>   • does this provide true ACID?
>>   • does it scale to cloud scale?
> "Yes" to cloud scale, as can be told from the first few paragraphs.

Yes to cloud scale if you design the application and database to fit their model.  They are quite unusual in that their data definition language allows specification of logical storage affinity.  This means, for example, that a photo album record and its photographs will be stored together, so a join of a single album and its photographs will be quite fast.  Without this feature -- and correct usage -- the photographs could be scattered over a thousand computers around the world.  In other words, you have to be very careful what you ask for.
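
To make the storage-affinity point concrete, here is a small sketch. It assumes rows are kept sorted by key and that a child row's key starts with its parent's key, so an album and its photos end up next to each other. The table contents, `rows` and `album_with_photos` are invented for illustration; this is not Spanner's DDL or storage engine.

```python
# Each key is a tuple; a photo's key begins with its album's key, so sorting
# by key places every album directly before its own photos.
rows = {
    ("album", 2): "Holiday",
    ("album", 1): "Wedding",
    ("album", 1, "photo", 11): "cake.jpg",
    ("album", 2, "photo", 20): "beach.jpg",
    ("album", 1, "photo", 10): "rings.jpg",
}


def album_with_photos(table, album_id):
    """Return the album row followed by its photos, in key order."""
    prefix = ("album", album_id)
    ordered = sorted(table.items())
    return [(k, v) for k, v in ordered if k[:2] == prefix]


ordered_keys = sorted(rows)
result = album_with_photos(rows, 1)
start = ordered_keys.index(result[0][0])
# The matching rows occupy one unbroken slice of the sorted key space.
contiguous = ordered_keys[start:start + len(result)] == [k for k, _ in result]
print(result, contiguous)
```

If the photo keys did not share the album prefix, the same join would have to pick rows out of unrelated parts of the key space, which is the "scattered over a thousand computers" case.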

The Google applications that use Spanner are essentially single user applications in the sense that only one user is authorized to do updates.  This, in conjunction with placement control, allows a high volume of unrelated updates.

Under a contentious load, however, Spanner performance can be expected to drop off dramatically when the Paxos consensus protocols kick in (think two phase commit on steroids).

So I guess a fair answer would be yes if your problem matches their solution, otherwise probably not.



> ACID - my reading is "not completely." (For reference: http://en.wikipedia.org/wiki/ACID)

Close enough in my book.



> Durable - yes, it writes to durable memory, designed to be replicated across wide areas (including cross-continent). That's about as durable as anything, anywhere.

Yup.



> Isolated - this is essentially the same as serializable, which they claim at the top left of page 2 and discuss extensively.

Not at all the same as serializable.  Serializability is not required for ACID, just consistency.

Isolated is a necessary precondition for consistent, however, so there's no getting around it.



> Atomic - not explicitly stated, but their discussion of supporting transactions, which can succeed or abort, implies that it is atomic when you do a transaction.

Their use of the timestamps and the Paxos protocol (http://en.wikipedia.org/wiki/Paxos_(computer_science)) guarantees atomicity.



> Consistent - not clear to me that this is present. It's unclear what "consistent" means when the application can stuff what it wants wherever it wants in the tablets / Colossus file systems. It's not like an RDBMS where you specify table content and relationships, and have triggers, etc. But maybe I'm misreading what "consistent" necessarily means in a general RDBMS.

Consistent means that a transaction starts and ends with a consistent state.  Details include enforcing uniqueness of keys, referential integrity, etc.  I don't believe Spanner has anything but (unique) primary keys, which makes this simpler.


> So, is this the mythical cloud-scale ACID RDBMS? My take is "not exactly," but that's also not what they set out to implement.

The mythical cloud-scale ACID RDBMS is approaching GA in Cambridge, Massachusetts.

Abhishek Pamecha

Oct 4, 2012, 5:29:14 PM
to cloud-c...@googlegroups.com

Yes, it looks like the key design or the directory design is crucial to get good performance out of it OR extract ACID-ness out of it.


One thought that occurred to me was that a GUID+metadata [say userId] for that system would have to encapsulate (almost) every other ‘sub-directory’. Then are we looking at a single giant directory tree schema, which represents the entire system? Because conversely, if you partition it, you are going to lose consistency.


Any thoughts?

 

Thanks,

Abhishek

Greg Pfister

Oct 4, 2012, 6:42:11 PM
to cloud-c...@googlegroups.com
On Thursday, October 4, 2012 4:28:33 AM UTC-6, Gilad wrote:
> Greg, thanks for the insight and analysis.

You're welcome.

> I agree consistency is the central issue.

It's an issue, I agree, and may be central in some contexts.

> My reading is that they accept "long latency" in place of "eventual consistency". However, like you, I do not really understand what consistency means in this context and was hoping for help from the group. I am particularly confused by what happens if there is a high rate of updates. Let's assume it is so high that the time between updates is shorter than the latency of updating the global system. What does consistency mean in that situation?

Our issue is what consistency means in any situation, not just that one.

From your description, I think you're really worrying about serializability (the "I" in ACID, for "Isolation").

First, assume updates are not coming in so fast that the DBS cannot handle them all; that's a different kettle of fish. 

But they can still come in faster than the update latency, including replication, so, e.g., you could have for transactions A, B, C: A_arrives, B_arrives, C_arrives [very short interval between arrivals] ... wait for a longer while until updates done ... A_completes, B_completes, C_completes. (Or you could get the completions in a different order.)

If they can be handled, then they are being pipelined: Many are in flight at the same time - perhaps literally, in flight over communications links. Then the issue becomes how they are kept straight - since they can have been sent from different sources, and arrive at various replicated nodes in different orders.

That's the core of serializability. It works. A defined order is preserved, using timestamps and the inherent capabilities of the Paxos algorithm: That's what Paxos does. It's the reason Paxos was created. The GPS & atomic clock things just make it more efficient, allowing assumptions to be made about a global time, something Paxos doesn't require, but (apparently) can make use of to reduce delays involved.

It works, in the sense that (at least) a consistent snapshot can be taken at any time, even while updates arrive faster than basic latencies (without total overloading).
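
A minimal sketch of what "a consistent snapshot can be taken at any time" can look like when every write carries a timestamp. It assumes that each transaction's writes are all tagged with one commit timestamp, and that every transaction with a timestamp at or below the read timestamp has already been applied. `MultiVersionStore` and its methods are hypothetical names, not anything from the paper.

```python
import bisect
from collections import defaultdict


class MultiVersionStore:
    """Keeps every committed version of a key, tagged with its timestamp."""

    def __init__(self):
        self._timestamps = defaultdict(list)  # key -> sorted commit timestamps
        self._values = defaultdict(list)      # key -> values, same order

    def commit(self, timestamp, writes):
        """Apply all writes of one transaction at a single timestamp."""
        for key, value in writes.items():
            pos = bisect.bisect_right(self._timestamps[key], timestamp)
            self._timestamps[key].insert(pos, timestamp)
            self._values[key].insert(pos, value)

    def read(self, key, at):
        """Return the newest value committed at or before timestamp `at`."""
        pos = bisect.bisect_right(self._timestamps[key], at)
        return self._values[key][pos - 1] if pos else None

    def snapshot(self, keys, at):
        return {k: self.read(k, at) for k in keys}


store = MultiVersionStore()
# Transactions A, B, C are applied here in a different order from their
# timestamps; what a reader sees depends only on the timestamps.
store.commit(30, {"x": "C"})
store.commit(10, {"x": "A", "y": "A"})
store.commit(20, {"y": "B"})

snap_15 = store.snapshot(["x", "y"], at=15)
snap_25 = store.snapshot(["x", "y"], at=25)
snap_35 = store.snapshot(["x", "y"], at=35)
print(snap_15, snap_25, snap_35)
```

A reader at timestamp 25 sees all of A and all of B and none of C, regardless of the order in which the three were applied to this replica.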

Exactly how Paxos works -- all I can say is to go and try understanding it from the papers on it. The original one is kinda witty, but most find it obscure. This is alleged to be a simple description, although I still found it somewhat opaque:


-- 

Greg Pfister

Jim Starkey

Oct 4, 2012, 11:12:24 PM
to cloud-c...@googlegroups.com
On 10/4/2012 5:29 PM, Abhishek Pamecha wrote:

> Yes, it looks like the key design or the directory design is crucial to get good performance out of it OR extract ACID-ness out of it.


No tradeoff is necessary.


> One thought that occurred to me was that a GUID+metadata [say userId] for that system would have to encapsulate (almost) every other ‘sub-directory’. Then are we looking at a single giant directory tree schema, which represents the entire system? Because conversely, if you partition it, you are going to lose consistency.


Partitioning doesn't sacrifice consistency.  Performance, generality, sanity, simplicity, sure, but consistency can be preserved with a two phase commit and/or Paxos.  There are more application-gentle alternatives, but that's a different question.
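
For readers who have not met two phase commit, here is a toy sketch of how an update spanning two partitions can stay consistent. It ignores locking, durability and coordinator failure, all of which a real system has to handle. `Partition`, `two_phase_commit`, and the "no negative balance" invariant are made up for the example.

```python
class Partition:
    """One shard holding part of the data; names here are illustrative only."""

    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)
        self._staged = None

    def prepare(self, updates):
        """Phase 1: validate the updates, stage them, and vote."""
        for key, delta in updates.items():
            if self.data.get(key, 0) + delta < 0:  # invariant: no negative balance
                return False
        self._staged = updates
        return True

    def commit(self):
        """Phase 2, success path: make the staged updates visible."""
        for key, delta in self._staged.items():
            self.data[key] = self.data.get(key, 0) + delta
        self._staged = None

    def abort(self):
        """Phase 2, failure path: discard anything staged."""
        self._staged = None


def two_phase_commit(work):
    """`work` maps each participating partition to its updates."""
    votes = [(p, p.prepare(updates)) for p, updates in work.items()]
    if all(vote for _, vote in votes):
        for p, _ in votes:
            p.commit()
        return True
    for p, _ in votes:
        p.abort()
    return False


europe = Partition("europe", {"alice": 50})
asia = Partition("asia", {"bob": 10})

ok = two_phase_commit({europe: {"alice": -30}, asia: {"bob": +30}})
rejected = two_phase_commit({europe: {"alice": -30}, asia: {"bob": +30}})
print(ok, rejected, europe.data, asia.data)
```

The second transfer is refused by one partition, so neither partition applies it: the data is split, but the combined state never shows half a transfer.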


> Any thoughts?
>
> Thanks,
>
> Abhishek


Gilad Parann-Nissany

Oct 5, 2012, 8:38:34 AM
to cloud-c...@googlegroups.com
Thanks for the help, Greg. With your tips, and after re-reading the Google paper as well as the Paxos paper, I do come away with the understanding that (to use your notation) transactions that execute "A_start, B_start, C_start; A_arrives, B_arrives, C_arrives; A_completes, B_completes, C_completes" are actually being ordered by the PAXOS leaders.  Each PAXOS leader leads a "group", which is apparently a group of geographically "close" servers, and is elected by that group.  That does answer the isolation/consistency question in an exact way. So I was wrong to think consistency is an issue.

As to replication across geographies, these replications between groups must be ordered by the PAXOS leader of the receiving group. Obviously this is a task that takes time, and has a big impact on latency. To control the length of time they take, the Spanner team says "the replication configurations for data can be dynamically controlled at a fine grain by applications. Applications can specify constraints to control which datacenters contain which data, how far data is from its users (to control read latency), how far replicas are from each other (to control write latency), and how many replicas are maintained (to control durability, availability, and read performance)".

My personal answer from reading all this:

The issue is more about latency: the PAXOS leaders need to crunch the incoming transactions fast enough, which is of course a challenge even in one group and more so when replication is done between groups.
They solve some of this, and achieve "cloud scale", by letting their users decide the tradeoff between the amount of time it takes to replicate across many data centers, and the desire to spread the data across a continent or the globe. They make the tradeoff explicit and allow their users a tool for defining their personal tradeoff.
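
A back-of-the-envelope way to see that tradeoff. The simplifying assumption is that a write commits once a majority of replicas (leader included) has acknowledged it, so its latency is roughly the round trip to the slowest replica in the fastest majority. The function name and the round-trip figures are made up for illustration.

```python
def write_latency_ms(rtts_from_leader_ms):
    """Estimate commit latency for one replicated write.

    `rtts_from_leader_ms` lists round-trip times from the leader to every
    replica, with 0 for the leader itself. The write is treated as committed
    when a majority has acknowledged, i.e. at the majority-th smallest RTT.
    """
    majority = len(rtts_from_leader_ms) // 2 + 1
    return sorted(rtts_from_leader_ms)[majority - 1]


# Hypothetical placements: all replicas in one region versus spread worldwide.
regional = [0, 2, 3, 4, 5]
global_spread = [0, 30, 80, 150, 200]

regional_ms = write_latency_ms(regional)
global_ms = write_latency_ms(global_spread)
print(regional_ms, global_ms)
```

Under this model, spreading five replicas across continents raises the per-write cost from a few milliseconds to the distance of the third-closest replica, which is the knob the paper says applications can turn.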

Very good chat. Thanks Greg, and to the group in general.

Regards
Gilad
__________________
Gilad Parann-Nissany
http://www.porticor.com/


Abhishek Pamecha

Oct 5, 2012, 1:13:50 PM
to cloud-c...@googlegroups.com



Sent from my iPad with iMstakes


On Oct 4, 2012, at 22:07, "Jim Starkey" <jsta...@nuodb.com> wrote:

> On 10/4/2012 5:29 PM, Abhishek Pamecha wrote:
>> Yes, it looks like the key design or the directory design is crucial to get good performance out of it OR extract ACID-ness out of it.
>
> No tradeoff is necessary.

 

I would disagree by directly quoting from the paper:

 

---Begin quote section 2.3

This interleaving of tables to form directories is significant because it allows clients to describe the locality relationships that exist between multiple tables, which is necessary for good performance in a sharded, distributed database. Without it, Spanner would not know the most important locality relationships.

 

--end quote 

 

>> One thought that occurred to me was that a GUID+metadata [say userId] for that system would have to encapsulate (almost) every other ‘sub-directory’. Then are we looking at a single giant directory tree schema, which represents the entire system? Because conversely, if you partition it, you are going to lose consistency.
>
> Partitioning doesn't sacrifice consistency.  Performance, generality, sanity, simplicity, sure, but consistency can be preserved with a two phase commit and/or Paxos.  There are more application-gentle alternatives, but that's a different question.

 

 

Going by Brewer's CAP conjecture, if we want to maintain the same availability (which I think would be the case), partitioning will sacrifice consistency.

Also, as per the example in the paper (again section 2.3), it seems that the USERS table would be parent to all the other tables that you may need for your system.  Put conversely, all other tables would need to be interleaved in the USERS table.  Or else how do you store other data related to a user (for example videos, comments, address book, etc.) with locality of data in mind?

 

Any thoughts?

 

Thanks,

Abhishek


