Authors: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
Date: OSDI 2006
Novel idea: Bigtable is a database system which trades off conventional RDBMS features (such as a full relational data model and transactions) in exchange for scalability and high performance. The authors describe Bigtable as a "sparse, distributed, persistent multidimensional sorted map" with compression in which clients can control the locality of their data.
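The "sparse, distributed, persistent multidimensional sorted map" can be sketched as a toy in-memory structure. This is purely illustrative (the class and method names are invented, not Bigtable's API): cells are keyed by (row, column, timestamp), values are uninterpreted bytes, keys are kept in lexicographic order, and the newest timestamp wins on reads, as in the paper's data model.

```python
import bisect

class TinyTable:
    """Toy sketch of Bigtable's map: (row, column, timestamp) -> bytes.
    Keys are kept sorted lexicographically; newest versions come first."""

    def __init__(self):
        self._keys = []   # sorted list of (row, column, -timestamp)
        self._vals = {}

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)  # negate so newest sorts first
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def get(self, row, column):
        """Return the most recent version for (row, column), or None."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._vals[self._keys[i]]
        return None

    def scan(self, row_prefix):
        """Sequentially read all cells whose row key starts with row_prefix.
        Row-key locality (e.g. reversed hostnames like 'com.cnn.www')
        is what makes such scans cheap in the real system."""
        for key in self._keys:
            if key[0].startswith(row_prefix):
                yield key[0], key[1], -key[2], self._vals[key]
```

The reversed-hostname row keys below mirror the paper's Webtable example, where pages from the same domain sort adjacently.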
Main results: As of August 2006, there were 388 non-test Bigtable clusters running across various Google machine clusters. For example, the Google Earth team uses Bigtable to serve tens of thousands of queries per second with low latency from a 500 GB table.
Impact: Bigtable allows application developers to store data in a flexible way while also worrying less about potential scalability issues.
Evidence: The authors benchmarked random and sequential reads/writes, as well as scans, on both a single tablet server and aggregates of up to 500 tablet servers. They found that aggregate performance increased significantly (though not linearly) as the number of tablet servers increased. The authors also described how several internal groups use Bigtable (Google Analytics, Google Earth, and Personalized Search).
Prior work: Bigtable is built on the Google File System (GFS), the Chubby lock service, and a few other pieces of Google infrastructure. In terms of prior art, Bigtable goes beyond distributed B-trees and distributed hash tables, which the authors claim are "too limiting."
Competitive work: Bigtable shares similarities with Boxwood, but where Boxwood aims to provide infrastructure for building higher-level services, Bigtable presents its API directly to application developers. Bigtable is also less flexible than Oracle's Real Application Cluster and IBM's DB2 Parallel Edition (both of which provide a complete relational model with transactions).
Reproducibility: It's not possible to reproduce Bigtable directly, but there are several related/similar open source projects (HBase, Hypertable, Apache Cassandra, Neptune, and KDI).
Question: Is compression as much of a concern nowadays as it was in 2006?
Criticism: The authors note in the "Lessons" section that large distributed systems are vulnerable to many types of failures, and even list a few (memory and network corruption, large clock skew, hung machines, etc.). However, they don't give any details of these scenarios; a few concrete accounts would have been nice. Their choice is understandable given the paper's length constraints, though.
Bigtable: A Distributed Storage System for Structured Data
Authors
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A.
Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber.
Date
7th USENIX Symposium on Operating Systems Design and Implementation
Novel Idea
Bigtable is a distributed storage system in which entries are indexed
by arbitrary row and column identifiers, and a timestamp.
Main Results
The system provides the preceding interface while its design allows it
to reach the multi-petabyte scale on a cluster of machines. The paper
also describes many refinements, such as locality groups and
insightful compression mechanisms, that improve the system's performance.
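The locality-group refinement mentioned above can be sketched as follows. Column families assigned to the same group are stored together (in the real system, in separate SSTables per group), so reading one family never touches the storage of another group. All names here are illustrative, not Bigtable's actual API.

```python
class LocalityGroupedTable:
    """Toy sketch of locality groups: one backing store per group,
    so reads of one column family never consult another group's store."""

    def __init__(self, groups):
        # groups: {"group_name": {"family1", "family2", ...}}
        self._group_of = {fam: g for g, fams in groups.items() for fam in fams}
        self._stores = {g: {} for g in groups}  # stands in for one SSTable each

    def put(self, row, family, value):
        store = self._stores[self._group_of[family]]
        store[(row, family)] = value

    def get(self, row, family):
        # Only the store for this family's group is consulted; a scan of
        # small metadata families never reads the bulky contents group.
        return self._stores[self._group_of[family]].get((row, family))
```

This mirrors the paper's Webtable example, where small page metadata and large page contents live in different locality groups so metadata scans stay cheap.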
Impact
By describing a system conceived for Google's workload, the paper
demonstrates that centralized, structurally simple designs can scale
and be applied in practice, provided the right decisions are made to
alleviate the load on the central components and the overall structure
is tightly coupled to the workload in question.
Evidence
The authors decide on using 6 different kinds of operations to analyze
the system performance. Those are random reads (both when the data
must be fetched from GFS and when it is in memory), random writes,
sequential reads and writes, and scans (sequentially reading all the
values within many rows).
The data shows the system scales: with 500 tablet servers, the rate of
operations per tablet server drops only to between about 1/2 and 1/5
of the single-server rate, depending on the benchmark.
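A quick back-of-the-envelope check makes the scaling claim concrete: even when per-server throughput drops to 1/5 of the single-server rate, 500 servers still deliver 100 times the aggregate throughput.

```python
# Aggregate speedup implied by the per-server rates reported above.
servers = 500
best = servers * (1 / 2)   # per-server rate halves  -> 250x aggregate
worst = servers * (1 / 5)  # per-server rate at 1/5  -> 100x aggregate
print(f"aggregate speedup: {worst:.0f}x to {best:.0f}x")
```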
The paper also shows how real Google applications benefit from the
system, showing its practicality.
However, some aspects related to fault tolerance were not explored in
the experimental section (see section "Questions + Criticism").
Prior Work
The paper relates Bigtable's feature of "locality groups" with a
system called C-Store, which organizes data by column. (Others are
also mentioned in this matter: Sybase IQ, SenSage, KDB+.)
They also mention that C-Store and Bigtable store data in similar
ways, but differ in the API each system provides (C-Store offers a
read-optimized relational model, while Bigtable provides simpler,
throughput-optimized operations).
Competitive Work
The paper cites the Boxwood project as a possible competitor, but they
note that while this project aims at providing a basis for
higher-level services, Bigtable is directly aimed at Google's
application needs.
The paper also cites industry database vendors, such as Oracle and
IBM, but implies that these aim at different goals (namely, the
transactional relational model). Moreover, they mention that DHTs
provide many features that Bigtable did not actually need.
Reproducibility
If the system itself and enough machines were available, the
experimental methodology is reproducible. Some data provided about
fault tolerance, however, is clearly not reproducible (see below).
Question + Criticism
[Criticism] The authors do not provide experimental results regarding
issues related to faults. They provide some data when they describe
Chubby, but the corresponding experiments are not discussed.
[Questions] Without such analysis, many questions could be asked.
What's the overall system availability? How much does it depend on, or
how is it affected by, the availability of Chubby and GFS?
Ideas for Further Work
I wonder whether the system could identify locality groups and tune
their parameters dynamically, based on the requests an application
makes. Perhaps the client library could help with this.
However, it may be that the system was designed on the assumption that
nobody knows an application's optimal locality-group settings better
than the application itself. But if access patterns depend on user
input, such a feature might be useful.
On Wed, Sep 22, 2010 at 9:09 PM, Rodrigo Fonseca
<rodrigo...@gmail.com> wrote:
| Paper Title | Bigtable: A Distributed Storage System for Structured Data |
| Author(s) | Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber |
| Date | 2006 |
| Novel Idea | A distributed data store, built on top of GFS (and with Chubby in between), that provides row/column access for applications that need to structure their data but are willing to trade full SQL access for extreme scalability, low latency, and high availability. |
| Main Result(s) |
They describe Bigtable, which uses a bunch of tricks to achieve its goals. Google wants basically all of its applications to use it, so it provides fairly minimal structure: keys and data are just uninterpreted bytes, everything is stored lexicographically, and it's up to the application to provide locality if it cares to. Rather than a relationally complete query language, they support regexes on row keys and grouping columns into families, but not much else.

They want extreme scalability, which they achieve by partitioning tables into tablets and storing different tablets on different machines, managed by a single (but replaceable) master. The master almost never has to interact with clients, and thus cannot become a bottleneck. Bigtable does not need to provide additional data replication, since it sits on top of Chubby and GFS. It does need to care about fragmentation and tablet-server load, though, so it splits and merges tablets when they cross size thresholds in the appropriate directions.

They devote a good amount of space to optimizations on the basic structure, including: a client-configurable compression scheme that makes it easy to exploit locality (and in some cases the properties of the data, e.g. web pages from the same domain with boilerplate in common) to achieve extremely high compression ratios very quickly; two different kinds of caching, one for commonly accessed cells and one for data near recently accessed data; and Bloom filters that for some applications avoid huge numbers of disk accesses by assuring Bigtable that the requested data isn't in a given part of the table. |
| Impact | I never have anything interesting to write in this section. |
| Evidence |
They do a pretty in-depth performance analysis on a number of axes at each of 1, 50, 250, and 500 tablet servers, giving a good sense of where the bottlenecks are (random reads) and how each is affected by scaling (most get a lot better; random reads still don't scale well). They also give a good overview table of actual applications and their Bigtable usage statistics. This shows that Bigtable is pretty adaptable, as the applications range widely in space usage, compressibility, column density, etc. |
| Prior Work | The most interesting note in their Related Work section is the one where they say that Bigtable's row/column structure is nearly as lightweight as {Chord, Tapestry, Pastry}'s key/value structure (and thus still suitable for widely distributed use) but significantly more powerful and useful for developers. Perhaps those projects would be in more widespread use if they adopted Bigtable's model, and added their trademark ability to function in untrusted and highly variable environments? |
| Reproducibility | They give a little bit of code! Otherwise, same old, same old. |
| Question | Why don't more things use this row/column structure? |
| Criticism | It'd be nice if they explained what Chubby was in a bit more detail at first. It becomes sort of clear as they introduce each way it is used, but the first paragraph they mentioned it in pretty much lost me. |
| Ideas for further work |
Put row/column structure on top of Tapestry, etc.; see if this is more awesome. |
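The filters mentioned in the Main Result(s) above are the paper's Bloom filters: a compact probabilistic set that can answer "definitely not present," letting a tablet server skip a disk read entirely when the requested row/column is absent. A minimal sketch (illustrative only, not Bigtable's implementation; real deployments size the bit array to the expected key count):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: no false negatives, rare false positives."""

    def __init__(self, nbits=1024, nhashes=3):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = bytearray(nbits // 8)

    def _positions(self, key):
        # Derive nhashes independent bit positions from a salted hash.
        for i in range(self.nhashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means the key is certainly absent: no disk access needed.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))
```

In Bigtable the filter is consulted before touching an SSTable on disk, which is why it pays off for read-heavy applications with many misses.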