Paper Title: The Google File System
Author(s): Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Date: 2003
Novel Idea: A distributed filesystem optimized for very large files, high hardware and network failure rates, concurrent appends, sequential reads, and very few random reads and writes.
Main Result(s): They describe GFS, which provides: a concurrency model suitable for performing many simultaneous appends atomically; a master/chunkserver architecture that preserves availability in the face of frequent failures; and optimizations for large files and sequential reads, namely a very large chunk size, a very thin master (from the perspective of the client) whose job is to make good placements behind the scenes using global knowledge but then grant leases to clients quickly and get out of the way, and incremental checksums. (A sketch of the client's view of this follows below.)
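A minimal sketch (in Python, with hypothetical names; the real client
library is not public) of the read path implied by that thin-master
design: one small metadata RPC to the master, cached by the client,
then bulk data streamed directly from a chunkserver.

    CHUNK_SIZE = 64 * 2**20  # GFS uses 64 MB chunks

    def read(master, filename, offset, length, cache={}):
        chunk_index = offset // CHUNK_SIZE
        key = (filename, chunk_index)
        if key not in cache:
            # One small RPC to the master: chunk handle + replica locations.
            cache[key] = master.find_chunk(filename, chunk_index)
        handle, replicas = cache[key]
        # Bulk data flows from a chunkserver, never through the master.
        return replicas[0].read_chunk(handle, offset % CHUNK_SIZE, length)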
Impact: Well, Google exists. That's pretty big, I guess. I have no idea whether Microsoft and the other big search providers use similar technology; it seems likely that they would have similar needs.
Evidence: If the performance of Google's services is not evidence enough, they provide a lot of sandboxed benchmarks for things like read and write throughput, as well as data from actual instances running real Google services. However, the data are so old at this point as to be all but meaningless; by now their hardware is certainly an order of magnitude or more faster than it was, so it's perhaps silly to worry too much about the numbers they state. Even at the time, without a comparison to another system that could function at the same scale, their numbers were hard to use for evaluation purposes. It would be nice to see how performance changed as they made various design decisions, but they do not provide this.
Prior Work: The Linux kernel; AFS and every other distributed file system ever.
Reproducibility: They describe all major components of the system at a high level, but do not go into implementation details.
Question: They say that their hardware fails a lot, but they don't quantify this. What is the relationship between disk failure rates, the number of chunkservers assigned a chunk, and availability? How about network failure rates? (The sketch under Ideas for Further Work below takes a first stab at this.)
Criticism: Honestly, a pretty good paper. It has the same issues as any publication from a company rather than a research lab: they want to protect their IP to some degree. The main thing I found missing I already addressed in my question above.
Ideas for Further Work: Trying to quantify the optimal replication factor given a hardware failure rate and an availability goal (and maybe a budget), as in the sketch below.
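A back-of-the-envelope version of that calculation, assuming
(hypothetically) independent replica failures: if each replica is
unavailable with probability p, a chunk with r replicas is unavailable
with probability p**r, so the smallest adequate r is:

    def min_replication(p, availability_goal):
        # Smallest r such that 1 - p**r meets the availability goal.
        r = 1
        while 1 - p**r < availability_goal:
            r += 1
        return r

    # e.g. p = 0.01 (1% of replicas down at any moment), five nines:
    print(min_replication(0.01, 0.99999))  # -> 3, GFS's default replication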
Authors: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Date: SOSP 2003
Novel Idea: GFS is designed for inexpensive commodity components that often fail and optimized for large (multi-GB) files that are mostly appended to (often concurrently) and then read (rarely randomly). A GFS cluster consists of a single master and multiple "chunkservers," which are monitored using heartbeats and kept consistent via a sophisticated replication procedure.
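A minimal sketch of that heartbeat monitoring, under assumed names and
an assumed timeout (the paper gives neither): the master records the
last heartbeat from each chunkserver and treats servers silent for too
long as dead, which then triggers re-replication of their chunks.

    import time

    HEARTBEAT_TIMEOUT = 60.0  # seconds; an assumed value

    last_seen = {}  # chunkserver id -> time of last heartbeat

    def on_heartbeat(server_id):
        last_seen[server_id] = time.time()

    def dead_servers():
        now = time.time()
        return [s for s, t in last_seen.items() if now - t > HEARTBEAT_TIMEOUT]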
Main Result(s): GFS was able to saturate a 1 Gbps link between two switches while doing reads. After two chunkservers were killed, all chunks were restored to at least 2x replication within 2 minutes. The authors claim that GFS has successfully met Google's storage needs and is widely used within Google.
Impact: GFS showed that reexamining traditional file system assumptions in the light of modern Internet-scale systems could yield tangible performance benefits.
Evidence: The authors present a series of "micro-benchmarks" testing reads, writes, and record appends. They also examine the read/write rates and recovery time of two live GFS clusters. Finally, they describe the distribution of operations on chunkservers by size and by type.
Prior Work: GFS builds on several previous large distributed file systems, such as AFS and NASD.
Competitive work: In contrast to previous approaches, such as Minnesota's GFS and NASD, GFS is based on commodity hardware and does not seek to alter the model of the storage device.
Reproducibility: The source to GFS is not available, and it would not be trivial to replicate such a filesystem based only on the details given in the paper.
The Google File System
Authors
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Date
SOSP'03 - ACM Symposium on Operating Systems Principles, 2003
Novel Idea
The paper presents the Google File System (GFS), a distributed file
system designed around Google's workload characteristics. GFS
provides fault-tolerance and availability guarantees, and it monitors
data integrity through checksums.
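A sketch of the per-block checksumming scheme: the paper says each
chunk is divided into 64 KB blocks, each with its own 32-bit checksum,
so a corrupt replica can be detected at read time and discarded. CRC32
here is my assumption; the paper does not name the checksum function.

    import zlib

    BLOCK_SIZE = 64 * 1024  # 64 KB blocks, as in the paper

    def block_checksums(chunk_data):
        return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
                for i in range(0, len(chunk_data), BLOCK_SIZE)]

    def verify(chunk_data, stored_checksums):
        return block_checksums(chunk_data) == stored_checksums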
Impact
The authors assume a scenario where component failures are common,
decide to optimize append operations on large files, and aim for high
throughput rather than low latency. They provide a solution that is
structured as follows.
A single master simplifies the implementation and can make better
allocation decisions because it has global knowledge. The master keeps
the amount of filesystem metadata small so that everything fits in
memory, and it replicates its operation log to other machines. On
startup, the master replays the operation log to reconstruct the
current filesystem state, and it periodically checkpoints that state
so the log stays small. System interactions are designed to avoid
putting too much load on the master, which handles only metadata.
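A sketch of that recovery path, with assumed data shapes (the real log
and checkpoint formats are not public): load the latest checkpoint,
then replay only the log records written after it, which bounds both
recovery time and log size.

    def apply_op(state, op, args):
        # Two toy namespace operations, just to make replay concrete.
        if op == "create":
            state[args["path"]] = {"chunks": []}
        elif op == "delete":
            state.pop(args["path"], None)

    def recover(checkpoint_state, checkpoint_seqno, log_records):
        state = dict(checkpoint_state)        # snapshot loaded from disk
        for seqno, op, args in log_records:
            if seqno > checkpoint_seqno:      # replay only the log's tail
                apply_op(state, op, args)
        return state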
Updates are performed through a lease system. As designed, concurrent
writes can leave regions in what the authors call a "consistent but
undefined" state, so applications usually work around this problem
using self-validating records. Chunks of data are checksummed and
versioned to help preserve data integrity and consistency.
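A sketch of such a self-validating record, with a framing I am
assuming (length, checksum, and a unique id, which the paper says
writers include so readers can skip padding and drop duplicates):

    import struct, zlib

    def encode_record(record_id, payload):
        header = struct.pack("<QI", record_id, len(payload))
        checksum = zlib.crc32(header + payload)
        return struct.pack("<I", checksum) + header + payload

    def decode_record(buf):
        checksum, record_id, length = struct.unpack_from("<IQI", buf)
        payload = buf[16:16 + length]
        if zlib.crc32(struct.pack("<QI", record_id, length) + payload) != checksum:
            return None  # padding or garbage: the reader skips ahead
        return record_id, payload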
The master re-replicates chunks when the number of available replicas
falls too low, and it rebalances replicas across chunkservers to even
out load and disk usage.
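The paper prioritizes these clones by how far a chunk is below its
replication goal and by whether it is blocking client progress; a toy
scoring function (my paraphrase, not their code) might look like:

    def replication_priority(goal, live_replicas, blocking_client):
        deficit = goal - live_replicas     # how far below the goal
        return deficit + (100 if blocking_client else 0)

    # Chunks are cloned highest-priority first, subject to bandwidth limits.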
Evidence
The authors provide measurements on a small cluster and on production
clusters. On the small cluster, they find that write throughput is
lower than they expect and that record-append operations are limited
by network congestion.
On the production clusters, they characterize their workload very
well and use it to justify some of their design assumptions (for
instance, the choice to optimize record appends). They also show that
the master sees only a light request load, despite the size of the
system.
Prior Work + Competitive Work
The authors cite GFS (another one, "The Global File System") and GPFS
as alternatives that employ distributed algorithms to manage the
filesystem, but they claim their centralized approach is simpler yet
effective (and that centralized knowledge means better decisions).
They also mention NASD, based on network-attached disks, but argue
that GFS is better suited to their needs and is built on commodity
equipment, which probably has a better cost-benefit ratio. Lustre is
cited as a system aiming at similar results, but they claim GFS is
more adequate for their own needs. Other systems are mentioned, but,
again, their own workload characteristics are cited as the motivation
for their design choices.
Reproducibility
It is difficult to reproduce the experiments that ran on production
clusters at Google. However, the paper provides some insight into
their workload, which could be useful if someone wants to compare
performance with GFS, although the data provided in the paper is not a
full characterization (which doesn't seem viable, anyway).
Questions + Criticism
The authors note that the network stack does not interact well with
their pipelined replication scheme. They do not give further
explanation, but I think it is an important issue, as their system is
focused on throughput, and low throughput would deeply affect the
pipeline (see the sketch below). Are these problems caused by TCP
buffering, timeouts, or something else? Did they change anything in
the network stack after reaching this conclusion?
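For reference, the paper gives the ideal elapsed time for pipelining B
bytes through R replicas as B/T + R*L, where T is the per-link
throughput and L is the per-hop latency; with their numbers (100 Mbps
links, latency under 1 ms), 1 MB distributes in about 80 ms.

    def pipeline_time(B_bytes, R, T_bps, L_s):
        # Ideal time from the paper: transfer time plus one latency per hop.
        return B_bytes * 8 / T_bps + R * L_s

    print(pipeline_time(2**20, 3, 100e6, 1e-3))  # ~0.087 s for 1 MB, 3 replicas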
They also show that record appends are limited by network congestion,
even though they optimized that operation given its relevance to their
own needs. They could explain better why this congestion appears: OK,
the clients are all appending to a single file, but why does the
throughput rise again once the number of clients grows past 10 or so?
Ideas for Further Work
As the system employs a central master, it is perhaps viable to avoid
the problem where concurrent writes leave the system in a state where
every client sees the same data, but the data itself is a mingled mix
of fragments from different writes (the "consistent but undefined"
state caused by the lease system). As the authors care more about
throughput than latency, this seems possible to implement reasonably.
Also, as concurrent record appends are common and critical, avoiding
"duplicate" appends (a problem that can happen, according to the
authors) would be useful. Perhaps a specialized master just to
arbitrate these issues could solve the problem; clients would then
need fewer workarounds for this potential issue.