Authors: Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, Peter Vajgel
Date: OSDI 2010
Novel Idea: Haystack is an object storage system that lets one retrieve the filename, offset, and size for a particular file without any disk operations, thus dramatically improving throughput.
Main Result(s): Haystack achieves high throughput and low latency at a lower cost when compared to Facebook's previous NFS-based approach. According to the authors, "each usable terabyte costs ~28% less and processes ~4x more reads per second than an equivalent terabyte on a NAS appliance."
Impact: This case study may encourage the development of further specialized distributed systems for serving large amounts of custom data.
Evidence: Facebook assessed the performance of Haystack using the Randomio and Haystress benchmarks. Additionally, they described the average latency of read and multi-write operations over a three-week period.
Prior Work: Haystack takes after log-structured filesystems as described by Rosenblum and Ousterhout, and it also shares many similarities with the object storage systems proposed by Gibson et al. in NASD. Its notion of a logical volume is similar to Lee and Thekkath's virtual disks in Petal.
Competitive work: Haystack is unique in that it focuses on the long tail of photo requests seen by a social networking web site. In comparison with other metadata management approaches mentioned in Section 5, Haystack is considerably simpler.
Reproducibility: It would be very difficult to reproduce these findings, since any testing environment cannot come close to the proportions of Facebook's production infrastructure. Also, they have not released any code.
Finding a needle in Haystack: Facebook's photo storage
Authors
Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, Peter Vajgel
Date
OSDI 2010 (the paper had not yet been presented when this review was written)
Novel Idea
The paper presents Haystack, Facebook's photo storage system. It reduces
the metadata lookups that were the main limiting factor in Facebook's
previous photo storage system, while still providing high throughput and
low latency.
Impact
Haystack is tailored to Facebook's workload, where photos are written
once, never modified, read often and continuously, and only rarely
deleted. It is a simple architecture with three components: the Store,
the Directory, and the Cache, described below.
The Directory load-balances writes across write-enabled machines,
determines whether a photo request should be handled by a CDN (so the
system still relies on CDNs for popular pictures), and maps (logical
volume, photo ID) pairs to physical volumes.
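As a rough sketch (all names and structures here are my own assumptions, not the paper's code), the Directory's job reduces to a pair of lookups plus a write-balancing choice:

```python
import random

# Illustrative Directory state; field names are assumptions.
logical_to_physical = {          # logical volume -> replica Store machines
    "lv1": ["store-a", "store-b", "store-c"],
}
writable_volumes = ["lv1"]       # logical volumes that are not yet full

def route_read(logical_volume, photo_id, popular):
    # Popular photos are delegated to the CDN; the long tail goes
    # through Facebook's own Cache/Store path.
    if popular:
        return f"http://cdn.example.com/{logical_volume}/{photo_id}"
    machine = random.choice(logical_to_physical[logical_volume])
    return f"http://{machine}/{logical_volume}/{photo_id}"

def route_write():
    # Writes are balanced across write-enabled logical volumes.
    return random.choice(writable_volumes)
```

The CDN decision point is what lets Haystack serve the unpopular "long tail" itself while offloading hot photos.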
The Cache caches (!) photos not handled by the CDN: chiefly new photos
being written on the write-enabled machines (newer photos are accessed
more often).
The Store maintains an in-memory mapping from photo IDs to (volume, disk
offset) pairs. Therefore, it can locate a photo's file, offset, and size
without any disk operations; reading the photo itself then takes at most
one disk access.
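A minimal sketch of this idea (class and field names are hypothetical, not the authors' implementation): the entire index lives in RAM, so a read spends its single disk operation on the photo bytes themselves.

```python
# Hypothetical sketch of a Haystack Store volume's in-memory index.
# Layout and names are illustrative, not the paper's actual code.

class StoreVolume:
    def __init__(self, path):
        self.path = path
        self.index = {}          # photo_id -> (offset, size), kept in RAM

    def write(self, photo_id, data):
        # Needles are only ever appended to the volume file.
        with open(self.path, "ab") as f:
            offset = f.tell()
            f.write(data)
        self.index[photo_id] = (offset, len(data))

    def read(self, photo_id):
        # The metadata lookup is a pure memory operation; the one
        # disk access fetches the photo bytes themselves.
        offset, size = self.index[photo_id]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(size)
```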
Evidence
The paper describes very well their workload characteristics, and
indeed it appears to justify the system design. They provide evidence
of good load distributions (Fig. 8) and cache effectiveness (Fig. 9).
Similarly, it is clearly shown that batching writes increases
performance, and they relate this well to their usage scenario (people
usually upload whole albums rather than single photos).
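The benefit of batching is easy to see: appending an album's photos in one sequential pass amortizes the per-write open and flush cost. A sketch under my own assumptions (this helper is illustrative, not Haystack's actual API):

```python
def multi_write(volume_file, index, photos):
    """Append a batch of (photo_id, data) needles in one sequential pass.

    One open/flush for the whole album instead of one per photo;
    an illustrative sketch, not Haystack's real multi-write.
    """
    with open(volume_file, "ab") as f:
        for photo_id, data in photos:
            index[photo_id] = (f.tell(), len(data))
            f.write(data)
        f.flush()
    return index
```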
They also clearly justify why read operations suffer on write-enabled
machines in the last paragraph of Sec. 4.4.3.
Prior Work
The authors rely on concepts from log-structured file systems, in which
a high-performance append-only log is used to speed up write operations
(the Haystack Cache absorbs reads of new photos, letting write-enabled
machines devote their buffers to writes).
The paper cites many previous studies on metadata management, including
an interesting idea in Ceph (see the Ideas for Further Work section).
They mention Petal as having the same idea of a logical volume and
also cite PNUTS as providing more database functionality than Facebook
needs, among other references.
Competitive Work
The authors mention Google's Bigtable, a distributed storage system, but
are unsure whether it is adequate or flexible enough for Facebook's
needs.
As the system is tightly targeted at Facebook's particular workload, the
number of possible competitors is small.
Reproducibility
The authors provide a good workload characterization in the paper, use
an open-source tool to evaluate the performance of Store machines
(Randomio), and also custom-built a benchmark program called
Haystress.
In any case, it is difficult to reproduce a company's production
environment, but they at least provide benchmarks and good information
about their needs.
Questions + Criticism
[Criticism] I think the authors should discuss in more depth their
particular choice for file system. [Question] Is XFS's variable block
size the most important factor?
It appears that the machines used in their system are more capable than
commodity desktop machines (in contrast to Google's apparent approach).
Does their system actually depend on RAID controllers? Isn't it more
cost-effective to provide availability with additional replicas instead
of RAID-based storage?
Is Haystress available? If it is, I think it is an excellent way to
provide insightful information about their workload.
Ideas for Further Work
Perhaps a custom-made file system would be a good fit for Haystack. The
cost would be high, but it appears that XFS provides much more than they
need.