Review: Haystack

Rodrigo Fonseca

Sep 20, 2010, 3:31:58 PM
to CSCI2950-u Fall 10 - Brown
Please post your review as a group reply to this message.
Thanks,
Rodrigo

Tom Wall

Sep 20, 2010, 7:23:10 PM
to CSCI2950-u Fall 10 - Brown
Finding a needle in a Haystack: Facebook's photo storage
Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, Peter Vajgel

Novel Idea:
Facebook introduces a photo storage system that minimizes disk
accesses as much as possible. They do so by reducing the amount of
per-file metadata so that all file metadata can be kept in
memory. Many photos are stored in a single file to further reduce disk
access. They also take advantage of traits unique to
their photo sharing application to further optimize their system.
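
The core mechanism summarized above (many photos in one large file,
all per-photo metadata in RAM) can be sketched as a toy append-only
store. The class name, record layout, and API below are illustrative
assumptions, not the paper's actual on-disk format:

```python
import struct

class HaystackStore:
    """Toy append-only store: many photos live in one large file and
    per-photo metadata (offset, size) is held entirely in memory, so a
    read costs a single seek + read with no disk metadata lookups."""

    HEADER = struct.Struct("<QI")  # photo_id (8 bytes), payload size (4 bytes)

    def __init__(self, path):
        self.path = path
        self.index = {}              # photo_id -> (offset, size), all in RAM
        open(path, "ab").close()     # create the store file if missing

    def write(self, photo_id, data):
        with open(self.path, "ab") as f:
            f.write(self.HEADER.pack(photo_id, len(data)))
            offset = f.tell()        # payload begins right after the header
            f.write(data)
        self.index[photo_id] = (offset, len(data))

    def read(self, photo_id):
        offset, size = self.index[photo_id]   # pure memory lookup
        with open(self.path, "rb") as f:
            f.seek(offset)                    # exactly one disk seek
            return f.read(size)
```

A real Store would also persist the index for recovery; the point here
is only that the read path touches disk once.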

Main Result:
They see great improvements over their previous solution with
Haystack. They claim that it does 4x more reads per second and is 28%
cheaper per terabyte of storage. Average latencies for both reads and
writes seem reasonable.

Impact:
Nothing too revolutionary has been done in this paper, but like
Dynamo, it is a good case study of how to adapt concepts from the
distributed systems community to suit an application's needs.

Prior/Competitive Work:
Haystack's design was influenced greatly by log-based file systems.
They also note that their storage mechanism is similar in
architecture to NASD and OBFS. It is also similar to GFS in that the
system is optimized for appends rather than random writes.

Reproducibility:
Similar to the last couple of papers we've seen from large
corporations, there is not enough detail to easily reproduce either
the Haystack system or their tests.

Question:
They mention that a photo will only be cached if the request comes
from a write-enabled server, with the intuition that recently written
photos are more popular. Wouldn't it be easier/better to use some
sort of timestamp on the photo? We don't learn how quickly a write-
enabled server fills up, but it seems like the last few images put to
disk won't be cached as long as the first images, even though they
would be just as popular.

Are the benefits of using RAID 6 worth the effort? It might improve
read throughput, but there is a penalty to write speed (and cost), and
they already have replication.
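
The timestamp-based alternative raised above could look roughly like
this minimal sketch; the class, the admission rule, and the 48-hour
window are all hypothetical, not anything the paper specifies:

```python
import time

RECENCY_WINDOW = 48 * 3600  # assume "recent" means uploaded in the last 48h

class TimestampCache:
    """Toy cache that admits a photo based on its upload timestamp,
    instead of on whether the request came from a write-enabled
    Store machine (the policy the paper describes)."""

    def __init__(self):
        self.cache = {}  # photo_id -> photo bytes

    def maybe_admit(self, photo_id, data, uploaded_at, now=None):
        now = time.time() if now is None else now
        # Cache only recently uploaded photos, regardless of which
        # server happened to serve the request.
        if now - uploaded_at <= RECENCY_WINDOW:
            self.cache[photo_id] = data
            return True
        return False
```

Under this policy the last photos written to a filling volume get the
same cache treatment as the first ones, which is the reviewer's point.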

Criticism:
Some of the optimizations in section 3.5 could use a better
explanation. For example, during compaction, how do they perform the
atomic switch between the old store and the new store?

Figure 7 and its discussion seem like a waste of space. It makes sense
that photos are read most right after they are written, and that space
could have been used for some more interesting metrics.

Shah

Sep 20, 2010, 10:28:29 PM
to CSCI2950-u Fall 10 - Brown
Title:

Finding a Needle in Haystack: Facebook's Photo Storage

Authors:

[1] Doug Beaver
[2] Sanjeev Kumar
[3] Harry C. Li
[4] Jason Sobel
[5] Peter Vajgel

Source and Date:

N/A

Novel Ideas:

Haystack stores all metadata in the main memory. It also comprises
three components that interact in a novel manner to provide a unique
photo storage solution: Store, Directory and Cache - each with a very
specific function. As the authors mention, the pillar idea in Haystack
is the ability to get a photo and its corresponding offset without any
disk operations.

Main Result:

This paper details the design and implementation of Facebook's
Haystack - an online photo storage system that has high throughput and
low latency, is fault-tolerant, cost-effective and "simple".

Impact:

As the authors clearly state, Facebook is the largest online photo
repository. With well over 500 million users, the proper functioning
of Haystack matters to the social lives of many.

Evidence:

The authors bolster their work by performing a series of experiments
that are divided into four sections. As is the case with papers
written by corporations, not enough granularity is provided -
especially on the methodology front.

Prior Work:

The authors mention that, to the best of their knowledge, Haystack is
a completely new design.

Competitive Work:

The authors give a list of several other references that can manage
small files and metadata more efficiently than Haystack. They also
list several other contemporary examples in Section 5, under the
paragraph labeled "Managing metadata", including Spyglass, GLIMPSE and
GIGA+. On the distributed filesystems side, it has aspects that are
similar to Petal, the Boxwood project, PNUTS and the Google File
System.

Reproducibility:

As is clear by now, papers published by corporations purposely mask
too many details, making their experiments impossible to reproduce.

Question:

Haystack is designed specifically to handle photos. Does Facebook
envision themselves designing applications that handle other specific
file types (like music)?

Criticism:

What do the authors mean by "magic number used for recovery" in the
caption for Figure 5? Is this information proprietary? If so, the
authors should state this clearly. Also, the timeline presented in
Figure 8 covers only a week; is this enough? As is the case with
papers written by companies, not enough detail is provided to
reproduce or verify the experiments. There is a typo in Section 5: one
paragraph is labeled "Distrbuted filesystems" instead of "Distributed
filesystems". (P.S.: This typo is fixed in the latest version of the
paper.)

Ideas for Further Work:

Same as Question



James Chin

Sep 20, 2010, 10:57:16 PM
to CSCI2950-u Fall 10 - Brown
Paper Title: “Finding a needle in Haystack: Facebook’s photo storage”

Authors(s): Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, and
Peter Vajgel

Date: 2010

Novel Idea: This paper describes Haystack, an object storage system
optimized for Facebook’s Photos application. Haystack provides a less
expensive and higher performing solution than Facebook’s previous
approach of NAS appliances over NFS by reducing per-photo metadata so
that all metadata lookups can be performed in main memory.

Main Result(s): Haystack now provides a fault-tolerant and simple
solution to photo storage at dramatically less cost and higher
throughput than a traditional approach using NAS appliances. Also,
Haystack is incrementally scalable, which is necessary as Facebook
users upload hundreds of millions of photos each week.

Impact: CDNs do effectively serve the hottest photos -- profile
pictures and photos that have been recently uploaded, but a social
networking site like Facebook also generates a large number of
requests for less popular (often older) content, which is referred to
as the “long tail.” While it would be very convenient to cache all of
the photos for this long tail, doing so would not be cost-effective
because of the very large cache sizes required. Haystack is an
example of a system that deals with this issue by serving a long tail
of requests for photos in a cost-effective way.

Evidence: This paper demonstrates the effectiveness of the Haystack
Directory, Cache, and Store through synthetic and production
workloads.

Prior Work: Haystack takes after log-structured filesystems that
Rosenblum and Ousterhout designed to optimize write throughput with
the idea that most reads could be served out of cache. While
measurements and simulations have shown that log-structured
filesystems have not reached their full potential in local
filesystems, the core ideas are very relevant to Haystack.

Competitive Work: Haystack’s architecture shares many similarities
with object storage systems proposed by Gibson et al. in
Network-Attached Secure Disks (NASD).  The Haystack Directory and
Store are perhaps most similar to the File and Storage Manager
concepts, respectively, in NASD that separate the logical storage
units from the physical ones.

Reproducibility: It would be difficult to reproduce the findings in
this paper, as Haystack is built specifically for Facebook, a social
networking site with a massive scope.

Question: What object storage systems are used by other large photo
sharing web sites? Would those sites benefit more from a system like
Haystack?

Criticism: Although Facebook photo URLs are obscure, they are
completely open to the public.

Ideas for further work: Implement the calculation of metadata (instead
of looking it up).


Abhiram Natarajan

Sep 20, 2010, 9:41:04 PM
to brown-csci...@googlegroups.com
Paper Title: Finding a needle in Haystack: Facebook's photo storage

Author(s): Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, Peter Vajgel

Date: 2010, OSDI

Novel Idea: Reducing per photo metadata so that storage machines can perform 
all metadata lookups in main memory; merging the photo serving tier and storage 
tier into one physical tier

Main Results: Haystack is/has:
                     (a) High throughput and low latency
                     (b) Fault-Tolerant
                     (c) Cost-effective
                     (d) Simple

Impact: The system is highly specific to Facebook and its usage patterns. It is
efficient and facilitates optimal retrieval of photos for Facebook. One also gets to
see how to build large, available, inexpensive systems that scale.

Evidence: They provide a thorough treatment of this issue by (a) characterizing the 
photo requests seen by Facebook, (b) showing the effectiveness of the cache, (c) showing 
the effectiveness of the directory, and (d) analyzing how well the store performs 
using synthetic and production workloads.

Prior Work: (a) log-structured filesystems
                  (b) Gibson's object storage systems for NASD
                  (c) OBFS
                  (d) GLIMPSE
                  (e) GIGA+
                  (f) Virtual Disks in Petal
                  (g) Boxwood project

Competitive Work: No direct comparison to old work; the system is one of a kind.

Reproducibility: Reproducible in parts.

Question: Can we have more details about the kind of heuristics that are in place when
photos are read, such as caching (since chunks of photos are typically accessed
together)? Also, what happens in case of failure of Haystack servers?

Criticism: They really should provide more fine-grained details.

Visawee

Sep 20, 2010, 9:55:19 PM
to CSCI2950-u Fall 10 - Brown
Paper Title :
Finding a needle in Haystack: Facebook’s photo storage

Author(s) :
Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, Peter Vajgel,
Facebook Inc.

Date :
2010

Novel Idea :
A photo storage system that is custom-made for Facebook's workload. It
reduces latency by reducing the number of disk accesses: by shrinking
each photo's metadata, it is able to store all metadata in memory.
It is also designed for high throughput and incremental scalability.

Main Result(s) :
The system is able to scale to serve huge workload very well with low
latency. It also has higher throughput at less cost than a traditional
approach Facebook was using (NAS appliances).

Impact :
A scalable object storage system that is specifically designed for
objects that are written once, read often, never modified, and rarely
deleted.

Evidence :
The authors show experimental results obtained from real and
synthetic workloads.
The results from the real workload show that the load-balancing
approach in the system works very well, that the cache hit rate of
newly added photos is very high, and that the system has high
throughput and low latency.
The synthetic workload shows that the throughput of the system is very
near the maximum sustainable throughput of the storage.
It also shows that the multi-write approach works very well in
increasing the throughput of the system as well as reducing access
latency.

Prior Work :
Haystack shares some ideas with log-structured file systems. It
also uses XFS on its Store machines.
Competitive work :
The PNUTS database provides more features and stronger guarantees;
however, those additional features are not needed by Facebook, and
Haystack is more optimized for photo storage.
GFS targets a workload consisting mostly of append operations and
large sequential reads, while Haystack targets a workload with few
writes and many small random reads.

Reproducibility :
The results are not reproducible. The authors didn't provide all
implementation details of Haystack. They left out some details, for
example, how the Haystack Directory distributes workload among
physical volumes.
Some experiments are based on real traffic from Facebook and cannot
be reproduced without help from Facebook.

Question :
What policy does Facebook use to select between redirecting a
request to the CDNs and to the Haystack Cache?
Are there any file systems that would serve Haystack better than XFS?

Criticism :
The authors mentioned that the Haystack Cache will let them
incrementally reduce their dependence on CDNs. However, this means
that, in order to achieve throughput and latency as good as with
CDNs, they might have to have more of their cache sites geographically
distributed around the world. Would the maintenance cost outweigh the
cost of using CDNs?

Ideas for further work :
Use the observation that when people view a photo in a particular
album, they also view other photos in that album, to optimize read
operations.
If we store every photo in an album in a contiguous region of a
physical volume, we can then retrieve every photo in that album with a
single disk access and put them in the cache.
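
This album-contiguity idea can be sketched as a length-prefixed record
layout that one sequential read recovers in full; the framing format
and function names below are purely illustrative:

```python
import struct

def pack_album(photos):
    """Lay an album's photos out contiguously as length-prefixed
    records, so they occupy one region of a physical volume."""
    blob = b""
    for data in photos:
        blob += struct.pack("<I", len(data)) + data
    return blob

def unpack_album(blob):
    """A single sequential read of `blob` recovers every photo in the
    album, which could then be inserted into the cache together."""
    photos, pos = [], 0
    while pos < len(blob):
        (size,) = struct.unpack_from("<I", blob, pos)
        pos += 4
        photos.append(blob[pos:pos + size])
        pos += size
    return photos
```

One disk access then warms the cache for the whole album, matching the
observed view-an-album-together access pattern.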

Zikai

Sep 20, 2010, 4:05:34 PM
to CSCI2950-u Fall 10 - Brown
Paper Title: Finding a needle in Haystack: Facebook’s photo storage
Author(s): Doug Beaver (Facebook) et al.
Date/Conference: Draft, 2010

Novel Idea:
(1) An excessive number of disk operations when accessing photo-file
metadata is the main obstacle to high throughput and low latency
for photo reads. Therefore, by effectively reducing metadata per
photo so that all metadata can be kept in main memory for fast
lookup, disk operations per photo read can be greatly decreased.
(2) CDNs by themselves do not provide a practical solution to serving
photos on a social networking site.
(3) Focusing only on caching has limited impact on reducing disk
operations for reads in a social network's photo service.
(4) Based on Facebook's request pattern (photos are most heavily
accessed soon after they are uploaded), combine CDN caching and the
Haystack Cache to shoulder most (90%) of requests and decrease the
burden on the Haystack Store.

Main Results: (1) Design and implement Haystack, an object storage
system that successfully achieves efficient storage and retrieval of
billions of photos with high throughput, low latency, fault-tolerance
and cost-effectiveness.
(2) Generalize important lessons from building and scaling the system.
(3) Characterize requests to Facebook’s photo sharing application.

Impact: Provides a new design point and template for large networking
services with a long tail of photo requests. Others can learn from
both Haystack’s design and the lessons Facebook's developers learned
in the process.

Evidence: (1) In part 4.1, Facebook’s photo requests are
characterized. These features are the assumptions that Haystack is
built on.
(2) In part 4.2 and 4.3, figures from production workloads are used to
show directory’s load balancing and cache’s high hit rate.
(3) Part 4.4 uses synthetic workloads (Randomio and Haystress
benchmark) and production workloads to prove Haystack store can
effectively handle long tail of random photo reads with high
throughput and low latency.

Prior Work: Part 5 discusses related work in detail including the
fields of file systems, object-based storage, metadata managing and
distributed system.

Reproducibility: Compared with the Dynamo paper, though the Haystack
paper is also from a large business with its own business interests
and secrets, Haystack is much more reproducible because of its simple
design (especially compared with Dynamo) and perhaps the company’s
different policy. The main system components (directory, cache and
store), their interactions and the main mechanisms within components
are well illustrated. Though not every detail is covered (e.g. how the
replicated database is set up for the directory, what kind of
distributed hash table is used for the cache, cache eviction policies,
etc.), one can explore them while implementing the system. For the
evaluation part, Randomio can be easily accessed. The only problem is
that Haystress and Facebook's production workload may not be
simulated.

Question: (1) The Haystack directory must handle a heavy workload and
be fault-tolerant as well. Which replicated database can provide (1)
high throughput, (2) low latency, and (3) fault-tolerance?
(2) Which distributed hash table is suitable for the Haystack cache?
With what kind of cache eviction policy?

Criticism: Haystack is a perfect example of the design principle that
simplicity is power.


Basil Crow

Sep 20, 2010, 11:59:42 PM
to brown-csci...@googlegroups.com
Paper Title: Finding a needle in Haystack: Facebook's photo storage

Authors: Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, Peter Vajgel

Date: OSDI 2010

Novel Idea: Haystack is an object storage system that allows one to retrieve the filename, offset, and size for a particular file without any disk operations, thus dramatically improving throughput.

Main Result(s): Haystack achieves high throughput and low latency at a lower cost when compared to Facebook's previous NFS-based approach. According to the authors, "each usable terabyte costs ~28% less and processes ~4x more reads per second than an equivalent terabyte on a NAS appliance."

Impact: This case study may encourage the development of further specialized distributed systems for serving large amounts of custom data.

Evidence: Facebook assessed the performance of Haystack using the Randomio and Haystress benchmarks. Additionally, they described the average latency of read and multi-write operations over a three-week period.

Prior Work: Haystack takes after log-structured filesystems as described by Rosenblum and Ousterhout, and it also shares many similarities with object storage systems proposed by Gibson et al. in NASD. Its notion of a logical volume is similar to Lee and Thekkath's virtual disks in Petal.

Competitive work: Haystack is unique in that it focuses on the long tail of photo requests seen by a social networking web site. In comparison with other metadata management approaches mentioned in Section 5, Haystack is considerably simpler.

Reproducibility: It would be very difficult to reproduce these findings, since no testing environment can come close to the proportions of Facebook's production infrastructure. Also, they have not released any code.

Hammurabi Mendes

Sep 20, 2010, 5:50:18 PM
to brown-csci...@googlegroups.com
Paper Title

Finding a needle in Haystack: Facebook's photo storage

Authors

Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, Peter Vajgel

Date

Sometime in the future!

Novel Idea

The paper presents Haystack, which is a photo storage system used in
Facebook. The system decreases metadata lookups that were the main
limiting factor in Facebook's photo storage system, while still
providing high throughput and low latency.

Impact

Haystack is appropriate for Facebook's workload, where photos are
written once, never modified, read often and continuously, and rarely
deleted. It's a simple architecture with three components: the Store,
the Directory, and the Cache. This is accomplished by the structure
described below.

The Directory load balances writes across write-enabled machines,
determines if a photo should be handled by a CDN (so the system also
relies on CDNs for popular pictures), and maps (logical volumes, photo
IDs) -> (physical volumes).

The Cache caches (!) photos not handled by the CDN: those are new
photos being written on the write-enabled machines (new photos are
accessed more).

The Store maintains an in-memory mapping between photo IDs and
(volumes, disk offsets). Therefore, they retrieve file and offset
locations of a photo with few disk operations.

Evidence

The paper describes their workload characteristics very well, and the
workload indeed appears to justify the system design. They provide
evidence of good load distribution (Fig. 8) and cache effectiveness
(Fig. 9). Similarly, it is clearly shown that batching writes
increases performance, and they relate it well to their usage scenario
(people usually upload albums instead of single photos).

They also justify clearly the reasons why read operations are impacted
on write-enabled machines on the last paragraph of Sec. 4.4.3.

Prior Work

The authors appear to rely on the concepts employed in log structured
file systems, in which a high-performance log is used to speed up
write operations (the Haystack Caches alleviate read operations,
giving writes full buffer access).

The paper cites many previous studies on how to deal with metadata
management, including an interesting idea in Ceph (see the Ideas for
Further Work section).

They mention Petal as having the same idea of a logical volume and
also cite PNUTS as providing more database functionality than Facebook
needs, among other references.

Competitive Work

The authors mention Google's Bigtable, which is a distributed storage
system, but are not sure whether it is adequate or flexible enough for
Facebook's needs.

As the system is extremely targeted at Facebook's particular workload,
the number of possible competitors is definitely restricted.

Reproducibility

The authors provide a good workload characterization in the paper, use
an open-source tool to evaluate the performance of Store machines
(Randomio), and also custom-built a benchmark program called
Haystress.

Anyway, it is difficult to reproduce the production environment of a
company, but they at least provide some benchmarks and good
information on their needs.

Questions + Criticism

[Criticism] I think the authors should discuss their particular
choice of file system in more depth. [Question] Is XFS's variable
block size the most important factor?

It appears that the machines used in their system are more capable
than commodity desktop machines (in contrast to Google's apparent
approach). Does their system actually rely on having RAID controllers?
Isn't it more cost-effective to provide availability with more
replicas instead of RAID-based storage?

Is Haystress available? If it is, I think it is an excellent way to
provide insightful information about their workload.

Ideas for Further Work

Perhaps using a custom-made file system could be a good idea for
Haystack. The costs are high, but it appears that XFS actually
provides much more than they need.

Matt Mallozzi

Sep 20, 2010, 11:59:37 PM
to brown-csci...@googlegroups.com
Matt Mallozzi
9/20/10

Title:
Finding a needle in Haystack: Facebook's photo storage
Authors:
Beaver, Kumar, Li, Sobel, Vajgel
Date:
2010
Novel Idea:
Moving metadata lookups to memory, rather than forcing them to read from
disk.
Main Results:
A working system that has been used in production at Facebook for 18 months,
drastically improving performance over the previous (conventional) system
of network attached storage accessed via NFS.
Impact:
A huge win for Facebook - photos are the most frequently accessed data at
Facebook, so this greatly increases Facebook's throughput and helps to make
user perceived latency as low as possible.
Evidence:
Haystack has survived under the ridiculous production loads of Facebook
Photos for 18 months. Some numbers are also provided that show production
and simulated performance.
Prior Work:
Haystack draws significant inspiration from log-structured filesystems.
Although Haystack is a distributed filesystem, it does not really resemble
any previous distributed filesystem.
Competitive Work:
The comparisons to related systems are mainly limited to features and
implementation details, rather than to performance. Hard numbers
representing the benefits of Haystack over the old NAS/NFS system were
barely mentioned - mostly just in the form of how different design decisions
affected the common amount of disk accesses required per operation.
Reproducibility:
Again, no source code available. Although some implementation details were
left out, the design is simple enough that the results are probably fairly
reproducible.
Question:
Would it be possible to store all of the album-sized photos of an album in
the same file, so that an album can be fetched all at once? (And similar
suggestions based on knowledge of how different sized photos are used in
certain contexts)
Criticism:
Each of the graphs (when applicable) should have shown the same metrics with
the same load being put on the previous NAS/NFS system. This would be
especially useful to startups and other young companies, which probably
start with a similar generic approach to storage and may be considering
putting the resources towards developing a more application-specific system.
Ideas for Further Work:
Modify the Store/Cache so that newly uploaded photos are pushed to the Cache
in anticipation of many upcoming reads.

Joost

Sep 21, 2010, 10:57:13 AM
to CSCI2950-u Fall 10 - Brown
Paper: Finding a needle in Haystack: Facebook’s photo storage
Authors: Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, Peter
Vajgel,
Date: Draft 2010
Novel Idea: Given the constraints of the Facebook database, namely a
distributed database in which (at least for photo storage) files are
written once and read many times, the Haystack system seeks to
retrieve photos from its 2.3 petabytes of photo data as quickly as
possible.
Main Results: The system they implemented seeks to decrease the
amount of time it takes to access a particular photo from the server
by limiting the number of disk reads needed to one, as opposed to the
three that happen in their standard NAS-based setup. Furthermore, from
observations that an image is most viewed in the two days or so after
it is uploaded, Facebook balances the load of serving images between
the CDNs associated with Facebook and the new Haystack system.
Impact: The methods Facebook uses to achieve the throughput and
requests per second this paper reports, while a dramatic improvement
over their old system, seem limited to social networking sites in
terms of future usability. However, the load-balancing techniques
between CDNs and Facebook servers that this paper presents as a
corollary offer an avenue for a larger number of projects to expand
and further develop.
Evidence: The authors present benchmark tests to demonstrate that the
new system is in fact better than the old one, though no real
comparison to outside systems is made. They also show usage data that
helped shape their design philosophy (namely the cumulative
distribution of data accesses over time), and a table near the end
showing that only about 10% of all image requests ever reach Haystack;
the rest are handled by the CDNs used by Facebook.
Prior Work: The design built heavily on the typical NASD data
structure, but took items such as the use of in-memory metadata
mapping from GFS.
Competitive Work: For this particular data storage problem, none
really.
Reproducibility: Theoretically the paper provides enough groundwork to
recreate the data structure described in it. However, without the
exact nature and procedure of the coupling with a CDN, and without
Haystress to perform similar benchmark stress tests, it would be
difficult to compare any results from a system built on this model to
Haystack.
Question: Again, what kind of algorithm goes into the load balancing
between the CDN and Haystack? In particular, is it predictive or
reactive? And finally, how much has this system really been tested and
optimized (also in cache usage) if the majority of photos served do
not come from the Haystack structure directly, and when they do, it's
in the tail of the cumulative distribution of views?
Criticism: No benchmarks or tests were done on system recovery of any
kind, nor was mass failure explicitly simulated in any of the tests
they ran.
Future Work: A more thorough investigation of coupling data
structures like this to CDNs, and of the load sharing, would be really
interesting.
