happstack scales, SQL fails!


stepcut

Jun 17, 2009, 7:39:26 PM
to HAppS
Hello,

This is in response to comments in the future of happstack thread, but
I hope that it will generate enough discussion that it will deserve
its own thread.

I have seen a number of people (in that thread and other places)
express the feeling that happstack's persistence layer is fine for
small applications, but they are concerned that it does not scale to
midsize applications, and that they prefer to use something like
MySQL.

My counter-argument is that these people are not looking far enough
down the line of scalability, and that in fact, happstack is perhaps
the only technology that has the potential to smoothly scale from very
small to very large. If you look at sites like Google, Amazon, and
Facebook, you will very clearly see that MySQL/SQL does not scale.

To really understand what I am talking about, I highly recommend
watching this presentation on the facebook architecture:

http://www.infoq.com/presentations/Facebook-Software-Stack

You will learn some very interesting things about how facebook works:

- their MySQL tables do *not* use ACID transactions (too slow)

- their SQL tables have only two columns; they are just key/value
pairs.

- they do *not* do SQL joins -- too slow. Instead they return the
rows to PHP and do the 'joins' in PHP. Obviously, this takes more CPU
power (because there is no way that PHP is faster than MySQL). But it
turns out that scaling by adding more database servers to the pool is
really hard, while adding more PHP frontend servers is easy and does
scale.

- in order to get their queries fast enough they have 800 memcached
servers with 28 terabytes of RAM to cache SQL query results. (As of
December 2008; they were looking to buy, or had recently bought, 50,000
new servers, so they probably have even more now.)

- despite all those memcached and SQL servers, they still need
custom code which archives data out of their MySQL databases into
a different long-term storage medium (with longer access times) to
keep their MySQL databases running fast enough.

- in order to make their image serving platform (haystack) fast
enough, they don't even use a normal filesystem. They store all the
meta-data for each image in RAM. The meta-data includes the block
number of the image on the disk and the length. This way serving an
image only requires one disk seek instead of two.
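
As a very rough Haskell sketch of that last idea (hypothetical types and
names, not Facebook's actual haystack code): keep each photo's byte
offset and length in an in-memory map keyed by photo id, so serving an
image needs only a single seek into one big volume file.

import qualified Data.ByteString as BS
import qualified Data.Map as Map
import           Data.Map (Map)
import           System.IO

-- Hypothetical metadata kept entirely in RAM: where each photo lives
-- inside one big append-only volume file on disk.
data PhotoMeta = PhotoMeta
  { photoOffset :: Integer   -- byte offset within the volume file
  , photoLength :: Int       -- length of the image in bytes
  }

type PhotoIndex = Map String PhotoMeta   -- photo id -> location

-- Serve a photo with a single disk seek: look up the location in RAM,
-- then read exactly that many bytes from the volume file.
fetchPhoto :: Handle -> PhotoIndex -> String -> IO (Maybe BS.ByteString)
fetchPhoto volume index photoId =
  case Map.lookup photoId index of
    Nothing                  -> return Nothing
    Just (PhotoMeta off len) -> do
      hSeek volume AbsoluteSeek off
      Just `fmap` BS.hGet volume len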

It's pretty clear that Facebook is not using MySQL as a transactional,
relational database. They are basically just using it as a thin
wrapper around an on-disk B-tree store, not unlike BerkeleyDB. I
suspect they would prefer to use something like BerkeleyDB directly
instead of MySQL, except that the cost of making the switch is too
big.

So what can happstack take away from this?

First, to have a very large responsive site, you have to do away with
disk access as much as possible. I forget the statistic, but I think
they have a memcached hit rate of > 90% (i.e., > 90% of the time they
can get the data from memcached instead of having to hit the database
server).
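
To make the caching idea concrete, here is a minimal Haskell sketch of
the cache-aside pattern they rely on (an IORef-held Map stands in for
memcached and an arbitrary IO action stands in for the database query;
the names are made up for the example):

import qualified Data.Map as Map
import           Data.IORef

-- Look a key up in the cache first; only on a miss do we run the
-- "database" action, and we then populate the cache for next time.
cachedLookup :: Ord k => IORef (Map.Map k v) -> (k -> IO v) -> k -> IO v
cachedLookup cache queryDb key = do
  m <- readIORef cache
  case Map.lookup key m of
    Just v  -> return v                        -- cache hit: no database access
    Nothing -> do                              -- cache miss: query the database
      v <- queryDb key
      modifyIORef cache (Map.insert key v)     -- remember the result
      return v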

Second, even with a sophisticated database, a general purpose disk
storage algorithm is not enough. You ultimately need a system that is
customized for your app and knows the data access patterns and when to
move things to slower/faster access locations.

Third, ACID transactions really don't scale (Amazon does not use them,
and as far as I know neither does Google's MapReduce).

Fourth, a one-size-fits-all solution might scale to a medium-sized
site, but won't scale to really big. Hence, a good solution is
likely to be very modular and very extensible, so that it can be
customized to the specific needs of the site.

If you look at the Facebook architecture, it seems like they are
doing things fairly backwards. Their desire is to have all the
relevant data in RAM, but their persistent storage layer (aka MySQL)
is inherently based on disk storage. In order to get what they want
they have to have a fancy caching layer, and do a lot of tweaking and
patching of memcached and MySQL to try to get the right things into
the cache. Furthermore, they have to prune things from the SQL
database to keep it responsive enough.

It seems like what they really want is a system which starts with all
the data in RAM (where they want it) and gives them explicit control
over taking data out of RAM and storing it on disk based on the usage
patterns and metrics that are specific to their site. (And, when
storing to disk, they would probably forgo having a normal filesystem,
instead just storing block numbers in the RAM based persistence
layer).
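
Just to make the shape of such a system concrete, here is a tiny
Haskell sketch (the names and the naive on-disk format are made up;
this is not happstack-state's API): the working set lives in RAM, and
the application itself decides when an entry is demoted to disk and
replaced by a reference.

import qualified Data.Map as Map
import           Data.Map (Map)
import           Data.IORef

-- An entry is either resident in RAM or has been archived to disk,
-- in which case we only remember where to find it.
data Entry v
  = InRam  v
  | OnDisk FilePath

type Store v = IORef (Map String (Entry v))

-- Demote one key to disk. The application calls this based on its own
-- usage patterns and metrics; the store itself does not guess.
archive :: Show v => Store v -> (String -> FilePath) -> String -> IO ()
archive store pathFor key = do
  m <- readIORef store
  case Map.lookup key m of
    Just (InRam v) -> do
      let path = pathFor key
      writeFile path (show v)                          -- naive on-disk format
      writeIORef store (Map.insert key (OnDisk path) m)
    _ -> return ()                                     -- absent, or already on disk

-- Look a key up, transparently reloading it from disk if it was archived.
fetch :: Read v => Store v -> String -> IO (Maybe v)
fetch store key = do
  m <- readIORef store
  case Map.lookup key m of
    Just (InRam v)     -> return (Just v)
    Just (OnDisk path) -> (Just . read) `fmap` readFile path
    Nothing            -> return Nothing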

So, I would argue that happstack-state is, in fact, the right
approach. It works for small sites, and it is what very large sites
end up implementing in an ad hoc way. If you have a site which you hope
will grow from small to very large, you would ideally want to use a
technology which can scale from small to very large -- because you are
not going to have time to do a rewrite once your site starts to take
off (something they talk about in the presentation a fair bit).

The problem with happstack-state is not that it is the wrong approach,
but that it is very incomplete. It does not yet provide all the pieces
that you need to build your application's scalable persistence layer.

For example, I think it is sensible to expect that many sites (even
ones with 28TB of RAM) will need to store things on disk. So, clearly
there needs to be a sensible way to do that using happstack-state.
Some people have suggested that happstack-state should provide this
functionality automatically. I disagree slightly. I think that
happstack-state should provide the hooks so that it can be done, *and*
it should provide a default implementation/policy which uses those
hooks so that you can use it in your app. But later, you should be
able to replace that mechanism with a custom one that is fine-tuned to
your specific site. Forcing the use of a specific policy won't scale;
instead, people will be forced to build some ad hoc mechanism on top
of it.
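
As a rough illustration of what I mean by hooks plus a replaceable
default policy (this is only a sketch of the idea, with invented names,
not a proposal for the actual happstack-state interface), the archiving
decision could simply be a value you hand to the framework, with a
default that anyone can later swap out for one tuned to their site:

import Data.Time (UTCTime, diffUTCTime)

-- What the framework could tell the policy about a value.
-- (Field names here are made up for the example.)
data Usage = Usage
  { lastAccess  :: UTCTime
  , accessCount :: Int
  , sizeInBytes :: Int
  }

-- The hook: given the current time and what we know about a value,
-- decide whether it should be moved out of RAM to slower storage.
newtype ArchivePolicy = ArchivePolicy
  { shouldArchive :: UTCTime -> Usage -> Bool }

-- A default policy the framework could ship: archive anything that has
-- not been touched for an hour. Good enough to start with.
defaultPolicy :: ArchivePolicy
defaultPolicy = ArchivePolicy $ \now u ->
  now `diffUTCTime` lastAccess u > 3600

-- A site-specific replacement: keep small or frequently accessed
-- values in RAM, archive big cold ones sooner.
myPolicy :: ArchivePolicy
myPolicy = ArchivePolicy $ \now u ->
  (sizeInBytes u > 64 * 1024 && accessCount u < 5)
    || now `diffUTCTime` lastAccess u > 24 * 3600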

I should note that Alex and company were reading the reports from very
large sites like Amazon when they designed happstack-state. So it is
not luck that happstack-state seems set up to scale really big. It has
always been an essential part of the plan.

I will happily agree, though, that if you expect your site to be
medium-sized in the next couple of months, MySQL does look pretty
attractive -- because it is here today :)

- jeremy



Matthew Elder

Jun 17, 2009, 8:34:40 PM
to HA...@googlegroups.com
Well said; the state system is what piqued my interest initially. We
just need some more work done on it ;)

My intention was never to abandon state, but to merely suggest that it
is big enough to be its own project (independent of the http server).

Can we get a list of problem areas which we feel are preventing macid
from being complete? How can we take macid "to the next level"?

Here are my initial thoughts on problem areas for a developer:

Lack of tools to administer your state (i.e., command-line tools with
which you can do maintenance tasks such as checkpointing). Some
maintenance tasks should be generalizable.

Easier data/process integration outside of the app process. This could
preferably be accomplished with some syb REST facilities. Ideally
everything (including crontab-like tasks) would be tweakable within
your app, but there are many real-world situations where you
must build integration with external systems -- maybe even *gasp* PHP
applications ;)

Lack of visibility into data during debugging and/or troubleshooting
in production systems, unless you explicitly build an interface. (Yes,
I know most of you develop perfect applications which will never
require this ;) ) The easiest mechanism I can think of is a CSV or XML
dump. OTOH, if you have a very large dataset even this becomes
unwieldy. Another possibility would be to build automagical paginated
scaffolding (reminiscent of Rails) -- again with syb, although with
free-form data types perhaps autopagination is not possible.

My 2c
--
Sent from my mobile device

Need somewhere to put your code? http://patch-tag.com
Want to build a webapp? http://happstack.com

stepcut

Jun 17, 2009, 9:05:51 PM
to HAppS
On Jun 17, 7:34 pm, Matthew Elder <m...@mattelder.org> wrote:

> Lack of tools to administer your state (i.e., command-line tools with
> which you can do maintenance tasks such as checkpointing). Some
> maintenance tasks should be generalizable.

In order to force a checkpoint, the running server process needs to
have some internal mechanism that will cause it to do a checkpoint.
So, I think the right mindset is to create an admin library that you
can choose to link into your application and which provides this type
of functionality. This library could then listen on some port on
localhost for incoming telnet connections so you could just telnet in.
Or it could open a file socket and an external command-line interface
could issue commands, etc. The key is that most of this functionality
would be added to your app by just importing another library and
adding something like:

...
sid <- forkIO $ simpleHTTP nullConf myImpl
aid <- forkIO $ adminConsole
...

to your main.
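
For concreteness, here is a rough sketch of what such an adminConsole
could look like, using the plain Network module. The checkpoint action
is passed in as a parameter rather than guessing at the state API, so
the call in main would become something like
forkIO (adminConsole myCheckpointAction):

import Control.Concurrent (forkIO)
import Control.Monad (forever)
import Network (PortID(PortNumber), accept, listenOn)
import System.IO (Handle, hClose, hGetLine, hPutStrLn)

-- Listen on port 8001 (a real version should bind to localhost only)
-- and run one short admin session per telnet connection. The actions
-- are passed in, so this knows nothing about the state layer itself.
adminConsole :: IO () -> IO ()
adminConsole doCheckpoint = do
  sock <- listenOn (PortNumber 8001)
  forever $ do
    (h, _host, _port) <- accept sock
    forkIO (session h)
  where
    session :: Handle -> IO ()
    session h = do
      cmd <- hGetLine h
      case words (filter (/= '\r') cmd) of   -- telnet sends CRLF line endings
        ["checkpoint"] -> doCheckpoint >> hPutStrLn h "checkpoint done"
        ["quit"]       -> hPutStrLn h "bye"
        _              -> hPutStrLn h "unknown command"
      hClose h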

> Easier data/process integration outside of the app process. This could
> preferably be accomplished with some syb REST facilities. Ideally
> everything (including crontab-like tasks) would be tweakable within
> your app, but there are many real-world situations where you
> must build integration with external systems -- maybe even *gasp* PHP
> applications ;)

Not really clear how this would work. Happstack applications are very
broad in nature. They may not even include an HTTP server. They may
have multiple independent cron threads running...

> Lack of visibility into data during debugging and/or troubleshooting
> in production systems, unless you explicitly build an interface. (Yes,
> I know most of you develop perfect applications which will never
> require this ;) ) The easiest mechanism I can think of is a CSV or
> XML dump.

As far as I know, this mostly already exists. Anything that can be
serialized to a checkpoint can also be serialized as XML. In fact, the
checkpoint files used to be XML.

> OTOH, if you have a very large dataset even this becomes unwieldy.
> Another possibility would be to build automagical paginated
> scaffolding (reminiscent of Rails) -- again with syb, although with
> free-form data types perhaps autopagination is not possible.

You could render the state data as a dynamically expandable tree...

- jeremy

Vagif Verdi

Jun 17, 2009, 11:47:12 PM
to HAppS
"Look, other Big Guys have problems with MySQL. This proves that
happstack scales!"

It does not work that way :))

Regis Saint-Paul

Jun 18, 2009, 5:21:07 AM
to HA...@googlegroups.com
> "Look, other Big Guys have problems with MySQL. This proves that
> happstack scales!"


Jeremy's note is far more subtle than that, and mainly argues for
minimizing disk access. However, his message arguably mixes different
issues which call for different solutions.

For instance, the solutions that can be used to scale an application
which is mostly about content delivery (like the Facebook image case,
where images are written once and retrieved many times) are very
different from the solutions you'd want for scaling an application
that is mainly writes, or where you need some consistency guarantees
(Amazon-style e-commerce). In particular, scaling transactional
systems requires abandoning consistency, because it cannot be achieved
at the same time as availability and partition tolerance (this is
known as Brewer's conjecture [1]).

So this is not so much an issue of MySQL vs. memory as a more
general algorithmic problem of deciding which properties the system
should have. And coming back to happstack: if scalability is desired,
then partition tolerance and reliability are needed, and hence there
needs to be some way to handle inconsistency in the state. The
alternative is to forget about global state and assume scalability
based on sharding the clients, which restricts the range of
applications.

I have no clue what Happstack's objective is in that matter, and this
should be clarified. It is not possible to just say it "scales", as
this means different things depending on the kind of applications
envisioned (and remember: no silver bullet, you won't address all
cases).

Best,
Regis

[1] Gilbert, Seth, and Nancy Lynch. 2002. “Brewer's Conjecture and the
Feasibility of Consistent, Available, Partition-Tolerant Web Services.”
ACM SIGACT News 33(2).

Gregory Collins

Jun 18, 2009, 5:51:59 AM
to HA...@googlegroups.com
stepcut <jer...@n-heptane.com> writes:

> Hello,
>
> This is in response to comments in the future of happstack thread, but
> I hope that it will generate enough discussion that it will deserve
> its own thread.
>
> I have seen a number of people (in that thread and other places)
> express the feeling that happstack's persistence layer is fine for
> small applications, but they are concerned that it does not scale to
> midsize applications, and that they prefer to use something like
> MySQL.

SQL doesn't scale either -- eventually "single write master" means you
have to give up ACID & relational joins, as you mentioned.

Everyone running truly massive applications seems to be moving to
distributed key-value stores. Incidentally, I'm working (very slowly) on
one, based on top of Tokyo Tyrant, similar in design to this:

http://opensource.plurk.com/LightCloud/

G.
--
Gregory Collins <gr...@gregorycollins.net>

Duncan Coutts

Jun 18, 2009, 8:40:45 AM
to HAppS
On Jun 18, 12:39 am, stepcut <jer...@n-heptane.com> wrote:
> It seems like what they really want is a system which starts with all
> the data in RAM (where they want it) and gives them explicit control
> over taking data out of RAM and storing it on disk based on the usage
> patterns and metrics that are specific to their site. (And, when
> storing to disk, they would probably forgo having a normal filesystem,
> instead just storing block numbers in the RAM based persistence
> layer).

My small contribution to this is:

http://code.haskell.org/hackage-server/Distribution/Server/Util/BlobStorage.hs

It's designed to work with the happs state stuff. It stores blobs by
their content hash. The id (hash) is kept in the normal RAM-based
persistence layer. It'd be good to work out an API that handles not
only filesystem persistence but also things like S3 or MemcacheDB and
other similar things.

The API looks like:

-- | A persistent blob storage area. Blobs can be added and retrieved
-- but not removed or modified.
--
data BlobStorage

-- | An id for a blob. The content of the blob is stable.
--
data BlobId

-- | Opens an existing or new blob storage area.
--
open :: FilePath -> IO BlobStorage

-- | Add a blob into the store. The result is a 'BlobId' that can be
-- used later with 'fetch' to retrieve the blob content.
--
-- * This operation is idempotent. That is, adding the same content
-- again gives the same 'BlobId'.
--
add :: BlobStorage -> ByteString -> IO BlobId

-- | Retrieve a blob from the store given its 'BlobId'.
--
-- * The content corresponding to a given 'BlobId' never changes.
--
-- * The blob must exist in the store or it is an error.
--
fetch :: BlobStorage -> BlobId -> IO ByteString


There's one extra variant on 'add' that allows some transformation or
checking to be done on the blob before it is allocated an id and
atomically stored into the 'BlobStorage'. I'm not sure the API for
this is perfect yet. It's intended for uploads: stream the upload into
a temp file in the blob storage area, do some checking on it, possibly
transform or reject it, but if it's ok then atomically swap it into
the persistent storage area.
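
In case it helps, here is a minimal usage sketch against that API
(assuming the module exports open/add/fetch as listed above, and that
the ByteString involved is the lazy one; adjust the import if it is
strict):

import qualified Data.ByteString.Lazy.Char8 as BS
import Distribution.Server.Util.BlobStorage (add, fetch, open)

-- Store a blob, keep only the BlobId in the RAM-based state,
-- and get the content back later via that id.
example :: IO ()
example = do
  store  <- open "state/blobs"              -- directory for the blob store
  blobId <- add store (BS.pack "hello, blobs")
  body   <- fetch store blobId
  BS.putStrLn body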

(Yes, I know it uses MD5 and should use something stronger; at the
time of writing, the pureMD5 package provided a simple MD5 hash type
with a Binary instance and I could not find an SHA equivalent.)

Duncan

Justin T. Sampson

Jun 18, 2009, 1:53:25 PM
to HA...@googlegroups.com
On Thu, Jun 18, 2009 at 2:21 AM, Regis Saint-Paul <regis.s...@gmail.com> wrote:

> I have no clue what Happstack's objective is in that matter, and this
> should be clarified. It is not possible to just say it "scales", as
> this means different things depending on the kind of applications
> envisioned (and remember: no silver bullet, you won't address all
> cases).

I'll put a somewhat different spin on the matter. I work on Prevayler and haven't used HAppS/Happstack, but I think similar logic applies.

Basically, it is vastly simpler to develop and deploy an application based on Prevayler than to develop and deploy an application based on a relational database. It happens to also result in very good performance for a certain non-trivial class of applications. I think that class is larger than most people would guess, but that's tangential to my point here.

Furthermore, a relational database is only appropriate for a particular class of applications as well -- perhaps those with a lot of data and complicated queries but not really vast numbers of users.

Finally, by the principle of avoiding premature optimization, I really can't be sure in what ways my particular application will need to scale until I start observing real users on a deployed system. Therefore, I'm inclined to start with an approach that keeps my development and deployment as simple as possible, and which -- as a bonus -- I know has at least a decent chance of working well. I'll get my app out in the real world sooner, start getting real feedback, and have a cleaner basis for optimization going forward.

Cheers,
Justin
