== High Level Framework ==
* HAppS-State -- Global in-memory Haskell state with ACID guarantees.
Unplug your machine, restart, and have your app recover to exactly
where it left off. HAppS-State spares you the marshalling, consistency,
and configuration headaches you would have if you used an external
DBMS for this purpose. Its elegant component model starts maintenance
routines in the background without you having to futz. It provides
event subscriptions that trigger IO actions and supports comet-style
or IRC-bot applications.
[Move Serialize into HAppS-Data]
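The recovery story above can be sketched in plain Haskell: state changes are serializable event values applied by a pure function, so replaying the log after a crash rebuilds the state deterministically. (The names below are illustrative, not the real HAppS-State API.)

```haskell
-- Sketch of the write-ahead event log behind the ACID guarantee.
import qualified Data.Map as Map
import Data.Map (Map)

type State = Map String Int

-- Every state change is a serializable event value...
data Event = Insert String Int | Delete String
  deriving (Show, Read)

-- ...applied by a pure function, so replaying the log after a crash
-- deterministically rebuilds the in-memory state.
apply :: Event -> State -> State
apply (Insert k v) = Map.insert k v
apply (Delete k)   = Map.delete k

replay :: [Event] -> State
replay = foldl (flip apply) Map.empty

main :: IO ()
main = print (replay [Insert "hits" 1, Insert "users" 2, Delete "hits"])
-- prints: fromList [("users",2)]
```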
* HAppS-IxSet -- Efficient relational queries on Haskell sets.
No need to marshal your data to/from an external relational database
just for efficient queries; just pick which parts of your data
structures you want indexed. IxSet relies on generics and TH to spare
you the boilerplate normally required for such tasks.
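The boilerplate IxSet spares you can be written out by hand for one type: a collection kept alongside secondary indexes, queried without leaving Haskell. (Illustrative only; the real library derives the indexes via generics and TH.)

```haskell
-- Hand-rolled secondary indexes: what HAppS-IxSet generates for you.
import qualified Data.Map as Map
import Data.Map (Map)

data Person = Person { name :: String, age :: Int }
  deriving (Show, Eq, Ord)

-- A set of Persons plus two indexes, all ordinary in-memory Maps.
data PersonSet = PersonSet
  { byName :: Map String [Person]
  , byAge  :: Map Int    [Person]
  }

emptySet :: PersonSet
emptySet = PersonSet Map.empty Map.empty

insertP :: Person -> PersonSet -> PersonSet
insertP p (PersonSet n a) =
  PersonSet (Map.insertWith (++) (name p) [p] n)
            (Map.insertWith (++) (age p)  [p] a)

-- An indexed query with no external database in sight.
sameAge :: Int -> PersonSet -> [Person]
sameAge x = Map.findWithDefault [] x . byAge
```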
* HAppS-Data -- Automagically convert Haskell types to/from XML and
name-value pairs for use in AJAX, web forms, and state serialization.
Supports url-encoded or mime/multipart, type-specific representations,
type-safe migrations, and public/private differentiation. Let Haskell's
compiler help you make sure that, as the shape of your application types
changes, you don't lose consistency with persistent internal state or
the web forms/cookies waiting to be posted.
[MessageWrap, XSLT, Serialize, and Crypto moved in.
Arguably GOpS, HList, and Atom should be moved out, but where to put them?]
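A hand-written version of the conversions HAppS-Data derives, for one hypothetical type: a record flattened to the name-value pairs a web form posts, and parsed back. (toPairs/fromPairs and the Login type are invented for illustration.)

```haskell
-- Hand-written form of the conversions HAppS-Data derives.
import Text.Read (readMaybe)

data Login = Login { user :: String, remember :: Bool }
  deriving (Show, Eq)

-- A record flattened to the name-value pairs a web form posts...
toPairs :: Login -> [(String, String)]
toPairs l = [("user", user l), ("remember", show (remember l))]

-- ...and parsed back, with failure made explicit rather than partial.
fromPairs :: [(String, String)] -> Maybe Login
fromPairs ps =
  Login <$> lookup "user" ps
        <*> (lookup "remember" ps >>= readMaybe)
```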
* HAppS-Server -- High performance dynamic web services framework.
Comes with an elegant monadic framework for transforming requests into
responses. Use it either as a public-facing web server or behind a
caching proxy via FastCGI. Supports long transactions for AJAX/comet apps.
[Move MessageWrap and XSLT to HAppS-Data. Move DNS client code out from
this repo to HAppS-DNS. Move SMTP and UDP code to HAppS-Network?]
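The monadic request-to-response idea can be sketched with simplified types (not the real HAppS-Server signatures): a handler either claims a request or passes, and handlers compose with <|>.

```haskell
-- A handler either claims a request or passes; <|> tries the next one.
import Control.Applicative ((<|>))

data Request  = Request  { path   :: String }
data Response = Response { status :: Int, body :: String }

type Handler = Request -> Maybe Response

hello :: Handler
hello req
  | path req == "/hello" = Just (Response 200 "hello world")
  | otherwise            = Nothing

notFound :: Handler
notFound _ = Just (Response 404 "not found")

-- Handlers compose left-to-right into one site.
serve :: Handler
serve req = hello req <|> notFound req
```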
* HAppS-Agents -- Sessions, FlashMsgs, HelpReqs, and maybe soon Mail,
IRCBot, and UserAccounts.
Stateful stuff that sometimes needs to take initiative. Common
components of many standard web apps.
[Currently stuff in HAppS.Server.Store. New hierarchy, HAppS.Agent.*
Could this have a better name?]
== Build/Deploy Support ==
* HAppS-Begin -- A quick-start skeleton on which to hang your HAppS
apps.
Do "darcs get http://happs.org/HAppS-Begin MyProject; cd MyProject;
sp ghc -ihaskell haskell/Main.hs --run". Then open your browser to
http://localhost:5000 and see "hello world." Modify the source and
reload the page. Let searchpath take care of auto-rebuilding and
restarting. Watch the shell window for error messages.
[Will probably strip out the stuff currently in HAppS-Begin and push it
into real examples and documentation.]
* SearchPath -- Automatic recursive import chasing across the internet
with auto-recompile/restart for easy all-in-one application builds and
development.
Supports building from Haskell source exposed via a URL-based
directory hierarchy, a TGZ file accessible at a URL, or any
shell-based version control system including darcs, svn, cvs, etc.
Have your
app rebuild and restart automatically as you change source so it feels
like you are working in an interpreted language.
== Infrastructure ==
* HAppS-DNS -- Pure Haskell DNS resolver.
No need to link to C libs and suffer build complexity.
[To be used by future mail code. Should have a separate existence.]
* HAppS-Protocols -- HTTP, SMTP, UDP, and IRC types and utility code.
[I hope to kill HAppS-Util]
----
Thoughts? Should we come up with better names?
-Alex-
PS HAppSy holidays! (couldn't help myself.)
I didn't comment because I didn't disagree. I find that people here
frequently do a better job explaining what is going on than I do. But
I'll expand on some of his themes here. People use rDBMS because they
provide
* global shared state
* ACID guarantees on that state
* relational operations to support interesting queries on that state
The problems with an RDBMS are
* time spent writing code to marshal data to/from the database
* latency/CPU time spent marshalling data back and forth
* query/update time as the database engages in unnecessary disk ops
* potential inconsistency between app code and database schema
* deployment complexity involved in making sure the database is created
* no support for graphs or other non-relational data structures
* limited control over how your data is indexed
* data has to move to code to do interesting things
* requires vertical scaling
So the reason people use separate session stores like memcached is
because that data doesn't need a global shared state (it can be local to
the individual web server -- if you have session stickiness), it doesn't
need ACID guarantees (it is only updated when the user hits the store),
and it is typically just a lookup table so no complex relational queries
are required.
The point of HAppS-State is to get global shared state, ACID guarantees
on state operations, and relational operations without those problems,
by keeping all state in memory in Haskell data structures, so that
* there is no marshalling between data formats in different processes
and across a network.
* the type system helps guarantee consistency
* you don't get arbitrary latency problems from 9ms disk seeks
* you don't have a separate database installation procedure before you
start using your app
* simple lookups are actually simple when you don't need relational stuff
* code moves to data so you can do more interesting things more directly
The other big bonus here is that we are implementing this using the
command pattern, so we can do multimaster replication of state with a
framework like Spread. Because code is separate from data in an RDBMS,
multimaster replication is much harder. Another big bonus is that we
can use the type system to allow you to partition your data across
multiple machines.
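The command-pattern point can be made concrete: if commands are plain data and the apply function is pure, every replica that folds the same command log computes the same state, which is what makes multimaster replication over a total-order broadcast plausible. (A sketch with invented names, not the real implementation.)

```haskell
-- Commands are plain data; 'apply' is pure, so replicas cannot diverge.
import qualified Data.Map as Map
import Data.Map (Map)

data Cmd = Deposit String Int | Withdraw String Int
  deriving (Show, Read)

apply :: Cmd -> Map String Int -> Map String Int
apply (Deposit  k n) = Map.insertWith (+) k n
apply (Withdraw k n) = Map.adjust (subtract n) k

run :: [Cmd] -> Map String Int
run = foldl (flip apply) Map.empty

cmds :: [Cmd]
cmds = [Deposit "alice" 5, Deposit "bob" 3, Withdraw "alice" 2]

-- Any replica folding the same broadcast log computes the same Map;
-- there is no hidden IO through which two replicas could differ.
```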
> As far as I remember, Darrin's opinion wasn't really commented upon.
> And the new description of the State is more towards his idea, but not
> completely.
> So, it would be nice to hear from you whether Darrin's point of view
> is applicable to the current incarnation of the State.
To be clear, multimaster and sharding are not built yet, but they seem
much easier than in traditional frameworks. The difficulty is making
this framework nice and usable from haskell while at the same time
taking advantage of these capabilities. A lot of the work has been a
search through API space to find a nice interface to these capabilities.
> More general question would be: is the State capable of handling all
> the application data in the system?
You would not use State to handle blobs. If you have video files, they
are better stored using e.g. amazon s3.
> I would distinguish 2 kinds of web application data: let's call it
> "temporary" and "permanent" (maybe you can find better names for it).
> Temporary data is data which is read and written fairly often in a
> short period of time. It could be also called "session data". Its
> structure is usually simple. Its volume is low and it lives a short
> life. So, RAM would be a good place to store it. As we have found in
> another thread, the State is particularly suited for storing such a
> data.
Disk is somewhat obsolete in large scale web operations because 9ms seek
times limit performance too much. If each request requires one seek
then you cap out at 100 requests per second per disk. Disk is much
cheaper than memory but it is much much slower. If you care at all
about performance you need to start thinking about how to keep all your
data in memory as often as possible. HAppS is designed for that.
> It is interesting for me whether the State can handle this kind of
> data at least as efficient and reliable as a database? I would guess
> that not, just because of a dramatic difference in man-hours spent on
> the State and, say, MySQL. But, let's hear what you say.
I don't see why not. The big advantage we have over MySQL is Haskell
and its type system. Referential transparency allows us to move code to
data in a way that is not really possible in other languages. For
example, it means that we can checkpoint in memory state without forcing
all operations to stop in order to maintain consistency. It also means
that we can replay haskell code from the log without concern that it
will behave differently.
> Additionally, it seems that because of the multimaster replication
> mechanism we need to distinguish the computations that just read the
> state from the computations that both read and write it. And, as far
> as I get it, distinguishing should be done on the application level.
> Now I wonder if it is possible to design an algorithm that wouldn't
> require the application programmers to distinguish reading
> computations from writing. I might be completely wrong here, so,
> please, explain.
Yes, conceptually perhaps we could differentiate between updates based
on whether they use put or modify, but in practice I think there is
value in forcing the user to differentiate between queries and updates,
the values they return, and the performance implications of each.
> What is the design motivation behind the current type of the
> ServerPart that doesn't seem to be a good candidate for using monad
> transformers?
Can you explain this one? The point of the new WebT monad is to support
monad transformers.
> Another thing that I'm desperately trying to grasp is the "high
> purpose" of HAppS. I guess, you have a clear vision of what HAppS is
> and what it is not.
To me, the purpose of HAppS is to eliminate major pain points and allow
you to focus on building your app and not infrastructure issues of
various sorts.
> And also, what would HAppS 1.0 feature. So, I
> would like to understand it.
* HAppS-State with multimaster and sharding and maybe multiple states
per process.
* HAppS-Server with the WebT monad design finished. Right now we are a
little in flux here, but quickly moving in the right direction. Also
would like http client and proxy code here.
* HAppS-Data with even less futzing, really cleaned up migrations.
* Zipless superscalable deployment on EC2
* A well integrated S3 client lib for blobs
* modules refactored into repositories correctly
There is more stuff probably, but qualitatively, HAppS 1.0 will allow me
to write a facebook or an ebay style app in a natural style for a single
box and have it scale enough to support the load they currently have
without substantial change at substantially less cost.
-Alex-
You use HAppS-IxSet
>> You would not use State to handle blobs. If you have video files, they
>> are better stored using e.g. amazon s3.
>
> What if I want to use my own storage capacities (hard drives) instead
> of S3 (or any other hosting)?
Thats fine too. I just think S3 is a better option for most people.
You are totally free to use the filesystem or whatever you like.
>> Disk is somewhat obsolete in large scale web operations because 9ms seek
>> times limit performance too much. If each request requires one seek
>> then you cap out at 100 requests per second per disk. Disk is much
>> cheaper than memory but it is much much slower. If you care at all
>> about performance you need to start thinking about how to keep all your
>> data in memory as often as possible. HAppS is designed for that.
>
> Does that mean that Haskell also deals with paging? Hard to believe :)
> I mean, if the OS pages out a piece of data that you need most right
> now - and you're in the hard-drive-seeking business again, aren't you?
> Another concern is that in order to maintain consistency, you need to
> write all the changes to the state in the log file - right? Wouldn't
> it be a slowdown again?
No, you are limited to state that fits in RAM on a single box. When we
implement sharding you will be able to partition state across RAM on
multiple boxes. Writing to a logfile is much faster than seeking. You
can write 40MB per second to a log. With multimaster, depending on
your configuration, you may decide you don't need disk even for the log!
>> I don't see why not. The big advantage we have over MySQL is haskell
>> and its type system.
>
> I guess we will need to do comparative tests between HAppS-State and
> mysql-memory database engine.
With HAppS you don't pay for marshalling data from your language to
ODBC accross a socket and back. You also get the type system
guaranteeing that your app is not assuming a different schema from your
data.
>> Yes, conceptually perhaps we could differentiate between updates based
>> on whether they use put or modify, but in practice I think there is
>> value in forcing the user to differentiate between queries and updates
>> and the values they return and the performance implications of each.
>
> Well, yeah. But if I process an http request and I need to choose the
> right type for my function depending on whether I read the state or
> read/write the state - that's a bit inconvenient. Frankly speaking I
> don't know how is multimaster built/planned to be built, but are you
> sure there's no more user friendly way to optimize state manipulation?
It's no different than deciding between GET and POST in HTTP. With
HAppS-State it is just called query and update.
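The query/update distinction can be expressed as two types, one read-only and one state-threading, which is roughly the flavor of the split being argued for here (an illustrative encoding, not the HAppS-State signatures):

```haskell
-- Read-only vs. state-threading operations as distinct types.
newtype Query  s a = Query  (s -> a)       -- safe to serve from any replica
newtype Update s a = Update (s -> (a, s))  -- must be logged and replicated

runQuery :: Query s a -> s -> a
runQuery (Query f) = f

runUpdate :: Update s a -> s -> (a, s)
runUpdate (Update f) = f

-- e.g. for a counter state:
peek :: Query Int Int
peek = Query id

bump :: Update Int Int
bump = Update (\n -> (n, n + 1))
```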
>> Can you explain this one? The point of the new WebT monad is to support
>> monad transformers.
>
> Yeah, that one is a transformer itself. Nice. But why is it in the
> 'alternative' http?!
It is now the standard. That is just the module name. It is also new.
>> * Zipless superscalable deployment on EC2
>> * A well integrated S3 client lib for blobs
>
> That looks like fun features. But does/would HAppS support using my
> own server farm and disk storage as well?
We don't know how to write code to add a server to your server farm so
you would have to do that yourself.
-Alex-
How do you know you aren't going to run out of disk?
>> We don't know how to write code to add a server to your server farm so
>> you would have to do that yourself.
>
> Ok :) But I thought multimaster/sharding is just for executing the
> application on several machines and balancing the workload between
> them.
Yes, but there are a lot of subtle deployment assumptions required to
get it all working properly. You will be able to do it manually; it is
just going to hurt more.
> BTW, are you basing your multimaster and sharding implementation on
> well know algorithms or developing new ones?
Multimaster will be based on http://spread.org. Sharding will be based
on adding a split class that is the inverse of Monoid.
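One possible reading of "a split class that is the inverse of Monoid", as a sketch (this class is speculative; it did not exist at the time of writing): splitting state into shards whose mconcat gives back the original.

```haskell
{-# LANGUAGE FlexibleInstances #-}
-- Speculative 'split' as an inverse of Monoid: shards mconcat back
-- to the original. Law: mconcat (split n x) == x.
import qualified Data.Map as Map
import Data.Map (Map)
import Data.Char (ord)

class Monoid a => Splittable a where
  split :: Int -> a -> [a]

-- Shard a Map by hashing keys into n buckets; the buckets are
-- disjoint, so their union reassembles the original exactly.
instance Splittable (Map String Int) where
  split n m =
    [ Map.filterWithKey (\k _ -> bucket k == i) m | i <- [0 .. n - 1] ]
    where bucket k = sum (map ord k) `mod` n
```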
-Alex-
You can't. Data has to be somewhere, and if RAM is full it has to go
somewhere else. How can you have more state than will fit in RAM and
*not* swap?
Stefan
It does lead to some interesting questions.
I think it was the Prevayler people who did some informal tests on
what would happen performance-wise if you had enough data to force
swapping. Their conclusion was: don't do that. IIRC, the Java garbage
collector (at the time, some years ago) didn't perform well for
partially swapped processes. In practice you got a lot of thrashing.
I wonder what would happen with HAppS state? I'm too lazy to find out
for myself but eventually somebody's going to do the experiment, maybe
by accident. :-)
--
Darrin
It is by far the easiest to only use Haskell state (that is, in-memory
structures). As Alex said, it is type-safe, it eliminates the need for
serialization and we know how to scale it.
That said, there's no theoretical reason why you shouldn't be able to
use your favorite DB in conjunction with HAppS-State. If you can
satisfy the following things, you'll get the best of both worlds (ie.
scalability, reliability and disk-based storage):
* the DB should not store anything permanently.
* the entire contents of the DB needs to be serializable. That is,
DBConn -> Lazy.ByteString, Lazy.ByteString -> DBConn.
* the contents of the DB needs to be splittable and mergeable.
Splitting usually involves some bit operations over an ID number, and
merging is even more trivial.
Satisfying those requirements is very easy with pure Haskell
structures. However, it should be doable with, say, SQLite.
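Those three requirements could be captured as a class a storage backend would implement; the class and all its method names below are hypothetical, invented to restate the requirements as types:

```haskell
-- Hypothetical interface a DB backend would satisfy (invented names).
import qualified Data.ByteString.Lazy as L

class StateBackend conn where
  freeze  :: conn -> IO L.ByteString   -- entire contents serializable out
  thaw    :: L.ByteString -> IO conn   -- ...and back in
  splitDB :: Int -> conn -> IO [conn]  -- splittable, e.g. by bits of an ID
  mergeDB :: [conn] -> IO conn         -- and mergeable
```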
--
Cheers,
Lemmih
Same goes for an RDBMS. If the disk is full then maybe the service goes
down until it's remedied. Or maybe you get no new writes, which causes
another service to die. ;-)
The most likely scenario is that once you exceed physical ram size
with HAppS state you will experience thrashing which will render your
service unusable, essentially down. Tests might prove otherwise, but I
doubt it.
Until someone proves otherwise, you really have to assume that
physical RAM is the upper bound on data size and live with that, and
monitor your production environment the same as you would your normal
databases.
--
Darrin
As a side note, it might be interesting to compare conventional DBs to
a disk-based version of IxSet.
HAppS-State takes care of reliability, transactions and concurrency.
Databases duplicate this functionality and should therefore have a
larger overhead than a plain IxSet.
--
Cheers,
Lemmih
If you are running e.g. a social networking app where every request
involves status lookups on all the user's friends. Now you need 1 seek
to get the friend list of the user and some number of seeks to get
current information on all of them because friends are randomly
distributed across all your disks and therefore unlikely to be all in
cache. If you are at 10 seeks per request and 10ms per seek then you
can handle 10 requests per second per disk. If you want to handle 1000
requests per second you need 100 disks...
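The back-of-envelope arithmetic above, spelled out:

```haskell
-- Seek-bound throughput: req/s/disk = 1000ms / (seeks * ms per seek).
seekMs, seeksPerRequest, targetReqPerSec :: Double
seekMs          = 10
seeksPerRequest = 10
targetReqPerSec = 1000

reqPerSecPerDisk :: Double
reqPerSecPerDisk = 1000 / (seeksPerRequest * seekMs)  -- 10 req/s/disk

disksNeeded :: Double
disksNeeded = targetReqPerSec / reqPerSecPerDisk      -- 100 disks
```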
If you are running an auction site where you make money from the long
tail of auctioned items and every request involves a lookup on all the
auctions in which the user is selling or interested in buying...
If you are running a web search engine and you are doing a lot of
computation on your state....
So if your app has a lot of data but not a lot of activity then perhaps
disk makes sense. But you also need to account for the cost of disk/dbms
in terms of application complexity including
* developer cost writing/managing code to marshal data to/from your
external storage, and the increased risk of your database schema
getting out of sync with your app code.
* DBA costs from monitoring the database and doing query optimization
to reduce disk seeks, etc.
* sysadmin costs as your deployment complexity grows: you need multiple
disks, a vertically scaled database server, etc.
And there is the likelihood that in the end you use memcached to solve
these problems and end up paying for the memory anyway.
To be clear, I am not saying that you should never use disk. I am
saying for modern web applications you rapidly reach a point where
performance matters much more than storage and given the extra labor
cost of disk, you really need to have a lot of data even to begin
thinking about it. And when you do, you are probably going to have to
manage disk vs ram storage explicitly anyway because it is so app specific.
-Alex-
It should be noted that disk latencies are not completely additive. If
one arbitrary request takes about 9ms, then twenty arbitrary requests
need only take 9ms, for we can sort the requests by cylinder address -
then the seek arm's motion will be monotone, and as such no greater than
the size of the disk. Many operating systems will do this optimization
(but only if the requests are all simultaneously visible, thus
necessitating either multiple processes or some form of asynchronous
I/O).
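Stefan's point, concretely: if requests are served in sorted cylinder order, total arm travel is bounded by one sweep of the disk, regardless of how many requests there are. A sketch:

```haskell
-- Elevator ordering: total arm travel over sorted cylinder addresses
-- is max - min, i.e. at most one sweep of the disk.
import Data.List (sort)

armTravel :: [Int] -> Int
armTravel cs = sum (zipWith (\a b -> abs (a - b)) cs (tail cs))

-- For any non-empty request list:
--   armTravel (sort cs) == maximum cs - minimum cs
-- whereas the arrival order can travel far more.
```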
Stefan (who knows much less about Web apps than he does about computer
architecture, and is well aware that the preceding may be irrelevant)
> To be clear, I am not saying that you should never use disk. I am
> saying for modern web applications you rapidly reach a point where
> performance matters much more than storage and given the extra labor
> cost of disk, you really need to have a lot of data even to begin
> thinking about it. And when you do, you are probably going to have to
> manage disk vs ram storage explicitly anyway because it is so app specific.
This is the most important point for me. The labor cost of getting an
app up and running with Prevayler (I haven't used HAppS yet) is so
much less in my experience than the labor cost of getting an app up
and running with a relational database, object-relational mapping,
etc., that you have plenty of time and money left over for
optimization. And you're starting from a blazing fast and extremely
clean implementation, so optimizing for your particular circumstances
is going to be way more effective. Starting out with a relational
database is one of the more insidious forms of premature optimization.
Until your data is pretty huge, you know it's going to be slower and
harder to maintain -- you're just hoping that your app will be
successful, and that the data needs will be just the kinds of data
needs that a relational database will help with.
In my experience, if your app always has the simplest architecture for
a specific performance target, increasing that performance target by
an order of magnitude tends to take a fixed labor cost. On one
project, we planned the development of scaling to specific performance
targets just like any other features. We asked the project sponsor for
some numbers describing how many users he hoped for at various stages
of launch, labeled A, B, and C. For early alpha testing, we scheduled
support for "1/10% of A". Then there was "1% of A", "10% of A", "A",
"B", "C"...
And, to be really extreme, we made one other feature explicit that
most projects don't: "System remembers data after restart." Before we
got around to that feature, everything was just clean, pure objects in
memory. That let us get pretty far with the data model and user
interface without worrying about persistence at all, which I believe
made us pretty darned productive. When that feature came up in the
development schedule, we dropped in Prevayler in just a couple of days
(most of which was intensive testing just to make sure it really
worked -- even *we* didn't quite believe it could be that easy!). We
even contributed some enhancements back to Prevayler, which ended up
in the Prevayler 2 release.
After that, we had some more nice productive time before getting
around to the "1/10% of A" feature. It turned out that our
super-simple code using Prevayler already met that target, so the
labor required was just some time spent on more automated testing.
Each of the next steps (tackled weeks or months apart) was a similar
fixed piece of work, a couple developers taking less than a week for
10x improvement. For "1% of A", I think it was user-uploaded photos
that started growing past memory bounds (and we weren't even testing
against our deployment server, which would have a couple gigs of
memory -- we were giving ourselves the harder goal of meeting each
performance target on a dev machine with maybe one gig and lots of dev
tools running). So, as Alex described, we just moved those "blobs"
outside of memory and stored them as files on disk. The nice thing
here is that it was such a tiny part of the code accounting for such a
huge chunk of data -- the vast majority of our code could stay
extremely simple and cleanly object-oriented.
For "10% of A" we identified another single kind of data that was
growing out of bounds. We were pretty happy about it, too -- we had
used the simplest and clearest data structure for that data, which had
worked just fine up until then; and it turned out that we didn't have
to move it out of memory at all, we just switched to a slightly less
simple data structure which vastly reduced its memory usage. (It was a
LinkedHashMap<Long, Long> which we replaced with a String "key=value"
encoding.) Again, the vast majority of code got to stay simple and
clean, with just one tiny hotspot being optimized.
I left the project at that point, but we already had plenty of ideas
for additional steps -- using shards, putting more data in blobs on
disk, etc. My point is, we were able to focus our development effort
very effectively on exactly what our particular app needed. And even
if there did come a time when it seemed appropriate to put some of the
data into a database, I'm confident that the vast majority of code
would still be dealing with clean, simple objects in memory.
(There's one other reason that people often put data in a database --
because there's some established enterprise reporting engine. In that
case, the database is really being treated as a full-blown app itself,
which I would just treat as I would treat any external app
integration: Asynchronously, without creating any hard dependencies
between my app and the database. Exporting data from time to time to a
database is much easier than trying to build my whole app on top of
one.)
Cheers,
Justin
New event: <<GetComponent Int>>
HTTP request failed with: src/HAppS/Data/Xml/HaXml.hs:17:19-42:
Irrefutable pattern failed for pattern Text.XML.HaXml.Types.CElem el'
New event: <<CleanSessions [Char]>>
Any clue as to what can be done?
Thanks much,
Sterl.
AllIn.hs is simply broken. HAppS-HEAD is changing too rapidly for us
to keep the examples working.
It should be noted that stability is important to us and we're working
hard on nailing down the back-end in a robust manner.
--
Cheers,
Lemmih
We are trying to make it more solid so we can shift to multimaster.
-Alex-