== High Level Framework ==
* HAppS-State -- Global in-memory Haskell state with ACID guarantees.
Unplug your machine, restart, and have your app recover to exactly
where it left off. HAppS-State spares you the marshalling, consistency,
and configuration headaches you would have if you used an external
DBMS for this purpose. Its elegant component model starts maintenance
routines in the background without you having to futz. It provides
event subscriptions that trigger IO actions and supports comet-style
or IRC-bot applications.
[Move Serialize into HAppS-Data]
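The recovery story above can be sketched in plain Haskell: state changes are serializable event values applied by a pure function, so replaying the log after a crash rebuilds the state deterministically. (The names below are illustrative, not the real HAppS-State API.)

```haskell
-- Sketch of the write-ahead event log behind the ACID guarantee.
import qualified Data.Map as Map
import Data.Map (Map)

type State = Map String Int

-- Every state change is a serializable event value...
data Event = Insert String Int | Delete String
  deriving (Show, Read)

-- ...applied by a pure function, so replaying the log after a crash
-- deterministically rebuilds the in-memory state.
apply :: Event -> State -> State
apply (Insert k v) = Map.insert k v
apply (Delete k)   = Map.delete k

replay :: [Event] -> State
replay = foldl (flip apply) Map.empty

main :: IO ()
main = print (replay [Insert "hits" 1, Insert "users" 2, Delete "hits"])
-- prints: fromList [("users",2)]
```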
* HAppS-IxSet -- Efficient relational queries on Haskell sets.
No need to marshal your data to/from an external relational database
just for efficient queries; just pick which parts of your data
structures you want indexed. IxSet relies on generics and TH to spare
you the boilerplate normally required for such tasks.
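The boilerplate IxSet spares you can be written out by hand for one type: a collection kept alongside secondary indexes, queried without leaving Haskell. (Illustrative only; the real library derives the indexes via generics and TH.)

```haskell
-- Hand-rolled secondary indexes: what HAppS-IxSet generates for you.
import qualified Data.Map as Map
import Data.Map (Map)

data Person = Person { name :: String, age :: Int }
  deriving (Show, Eq, Ord)

-- A set of Persons plus two indexes, all ordinary in-memory Maps.
data PersonSet = PersonSet
  { byName :: Map String [Person]
  , byAge  :: Map Int    [Person]
  }

emptySet :: PersonSet
emptySet = PersonSet Map.empty Map.empty

insertP :: Person -> PersonSet -> PersonSet
insertP p (PersonSet n a) =
  PersonSet (Map.insertWith (++) (name p) [p] n)
            (Map.insertWith (++) (age p)  [p] a)

-- An indexed query with no external database in sight.
sameAge :: Int -> PersonSet -> [Person]
sameAge x = Map.findWithDefault [] x . byAge
```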
* HAppS-Data -- Automagically convert Haskell types to/from XML and
name-value pairs for use in AJAX, web forms, and state serialization.
Supports url-encoded or mime/multipart, type-specific representations,
type-safe migrations, and public/private differentiation. Let Haskell's
compiler help you make sure that, as the shape of your application types
changes, you don't lose consistency with persistent internal state or
the web forms/cookies waiting to be posted.
[MessageWrap, XSLT, Serialize, and Crypto moved in.
Arguably GOpS, HList, and Atom should be moved out, but where to put them?]
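A hand-written version of the conversions HAppS-Data derives, for one hypothetical type: a record flattened to the name-value pairs a web form posts, and parsed back. (toPairs/fromPairs and the Login type are invented for illustration.)

```haskell
-- Hand-written form of the conversions HAppS-Data derives.
import Text.Read (readMaybe)

data Login = Login { user :: String, remember :: Bool }
  deriving (Show, Eq)

-- A record flattened to the name-value pairs a web form posts...
toPairs :: Login -> [(String, String)]
toPairs l = [("user", user l), ("remember", show (remember l))]

-- ...and parsed back, with failure made explicit rather than partial.
fromPairs :: [(String, String)] -> Maybe Login
fromPairs ps =
  Login <$> lookup "user" ps
        <*> (lookup "remember" ps >>= readMaybe)
```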
* HAppS-Server -- High performance dynamic web services framework.
Comes with an elegant monadic framework for transforming requests into
responses. Use it either as a public-facing web server or behind a
caching proxy via FastCGI. Supports long transactions for AJAX/comet apps.
[Move MessageWrap and XSLT to HAppS-Data. Move DNS client code out from
this repo to HAppS-DNS. Move SMTP and UDP code to HAppS-Network?]
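The monadic request-to-response idea can be sketched with simplified types (not the real HAppS-Server signatures): a handler either claims a request or passes, and handlers compose with <|>.

```haskell
-- A handler either claims a request or passes; <|> tries the next one.
import Control.Applicative ((<|>))

data Request  = Request  { path   :: String }
data Response = Response { status :: Int, body :: String }

type Handler = Request -> Maybe Response

hello :: Handler
hello req
  | path req == "/hello" = Just (Response 200 "hello world")
  | otherwise            = Nothing

notFound :: Handler
notFound _ = Just (Response 404 "not found")

-- Handlers compose left-to-right into one site.
serve :: Handler
serve req = hello req <|> notFound req
```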
* HAppS-Agents -- Sessions, FlashMsgs, HelpReqs, and maybe soon Mail,
IRCBot, and UserAccounts.
Stateful stuff that sometimes needs to take initiative. Common
components of many standard web apps.
[Currently stuff in HAppS.Server.Store. New hierarchy, HAppS.Agent.*
Could this have a better name?]
== Build/Deploy Support ==
* HAppS-Begin -- A quick-start skeleton on which to hang your HAppS
apps.
Do "darcs get http://happs.org/HAppS-Begin MyProject; cd MyProject;
sp ghc -ihaskell haskell/Main.hs --run". Then open your browser to
http://localhost:5000 and see "hello world." Modify the source and
reload the page. Let searchpath take care of auto-rebuilding and
restarting. Watch the shell window for error messages.
[Will probably strip out the stuff currently in HAppS-Begin and push it
into real examples and documentation.]
* SearchPath -- Automatic recursive import chasing across the internet
with auto-recompile/restart for easy all-in-one application builds and
development.
Supports building from Haskell source exposed via a URL-based
directory hierarchy, a TGZ file accessible at a URL, or any
shell-based version control system including darcs, svn, cvs, etc.
Have your
app rebuild and restart automatically as you change source so it feels
like you are working in an interpreted language.
== Infrastructure ==
* HAppS-DNS -- Pure Haskell DNS resolver.
No need to link to C libs and suffer build complexity.
[To be used by future mail code. Should have a separate existence.]
* HAppS-Protocols -- HTTP, SMTP, UDP, and IRC types and utility code.
[I hope to kill HAppS-Util]
----
Thoughts? Should we come up with better names?
-Alex-
PS HAppSy holidays! (couldn't help myself.)
I didn't comment because I didn't disagree. I find that people here
frequently do a better job explaining what is going on than I do. But
I'll expand on some of his themes here. People use rDBMS because they
provide
* global shared state
* ACID guarantees on that state
* relational operations to support interesting queries on that state
The problems with an RDBMS are
* time spent writing code to marshal data to/from the database
* latency/CPU time spent marshalling data back and forth
* query/update time as the database engages in unnecessary disk ops
* potential inconsistency between app code and database schema
* deployment complexity involved in making sure the database is created
* no support for graphs or other non-relational data structures
* limited control over how your data is indexed
* data has to move to code to do interesting things
* requires vertical scaling
So the reason people use separate session stores like memcached is
because that data doesn't need a global shared state (it can be local to
the individual web server -- if you have session stickiness), it doesn't
need ACID guarantees (it is only updated when the user hits the store),
and it is typically just a lookup table so no complex relational queries
are required.
The point of HAppS-State is to get global shared state, ACID guarantees
on state operations, and relational operations without those problems,
by keeping all state in memory in Haskell data structures, so that
* there is no marshalling between data formats in different processes
and across a network.
* the type system helps guarantee consistency
* you don't get arbitrary latency problems from 9ms disk seeks
* you don't have a separate database installation procedure before you
start using your app
* simple lookups are actually simple when you don't need relational stuff
* code moves to data so you can do more interesting things more directly
The other big bonus here is that we are implementing this using the
command pattern, so we can do multimaster replication of state with a
framework like Spread. Because code is separate from data in an RDBMS,
multimaster replication is much harder. Another big bonus is that we
can use the type system to allow you to partition your data across
multiple machines.
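The command-pattern point can be made concrete: if commands are plain data and the apply function is pure, every replica that folds the same command log computes the same state, which is what makes multimaster replication over a total-order broadcast plausible. (A sketch with invented names, not the real implementation.)

```haskell
-- Commands are plain data; 'apply' is pure, so replicas cannot diverge.
import qualified Data.Map as Map
import Data.Map (Map)

data Cmd = Deposit String Int | Withdraw String Int
  deriving (Show, Read)

apply :: Cmd -> Map String Int -> Map String Int
apply (Deposit  k n) = Map.insertWith (+) k n
apply (Withdraw k n) = Map.adjust (subtract n) k

run :: [Cmd] -> Map String Int
run = foldl (flip apply) Map.empty

cmds :: [Cmd]
cmds = [Deposit "alice" 5, Deposit "bob" 3, Withdraw "alice" 2]

-- Any replica folding the same broadcast log computes the same Map;
-- there is no hidden IO through which two replicas could differ.
```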
> As far as I remember, Darrin's opinion wasn't really commented upon.
> And the new description of the State is more towards his idea, but not
> completely.
> So, it would be nice to hear from you whether Darrin's point of view
> is applicable to the current incarnation of the State.
To be clear, multimaster and sharding are not built yet, but they seem
much easier than in traditional frameworks. The difficulty is making
this framework nice and usable from haskell while at the same time
taking advantage of these capabilities. A lot of the work has been a
search through API space to find a nice interface to these capabilities.
> More general question would be: is the State capable of handling all
> the application data in the system?
You would not use State to handle blobs. If you have video files, they
are better stored using e.g. amazon s3.
> I would distinguish 2 kinds of web application data: let's call it
> "temporary" and "permanent" (maybe you can find better names for it).
> Temporary data is data which is read and written fairly often in a
> short period of time. It could be also called "session data". Its
> structure is usually simple. Its volume is low and it lives a short
> life. So, RAM would be a good place to store it. As we have found in
> another thread, the State is particularly suited for storing such a
> data.
Disk is somewhat obsolete in large scale web operations because 9ms seek
times limit performance too much. If each request requires one seek
then you cap out at 100 requests per second per disk. Disk is much
cheaper than memory but it is much much slower. If you care at all
about performance you need to start thinking about how to keep all your
data in memory as often as possible. HAppS is designed for that.
> It is interesting for me whether the State can handle this kind of
> data at least as efficient and reliable as a database? I would guess
> that not, just because of a dramatic difference in man-hours spent on
> the State and, say, MySQL. But, let's hear what you say.
I don't see why not. The big advantage we have over MySQL is Haskell
and its type system. Referential transparency allows us to move code to
data in a way that is not really possible in other languages. For
example, it means that we can checkpoint in memory state without forcing
all operations to stop in order to maintain consistency. It also means
that we can replay haskell code from the log without concern that it
will behave differently.
> Additionally, it seems that because of the multimaster replication
> mechanism we need to distinguish the computations that just read the
> state from the computations that both read and write it. And, as far
> as I get it, distinguishing should be done on the application level.
> Now I wonder if it is possible to design an algorithm that wouldn't
> require the application programmers to distinguish reading
> computations from writing. I might be completely wrong here, so,
> please, explain.
Yes, conceptually perhaps we could differentiate between updates based
on whether they use put or modify, but in practice I think there is
value in forcing the user to differentiate between queries and updates,
the values they return, and the performance implications of each.
> What is the design motivation behind the current type of the
> ServerPart that doesn't seem to be a good candidate for using monad
> transformers?
Can you explain this one? The point of the new WebT monad is to support
monad transformers.
> Another thing that I'm desperately trying to grasp is the "high
> purpose" of HAppS. I guess, you have a clear vision of what HAppS is
> and what it is not.
To me, the purpose of HAppS is to eliminate major pain points and allow
you to focus on building your app and not infrastructure issues of
various sorts.
> And also, what would HAppS 1.0 feature. So, I
> would like to understand it.
* HAppS-State with multimaster and sharding and maybe multiple states
per process.
* HAppS-Server with the WebT monad design finished. Right now we are a
little in flux here, but quickly moving in the right direction. Also
would like http client and proxy code here.
* HAppS-Data with even less futzing, really cleaned up migrations.
* Zipless superscalable deployment on EC2
* A well integrated S3 client lib for blobs
* modules refactored into repositories correctly
There is more stuff probably, but qualitatively, HAppS 1.0 will allow me
to write a facebook or an ebay style app in a natural style for a single
box and have it scale enough to support the load they currently have
without substantial change at substantially less cost.
-Alex-
You use HAppS-IxSet
>> You would not use State to handle blobs. If you have video files, they
>> are better stored using e.g. amazon s3.
>
> What if I want to use my own storage capacities (hard drives) instead
> of S3 (or any other hosting)?
Thats fine too. I just think S3 is a better option for most people.
You are totally free to use the filesystem or whatever you like.
>> Disk is somewhat obsolete in large scale web operations because 9ms seek
>> times limit performance too much. If each request requires one seek
>> then you cap out at 100 requests per second per disk. Disk is much
>> cheaper than memory but it is much much slower. If you care at all
>> about performance you need to start thinking about how to keep all your
>> data in memory as often as possible. HAppS is designed for that.
>
> Does that mean that Haskell also deals with paging? Hard to believe :)
> I mean, if the OS pages out a piece of data that you need most right
> now - and you're in the hard-drive-seeking business again, aren't you?
> Another concern is that in order to maintain consistency, you need to
> write all the changes to the state in the log file - right? Wouldn't
> it be a slowdown again?
No, you are limited to state that fits in RAM on a single box. When we
implement sharding you will be able to partition state across RAM on
multiple boxes. Writing to a logfile is much faster than seeking. You
can write 40MB per second to a log. With multimaster, depending on
your configuration, you may decide you don't need disk even for the log!
>> I don't see why not. The big advantage we have over MySQL is haskell
>> and its type system.
>
> I guess we will need to do comparative tests between HAppS-State and
> mysql-memory database engine.
With HAppS you don't pay for marshalling data from your language to
ODBC accross a socket and back. You also get the type system
guaranteeing that your app is not assuming a different schema from your
data.
>> Yes, conceptually perhaps we could differentiate between updates based
>> on whether they use put or modify, but in practice I think there is
>> value in forcing the user to differentiate between queries and updates
>> and the values they return and the performance implications of each.
>
> Well, yeah. But if I process an http request and I need to choose the
> right type for my function depending on whether I read the state or
> read/write the state - that's a bit inconvenient. Frankly speaking I
> don't know how is multimaster built/planned to be built, but are you
> sure there's no more user friendly way to optimize state manipulation?
It's no different than deciding between GET and POST in HTTP. With
HAppS-State it is just called query and update.
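The query/update distinction can be expressed as two types, one read-only and one state-threading, which is roughly the flavor of the split being argued for here (an illustrative encoding, not the HAppS-State signatures):

```haskell
-- Read-only vs. state-threading operations as distinct types.
newtype Query  s a = Query  (s -> a)       -- safe to serve from any replica
newtype Update s a = Update (s -> (a, s))  -- must be logged and replicated

runQuery :: Query s a -> s -> a
runQuery (Query f) = f

runUpdate :: Update s a -> s -> (a, s)
runUpdate (Update f) = f

-- e.g. for a counter state:
peek :: Query Int Int
peek = Query id

bump :: Update Int Int
bump = Update (\n -> (n, n + 1))
```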
>> Can you explain this one? The point of the new WebT monad is to support
>> monad transformers.
>
> Yeah, that one is a transformer itself. Nice. But why is it in the
> 'alternative' http?!
It is now the standard. That is just the module name. It is also new.
>> * Zipless superscalable deployment on EC2
>> * A well integrated S3 client lib for blobs
>
> That looks like fun features. But does/would HAppS support using my
> own server farm and disk storage as well?
We don't know how to write code to add a server to your server farm so
you would have to do that yourself.
-Alex-
How do you know you aren't going to run out of disk?
>> We don't know how to write code to add a server to your server farm so
>> you would have to do that yourself.
>
> Ok :) But I thought multimaster/sharding is just for executing the
> application on several machines and balancing the workload between
> them.
Yes, but there are a lot of subtle deployment assumptions required to
get it all working properly. You will be able to do it manually; it is
just going to hurt more.
> BTW, are you basing your multimaster and sharding implementation on
> well know algorithms or developing new ones?
Multimaster will be based on http://spread.org. Sharding will be based
on adding a split class that is the inverse of Monoid.
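One possible reading of "a split class that is the inverse of Monoid", as a sketch (this class is speculative; it did not exist at the time of writing): splitting state into shards whose mconcat gives back the original.

```haskell
{-# LANGUAGE FlexibleInstances #-}
-- Speculative 'split' as an inverse of Monoid: shards mconcat back
-- to the original. Law: mconcat (split n x) == x.
import qualified Data.Map as Map
import Data.Map (Map)
import Data.Char (ord)

class Monoid a => Splittable a where
  split :: Int -> a -> [a]

-- Shard a Map by hashing keys into n buckets; the buckets are
-- disjoint, so their union reassembles the original exactly.
instance Splittable (Map String Int) where
  split n m =
    [ Map.filterWithKey (\k _ -> bucket k == i) m | i <- [0 .. n - 1] ]
    where bucket k = sum (map ord k) `mod` n
```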
-Alex-
You can't. Data has to be somewhere, and if RAM is full it has to go
somewhere else. How can you have more state than will fit in RAM and
*not* swap?
Stefan
It does lead to some interesting questions.
I think it was the Prevayler people who did some informal tests on
what would happen performance-wise if you had enough data to force
swapping. Their conclusion was: don't do that. IIRC, the Java garbage
collector (at the time, some years ago) didn't perform well for
partially swapped processes. In practice you got a lot of thrashing.
I wonder what would happen with HAppS state? I'm too lazy to find out
for myself but eventually somebody's going to do the experiment, maybe
by accident. :-)
--
Darrin
It is by far the easiest to only use Haskell state (that is, in-memory
structures). As Alex said, it is type-safe, it eliminates the need for
serialization and we know how to scale it.
That said, there's no theoretical reason why you shouldn't be able to
use your favorite DB in conjunction with HAppS-State. If you can
satisfy the following things, you'll get the best of both worlds (ie.
scalability, reliability and disk-based storage):
* the DB should not store anything permanently.
* the entire contents of the DB needs to be serializable. That is,
DBConn -> Lazy.ByteString, Lazy.ByteString -> DBConn.
* the contents of the DB needs to be splittable and mergeable.
Splitting usually involves some bit operations over an ID number, and
merging is even more trivial.
Satisfying those requirements is very easy with pure Haskell
structures. However, it should be doable with, say, SQLite.
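Those three requirements could be captured as a class a storage backend would implement; the class and all its method names below are hypothetical, invented to restate the requirements as types:

```haskell
-- Hypothetical interface a DB backend would satisfy (invented names).
import qualified Data.ByteString.Lazy as L

class StateBackend conn where
  freeze  :: conn -> IO L.ByteString   -- entire contents serializable out
  thaw    :: L.ByteString -> IO conn   -- ...and back in
  splitDB :: Int -> conn -> IO [conn]  -- splittable, e.g. by bits of an ID
  mergeDB :: [conn] -> IO conn         -- and mergeable
```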
--
Cheers,
Lemmih
Same goes for an RDBMS. If the disk is full then maybe the service goes
down until it's remedied. Or maybe you get no new writes, which causes
another service to die. ;-)
The most likely scenario is that once you exceed physical ram size
with HAppS state you will experience thrashing which will render your
service unusable, essentially down. Tests might prove otherwise, but I
doubt it.
Until someone proves otherwise, you really have to assume that
physical RAM is the upper bound on data size and live with that, and
monitor your production environment the same as you would your normal
databases.
--
Darrin
As a side note, it might be interesting to compare conventional DBs to
a disk-based version of IxSet.
HAppS-State takes care of reliability, transactions and concurrency.
Databases duplicate this functionality and should therefore have a
larger overhead than a plain IxSet.
--
Cheers,
Lemmih
If you are running e.g. a social networking app where every request
involves status lookups on all the user's friends. Now you need 1 seek
to get the friend list of the user and some number of seeks to get
current information on all of them because friends are randomly
distributed across all your disks and therefore unlikely to be all in
cache. If you are at 10 seeks per request and 10ms per seek then you
can handle 10 requests per second per disk. If you want to handle 1000
requests per second you need 100 disks...
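The back-of-envelope arithmetic above, spelled out:

```haskell
-- Seek-bound throughput: req/s/disk = 1000ms / (seeks * ms per seek).
seekMs, seeksPerRequest, targetReqPerSec :: Double
seekMs          = 10
seeksPerRequest = 10
targetReqPerSec = 1000

reqPerSecPerDisk :: Double
reqPerSecPerDisk = 1000 / (seeksPerRequest * seekMs)  -- 10 req/s/disk

disksNeeded :: Double
disksNeeded = targetReqPerSec / reqPerSecPerDisk      -- 100 disks
```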
If you are running an auction site where you make money from the long
tail of auctioned items and every request involves a lookup on all the
auctions in which the user is selling or interested in buying...
If you are running a web search engine and you are doing a lot of
computation on your state....
So if your app has a lot of data but not a lot of activity then perhaps
disk makes sense. But you also need to account for the cost of disk/dbms
in terms of application complexity including
* developer cost writing/managing code to marshal data to/from your
external storage, and the increased risk of your database schema
getting out of sync with your app code.
* DBA costs from monitoring the database and doing query optimization
to reduce disk seeks, etc.
* sysadmin costs as your deployment complexity grows: you need multiple
disks, a vertically scaled database server, etc.
And there is the likelihood that in the end you use memcached to solve
these problems and end up paying for the memory anyway.
To be clear, I am not saying that you should never use disk. I am
saying for modern web applications you rapidly reach a point where
performance matters much more than storage and given the extra labor
cost of disk, you really need to have a lot of data even to begin
thinking about it. And when you do, you are probably going to have to
manage disk vs ram storage explicitly anyway because it is so app specific.
-Alex-
It should be noted that disk latencies are not completely additive. If
one arbitrary request takes about 9ms, then twenty arbitrary requests
need only take 9ms, for we can sort the requests by cylinder address -
then the seek arm's motion will be monotone, and as such no greater than
the size of the disk. Many operating systems will do this optimization
(but only if the requests are all simultaneously visible, thus
necessitating either multiple processes or some form of asynchronous
I/O).
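Stefan's point, concretely: if requests are served in sorted cylinder order, total arm travel is bounded by one sweep of the disk, regardless of how many requests there are. A sketch:

```haskell
-- Elevator ordering: total arm travel over sorted cylinder addresses
-- is max - min, i.e. at most one sweep of the disk.
import Data.List (sort)

armTravel :: [Int] -> Int
armTravel cs = sum (zipWith (\a b -> abs (a - b)) cs (tail cs))

-- For any non-empty request list:
--   armTravel (sort cs) == maximum cs - minimum cs
-- whereas the arrival order can travel far more.
```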
Stefan (who knows much less about Web apps than he does about computer
architecture, and is well aware that the preceding may be irrelevant)
> To be clear, I am not saying that you should never use disk. I am
> saying for modern web applications you rapidly reach a point where
> performance matters much more than storage and given the extra labor
> cost of disk, you really need to have a lot of data even to begin
> thinking about it. And when you do, you are probably going to have to
> manage disk vs ram storage explicitly anyway because it is so app specific.
This is the most important point for me. The labor cost of getting an
app up and running with Prevayler (I haven't used HAppS yet) is so
much less in my experience than the labor cost of getting an app up
and running with a relational database, object-relational mapping,
etc., that you have plenty of time and money left over for
optimization. And you're starting from a blazing fast and extremely
clean implementation, so optimizing for your particular circumstances
is going to be way more effective. Starting out with a relational
database is one of the more insidious forms of premature optimization.
Until your data is pretty huge, you know it's going to be slower and
harder to maintain -- you're just hoping that your app will be
successful, and that the data needs will be just the kinds of data
needs that a relational database will help with.
In my experience, if your app always has the simplest architecture for
a specific performance target, increasing that performance target by
an order of magnitude tends to take a fixed labor cost. On one
project, we planned the development of scaling to specific performance
targets just like any other features. We asked the project sponsor for
some numbers describing how many users he hoped for at various stages
of launch, labeled A, B, and C. For early alpha testing, we scheduled
support for "1/10% of A". Then there was "1% of A", "10% of A", "A",
"B", "C"...
And, to be really extreme, we made one other feature explicit that
most projects don't: "System remembers data after restart." Before we
got around to that feature, everything was just clean, pure objects in
memory. That let us get pretty far with the data model and user
interface without worrying about persistence at all, which I believe
made us pretty darned productive. When that feature came up in the
development schedule, we dropped in Prevayler in just a couple of days
(most of which was intensive testing just to make sure it really
worked -- even *we* didn't quite believe it could be that easy!). We
even contributed some enhancements back to Prevayler, which ended up
in the Prevayler 2 release.
After that, we had some more nice productive time before getting
around to the "1/10% of A" feature. It turned out that our
super-simple code using Prevayler already met that target, so the
labor required was just some time spent on more automated testing.
Each of the next steps (tackled weeks or months apart) was a similar
fixed piece of work, a couple developers taking less than a week for
10x improvement. For "1% of A", I think it was user-uploaded photos
that started growing past memory bounds (and we weren't even testing
against our deployment server, which would have a couple gigs of
memory -- we were giving ourselves the harder goal of meeting each
performance target on a dev machine with maybe one gig and lots of dev
tools running). So, as Alex described, we just moved those "blobs"
outside of memory and stored them as files on disk. The nice thing
here is that it was such a tiny part of the code accounting for such a
huge chunk of data -- the vast majority of our code could stay
extremely simple and cleanly object-oriented.
For "10% of A" we identified another single kind of data that was
growing out of bounds. We were pretty happy about it, too -- we had
used the simplest and clearest data structure for that data, which had
worked just fine up until then; and it turned out that we didn't have
to move it out of memory at all, we just switched to a slightly less
simple data structure which vastly reduced its memory usage. (It was a
LinkedHashMap<Long, Long> which we replaced with a String "key=value"
encoding.) Again, the vast majority of code got to stay simple and
clean, with just one tiny hotspot being optimized.
I left the project at that point, but we already had plenty of ideas
for additional steps -- using shards, putting more data in blobs on
disk, etc. My point is, we were able to focus our development effort
very effectively on exactly what our particular app needed. And even
if there did come a time when it seemed appropriate to put some of the
data into a database, I'm confident that the vast majority of code
would still be dealing with clean, simple objects in memory.
(There's one other reason that people often put data in a database --
because there's some established enterprise reporting engine. In that
case, the database is really being treated as a full-blown app itself,
which I would just treat as I would treat any external app
integration: Asynchronously, without creating any hard dependencies
between my app and the database. Exporting data from time to time to a
database is much easier than trying to build my whole app on top of
one.)
Cheers,
Justin
New event: <<GetComponent Int>>
HTTP request failed with: src/HAppS/Data/Xml/HaXml.hs:17:19-42:
Irrefutable pattern failed for pattern Text.XML.HaXml.Types.CElem el'
New event: <<CleanSessions [Char]>>
Any clue as to what can be done?
Thanks much,
Sterl.
AllIn.hs is simply broken. HAppS-HEAD is changing too rapidly for us
to keep the examples working.
It should be noted that stability is important to us and we're working
hard on nailing down the back-end in a robust manner.
--
Cheers,
Lemmih
We are trying to make it more solid so we can shift to multimaster.
-Alex-