I did think about moving this logic to the database, but I am toying around with a different model - having the entire data set in memory (possibly across multiple nodes, using messaging infrastructure to communicate). The reasons are:
- write volume is very small but read volume is very high
- each read typically requires complex processing
- most operations cover a large part of the entire dataset
Paying the cost of having the entire data set *efficiently* available for the application (Clojure in this case) means:
- less dependence on (probably hard-to-test) yet-another-bit-of-tech. Integration testing DAOs or Repositories always seems like a lot of work; reducing the number of technical pieces just makes things much easier
- I am hoping clever use of persistent structures will help here, as there is a lot of commonality in the data itself (e.g. 5 projects might actually share 80% of the same state). Clever construction of these might pay dividends - see the sketch after this list
- I don't think I can offload *all* processing onto a third-party technology, so I need the ability to deal with large data sets in memory in real time (whatever that means) - and if I need it for one case, I may as well use it for all.
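To make the structural-sharing point concrete, here is a minimal Clojure sketch (the `project` shape and names are invented for illustration):

```clojure
;; Two versions of a project: the second is derived from the first by
;; changing one field. Clojure's persistent maps share the unchanged
;; structure rather than copying it.
(def project-v1 {:name  "acme"
                 :tasks (vec (range 10000)) ; a large payload
                 :state :active})

;; assoc returns a *new* map; the old one is untouched, and the big
;; :tasks vector is the same object in both versions.
(def project-v2 (assoc project-v1 :state :archived))

(identical? (:tasks project-v1) (:tasks project-v2)) ;=> true
```

So keeping many largely-similar projects (or many historical versions of one project) around is far cheaper than it first sounds.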
Ambitious, and full of hairy concerns! But the idea of moving away from single-threaded web-based applications with big powerful data engines to a single chunk of logic that occasionally throws state to a fairly dumb persistent store is certainly not new ground, and seems to offer a much more powerful architecture.
For example, dealing with historical data is always a pain point. What I want is the ability to snapshot the entire system whenever anything changes, so we can see how the system (or rather the client) has improved over time. In a relational database this would be ridiculous, so I captured a "snapshot of interesting data". Then tomorrow they realise that something else was interesting... We also played with a document store (MongoDB), which makes the job much smaller - just cloning a single document (and related data) - but then it has to be hydrated, so for ease of use a snapshot is taken every X period even if the data hasn't changed. Yuck.
Now Clojure appears, with its extremely memory-efficient way of storing data, and suddenly storing a representation every time the structure changes (which is only once or twice a week) and then realising the entire history in memory feels do-able. This means if a project only changed 5 times over a 3-month period, there would only be 5 instances of that project in storage. Calculating how each project contributes to a historical chart broken down by day (or hour, whatever) is much easier to do in Java/Clojure/whatever than in the third-party store of choice. I am asserting that producing a day-by-day sequence for a project over the last year, when there are only 5 snapshots, will certainly not consume sizeOfProject * daysInYear memory - see the sketch below.
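A rough sketch of that assertion in Clojure (the snapshot representation and names are made up here, with dates simplified to day numbers):

```clojure
;; A handful of snapshots: [day-number project-value] pairs, ordered
;; by day. Hypothetical data standing in for the real store.
(def snapshots
  [[1  {:name "acme" :score 10}]
   [40 {:name "acme" :score 12}]
   [65 {:name "acme" :score 9}]])

;; The effective state on a given day is the latest snapshot taken
;; on or before that day.
(defn state-on [day]
  (->> snapshots
       (take-while (fn [[d _]] (<= d day)))
       last
       second))

;; A value for all 365 days, but only 3 distinct project objects in
;; memory - each day's entry just points at a shared snapshot.
(def daily-states (map state-on (range 1 366)))
```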
(Not sure that was the best example of the pain points I am trying to solve actually :), but anyway).
I guess, after 15 years of using the "web, app-logic, database" template-cutter, I am giving myself a clean piece of paper and asking "what do you want to do, and what is the simplest way to do it?" - and keeping everything in the application layer (rather than the persistence layer) seems appealing.
We aren't dealing with billions of rows - I still need to experiment, but it feels like having our entire data set in memory is possible on a fairly beefy server. I appreciate the JVM isn't the best with huge heaps, but I can work around that (with multiple machines each running their own JVM and communicating over ActiveMQ, for example). Clojure's STM seems to be the final step on the ladder to reach this goal.
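The shape I have in mind is something like the sketch below - the whole data set behind one ref, with the "fairly dumb persistent store" fed occasionally from it (the names and storage format are my own invention, not a worked-out design):

```clojure
;; The entire data set lives behind a single ref. Readers just deref
;; it and get an immutable, consistent snapshot for free.
(def world (ref {:projects {}}))

(defn project [id]
  (get-in @world [:projects id]))

;; Writes are rare: a transaction swaps in a new (structurally
;; shared) value of the world.
(defn update-project! [id f & args]
  (dosync
    (alter world update-in [:projects id] #(apply f % args))))

;; The dumb persistent store: occasionally dump the current value
;; out as Clojure data. A file stands in for the real store here.
(defn persist! [path]
  (spit path (pr-str @world)))
```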
I have previously considered CouchDB (for its views), Hadoop (for its highly scalable and parallelisable map/reduce execution), Cassandra (for its ability to store huge amounts of highly nested structure), and Neo4j (for storing large numbers of small, heavily inter-related nodes). And of course MongoDB, which I am currently using in production. I also considered Erlang and Scala for their distributed actor models, but I am really, really sold on the power of LISP macros.
I dunno - it might be a fool's errand, but spreading the complexity over that much technology just seems like hard work. *If* the working set can be stored in current memory, then I think a much simpler and much more powerful solution will emerge. Sure, I am putting all my eggs in the Clojure-plus-my-own-ability basket, at the risk of re-inventing the wheel, but maybe that is the right thing to do - building the simplest and most elegant solution with new tools.
I probably ate something that disagreed with me, but I just want to break free from the shackles of these heavy-weight tools and fly! OK - that's enough.
Or, it might all be a catastrophic failure and I will be signing up to Careers 2.0 :)
Col
P.S. Usual disclaimer - I've still only written three lines of Clojure :)