Thanks for your interest in Terrastore: I'll be more than happy to
help; my answers are in-line :)
> I'm looking to store a lot of very small documents such as analytics
> data, etc. and I think I'm getting thrown off by the deep Terracotta
> integration of terrastore. My hope was that Terrastore would be suited
> towards large databases of mostly disk oriented data, using Terracotta
> for indexes and supporting the operation of the large document store.
> Is this correct? I ask this question because it seems that Terrastore
> has focused much more on data consistency and live scalability, so I
> would like to use it in place of a Mongo / Couch DB.
Terrastore is very well suited for storing small documents and
processing them, e.g. for analytics purposes: we actually use it in
production for a similar use case.
> Would the current release work in a scenario like above with 500 GB of
> data or is Terrastore designed for a particular range that I'm unaware
> of?
Terrastore is designed/implemented to hold everything in the servers'
memory, and provide durable storage through the Terracotta master: so,
for storing 500 GB of *live* data, you'd need:
1) An adequate number of servers, such that the total available memory
roughly matches that number (e.g. 50 servers * 10 GB each).
2) An active master with a large quantity of memory, let's say at
least equal to [(available memory for a single server) + (500 - (total
available memory for all servers))], or multiple active masters if
using an ensemble.
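To put some numbers on the two points above, here is a rough sketch of
the arithmetic in Python. It is only an illustration, not anything that
ships with Terrastore: the function and parameter names are made up, and
clamping the shortfall at zero (for the case where the servers hold more
than the live data) is an assumption of the sketch.

```python
def master_memory_gb(live_data_gb, servers, memory_per_server_gb):
    """Rough lower bound for the active master's memory, in GB.

    Follows the rule of thumb above:
    (memory of a single server) + (live data - total server memory).
    """
    total_server_memory_gb = servers * memory_per_server_gb
    # Assumption: if the servers can hold all the live data, the
    # shortfall is zero rather than negative.
    shortfall_gb = max(0, live_data_gb - total_server_memory_gb)
    return memory_per_server_gb + shortfall_gb


# 50 servers * 10 GB cover the whole 500 GB of live data.
print(master_memory_gb(500, 50, 10))
# 40 servers * 10 GB leave 100 GB that the master must absorb.
print(master_memory_gb(500, 40, 10))
```

With fewer servers the shortfall moves to the master, which is why the
second case asks for a much larger master than the first.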
You could get around those requirements by:
1) Reducing the quantity of live data, which would reduce requirements
of the first point above.
2) Trading memory space for latency, by reducing the quantity of
available memory for all servers: in such a case, servers will request
missing data from the master, and the master will read it from disk
if it is unavailable from its own memory cache.
BTW, please keep in mind that with 500 GB of data, starting up *the
whole cluster*, that is, masters and servers from scratch, will
probably take a somewhat long time (while just adding servers
shouldn't be affected that much).
> My search so far has turned me away from many of the other options
> because they don't seem to be intended for very write heavy usage, or
> scaling is done with simplistic partitioning, allowing a single node
> to get overloaded, etc.
Just out of curiosity, did you try other distributed data stores such
as Cassandra, Riak or HBase?
> Any information outside of these questions on
> the general topic of what types of applications Terrastore IS designed
> for would also be welcome for myself and I'm sure others in the future.
I think Terrastore is geared towards storage and processing of
small-to-medium data sets, just that: it may be a good choice for
larger data sets too, but then you need to pay attention to memory
organization and management, as well as extra tuning.
> Lastly, if Terrastore turns out to be the right project for us, where
> do you mostly need help with the project? We are very busy typically
> as a pretty diverse group of developers, but can do a lot with web
> applications, mobile, Java, etc. or even if you just need some
> documentation help.
Technically speaking, pretty much everything Terrastore needs is
listed on the issue tracker at
http://code.google.com/p/terrastore/issues/list ... we could talk
about the most urgent issues there, or you could just point out
your own.
But also, Terrastore is still in its early stages, and it really
needs a larger community and more buzz :) So, if you try it and like
it, it would be great if you could also write something about it and
share your experiences.
> Thanks in advance for any help the group can offer and I want to
> express how much I appreciate your work and what you are sharing.
Thanks to you for your interest and very nice attitude :)
Hope I was able to help with your doubts, feel free to get back with
more questions.
Cheers!
Sergio B.
--
Sergio Bossa
http://www.linkedin.com/in/sergiob
> I've thoroughly checked out and played with MongoDB mostly, but am
> finding that it has issues with the way it does exclusive locking on
> writes, the replication is not fully mature yet, and the database can
> not be scaled live easily. That said, it is a great project while not
> ideal for my use case. I look forward to seeing it mature and having
> the right task for it in the future.
> CouchDB is the project I was most excited about as an Apache project,
> but it scales only by replication, which for even a remotely large
> dataset with heavy writes is not a great move either.
Yep, I agree Mongo and Couch may not be that good for your use case as of now.
> I've seen a little, but am very unfamiliar with HBase, however Riak
> looks like it may be perfect as well given it does seem to scale very
> well. Though it shouldn't affect us much, my main concern was the way
> it handled inserts / updates to the same object, making sibling
> documents and not easily just giving you back the latest version. I
> think I may just need to install it and vet it much more thoroughly as
> I'm sure that is configurable. Can I ask your thoughts on it so far?
I don't have any real-world experience with Riak, but I know it works
damn well, and I'm pretty sure there must be a way to just ask for the
latest document version.
That said, there may be some "cons" (as there's no silver bullet) ...
1) If you're a JVM shop, lack of experience with Erlang and its VM may
be an issue, and you'd need to invest in it.
2) Riak is an eventually consistent store, so you either need to set
R+W>N and get consistent reads/writes, so that you can basically
ignore vector clocks, or take vector clocks into account; in the
latter case, I'd discourage you from just "asking for the latest
document version".
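As a quick illustration of the R+W>N rule: N is the number of replicas
of a key, R the number of replicas that must answer a read, and W the
number that must acknowledge a write. The helper below is a hypothetical
sketch in Python, not part of any Riak client; it only checks whether
every read quorum is guaranteed to overlap with every write quorum.

```python
def quorums_overlap(n, r, w):
    """True when any R replicas and any W replicas (out of N) must
    share at least one replica, i.e. when R + W > N."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


# N=3 with R=2, W=2: every read sees at least one up-to-date replica.
print(quorums_overlap(3, 2, 2))
# N=3 with R=1, W=1: a read may hit a replica the write never reached.
print(quorums_overlap(3, 1, 1))
```

When the check fails, reads can return stale or sibling values, and
that is the case where vector clocks have to be taken into account.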
> I love the clustering setup and simple structure of terrastore, which
> is what intrigued me and I knew it was based on terracotta (figuring
> it was for the indexes and cache lookups), but didn't realize entirely
> hehe.
Well, it is based on Terracotta for cluster management and data
replication/storage ... is that a problem/concern for you?
> I'm realizing the key set of features I need are scaling by
> partitioning or balance (not replication), fast writes, simple query
> structure (but doesn't have to be totally mature yet because most of
> my searching will be done via elastic search, so the key value store
> is mostly just a store for the detailed records).
Terrastore doesn't shine in write scalability: I mean, performance is
around a few thousand writes per second (at least on the hardware I
tested), but scalability doesn't stay linear as you add servers
because it depends on the active master, so you'd need to use an
ensemble with multiple active masters, and the problem with the
ensemble is that you can't add/remove clusters as of now.
For everything else, I see Terrastore as a good fit, so I'd say: try
Terrastore with your expected numbers (in the end, you may not need
unbounded write scalability), and see how it works.
> Is there perhaps a library or simpler distributed key/
> lookup system, rather than full db implementation I should be looking
> at instead. I just feel there is too much room for error to
> reimplement a new flat storage system (not to mention reinventing the
> wheel), you know?
Do you mean an embedded/local database, or a distributed one
satisfying your requirements above?
My answers in-line :)
> Firstly, Terrastore is a wicked cool project and I may have other
> plans for it in the future. There is no concern for using the
> Terracotta backend at all and I'm impressed with how innovatively you
> have used it in Terrastore.
Thanks for the kind words :)
> I'm essentially going to use the KV database for a
> glorified disk writer for massive amounts of small analytics data,
> while using Elastic Search to fully index the last set of data from X
> time (say, 60 days), rolling it out over time, and to maintain
> aggregate reports over the whole time data is collected, so that
> indexes of key data are aggregated over time and real time querying
> can happen against the recent data. Beyond this, I would store ALL
> data for a much longer period of time in the Store, allowing us to
> reindex that data later or do a Map/Reduce on occasion to pull sets
> into elastic search or for data that we simply need to grab as needed
> (which we can wait for). Does this make the use case a bit more clear?
Yep.
If you want to go with Terrastore, given you're planning to host "old"
data too, you should really set up an ensemble: by doing so, you'll be
able to leverage multiple Terracotta active masters; unfortunately,
you'd have to do some upfront capacity planning, as the ensemble
cannot be dynamically expanded (that is, you cannot add more
clusters/masters), but I hope to find the time to fix this in the near
future.
Otherwise, you could go with Riak, as already mentioned, or even
ElasticSearch on its own, and store *all* your data in a single place.
In the latter case ElasticSearch would store in its own Lucene
directories both your data and related indexes: I'm somewhat
uncomfortable with this setup, as Lucene directories shouldn't be used
as datastores, but Shay Banon, ElasticSearch author, encourages that,
and he probably knows better than me, so give it a try too.
> On the other hand, where I am really excited about Terrastore is my
> history working with CDN and unique caching situations.
> [CUT]
> Ok, getting myself a
> little excited haha, sorry for going off there. :-)
Yeah ... sounds cool :)
> After all that, I wanted to give you an update that I checked further
> into Riak and have created a small Riak cluster on my local machine
> and been quite impressed with the documentation and libraries as well
> as the quality. I'm already pulling data, pushing it through RabbitMQ
> and getting it into the database (though, for some reason, they didn't
> allow the Protocol Buffer client to get generated keys on the server
> rather than creating them yourself). It is really neat, but I am
> having to deeply consider what the integration with Elastic Search
> will be like. I'm really happy that you saw the power in making an
> adapter for search mechanisms rather than tightly coupling the two
> together. I'm going to have to figure out where data will go first and
> how it will be passed effectively to the other now.
Yep, Riak is a great project backed by great people.
I agree with the Riak + ElasticSearch solution, certainly a great combo.
> Regardless, I am going to be playing with Terrastore a bit more on
> some instances in our virtual environment so I can get a feel for it
> more.
Cool ... do not hesitate to come back with questions ;)
> Thanks again for all your time and I'll definitely be spreading the
> word about Terrastore, starting with a user group I attend. Will give
> you a copy of anything if I get a demo going for it or have the
> opportunity to do something more than just talk.
That would be great, and much appreciated :)
> Cheers to you my friend, if you are ever in the Dallas, TX area I'll
> buy you a beverage of your choice....hell, I'll do that anyway, just
> let me know where to send it :-)
That was a pleasure, no worries ... we're passionate about our jobs,
and we do it with pleasure :)
Hope to read from you again,
Cheers!
Sergio B.