Interested in the right use case for Terrastore


Mike Kelp

Nov 7, 2011, 10:17:59 PM
to terrastore-discussions
Hi all,

I'm really interested in Terrastore as a document store and had a
couple questions if anyone has the opportunity to help me better
understand.

I'm looking to store a lot of very small documents such as analytics
data, etc. and I think I'm getting thrown off by the deep Terracotta
integration of terrastore. My hope was that Terrastore would be suited
towards large databases of mostly disk oriented data, using Terracotta
for indexes and supporting the operation of the large document store.
Is this correct? I ask this question because it seems that Terrastore
has focused much more on data consistency and live scalability, so I
would like to use it in place of a Mongo / Couch DB.

Would the current release work in a scenario like above with 500 GB of
data or is Terrastore designed for a particular range that I'm unaware
of? Once again, this is inline with trying to create that really
solid, scalable, very simple data store (which I will likely use a
project like Elastic Search to search). The intent is to back a very
write heavy application, but allow mining and reporting on the data,
rolling out older data over time.

My search so far has turned me away from many of the other options
because they don't seem to be intended for very write heavy usage, or
scaling is done with simplistic partitioning, allowing a single node
to get overloaded, etc. Any information outside of these questions on
the general topic of what types of applications Terrastore IS designed
for would also be welcome for myself and, I'm sure, others in the future.

Lastly, if Terrastore turns out to be the right project for us, where
do you mostly need help with the project? We are very busy typically
as a pretty diverse group of developers, but can do a lot with web
applications, mobile, Java, etc. or even if you just need some
documentation help. I just want to be aware of the best ways to help
going in because I feel it's too easy to never ask the question if I
don't ask early :-)

Thanks in advance for any help the group can offer and I want to
express how much I appreciate your work and what you are sharing.

Cheers,
Mike.

Sergio Bossa

Nov 8, 2011, 9:41:22 AM
to terrastore-...@googlegroups.com
Hi Mike,

thanks for your interest in Terrastore: I'll be more than happy to
help, my answers in-line :)

> I'm looking to store a lot of very small documents such as analytics
> data, etc. and I think I'm getting thrown off by the deep Terracotta
> integration of terrastore. My hope was that Terrastore would be suited
> towards large databases of mostly disk oriented data, using Terracotta
> for indexes and supporting the operation of the large document store.
> Is this correct? I ask this question because it seems that Terrastore
> has focused much more on data consistency and live scalability, so I
> would like to use it in place of a Mongo / Couch DB.

Terrastore is very well suited for storing small documents and
processing them, e.g. for analytics purposes: we actually use it in
production for a similar use case.

> Would the current release work in a scenario like above with 500 GB of
> data or is Terrastore designed for a particular range that I'm unaware
> of?

Terrastore is designed/implemented to hold everything in the servers'
memory, and provide durable storage through the Terracotta master: so,
for storing 500 GB of *live* data, you'd need:
1) An adequate number of servers, such that the total available memory
almost matches that number (e.g. 50 servers * 10 GB each).
2) An active master with a large quantity of memory, let's say at
least equal to [(available memory for a single server) + (500 - (total
available memory for all servers))], or multiple active masters if
using an ensemble.
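
To make the arithmetic in point 2 concrete, here is a rough sizing sketch in Python. The function name and the server counts are made up for illustration, and clamping the overflow at zero is an assumption on top of the formula; none of this comes from Terrastore itself.

```python
def min_master_memory_gb(live_data_gb, servers, memory_per_server_gb):
    """Lower bound on active master memory, following the rule of thumb:
    (memory of a single server) + (live data - total server memory)."""
    total_server_memory_gb = servers * memory_per_server_gb
    # Assumption: if the servers can hold everything, the overflow is zero
    # rather than negative.
    overflow_gb = max(0, live_data_gb - total_server_memory_gb)
    return memory_per_server_gb + overflow_gb


# 50 servers * 10 GB cover the whole 500 GB, so only the single-server term remains.
print(min_master_memory_gb(500, 50, 10))  # 10
# With only 30 such servers, 200 GB of live data overflow onto the master.
print(min_master_memory_gb(500, 30, 10))  # 210
```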

You could get around those requirements by:
1) Reducing the quantity of live data, which would reduce requirements
of the first point above.
2) Trading memory space for latency, by reducing the quantity of
available memory for all servers: in such a case, servers will request
missing data from the master, and the master will eventually read it
from disk if unavailable from its own memory cache.
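
The read path described in point 2 can be pictured with a toy model like the one below. This is only an illustration of the fallback order (server memory, then master memory, then master disk); the class and its fields are hypothetical, eviction is left out, and it is not how Terrastore or Terracotta are actually implemented.

```python
class TieredStore:
    """Toy model of the lookup order: server memory -> master memory -> master disk."""

    def __init__(self, disk):
        self.server_memory = {}
        self.master_memory = {}
        self.disk = disk  # stands in for the master's durable storage
        self.hits = {"server": 0, "master": 0, "disk": 0}

    def get(self, key):
        # Fast path: the server already holds the document.
        if key in self.server_memory:
            self.hits["server"] += 1
            return self.server_memory[key]
        # Otherwise ask the master, which falls back to disk on a miss.
        if key in self.master_memory:
            self.hits["master"] += 1
        else:
            self.hits["disk"] += 1
            self.master_memory[key] = self.disk[key]
        value = self.master_memory[key]
        self.server_memory[key] = value
        return value


store = TieredStore(disk={"event:1": {"clicks": 3}})
store.get("event:1")  # slowest path: the master reads from disk
store.get("event:1")  # now served from server memory
print(store.hits)  # {'server': 1, 'master': 0, 'disk': 1}
```

The less memory the servers have, the more reads take the slower branches, which is the latency cost being traded for space.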

BTW, please keep in mind that with 500 GB of data, starting up *the
whole cluster*, that is, masters and servers from scratch, will
probably take a somewhat long time (while just adding servers
shouldn't be affected that much).

> My search so far has turned me away from many of the other options
> because they don't seem to be intended for very write heavy usage, or
> scaling is done with simplistic partitioning, allowing a single node
> to get overloaded, etc.

Just out of curiosity, did you try other distributed data stores such
as Cassandra, Riak or HBase?

> Any information outside of these questions on
> the general topic of what types of applications Terrastore IS designed
> for would also welcome for myself and I'm sure others in the future.

I think Terrastore is geared towards storage and processing of
small-to-medium data sets, just that: it may be a good choice for
larger data sets too, but then you need to pay attention to memory
organization and management, as well as extra tunings.

> Lastly, if Terrastore turns out to be the right project for us, where
> do you mostly need help with the project? We are very busy typically
> as a pretty diverse group of developers, but can do a lot with web
> applications, mobile, Java, etc. or even if you just need some
> documentation help.

Technically speaking, pretty much everything Terrastore needs is
listed on the issue tracker at
http://code.google.com/p/terrastore/issues/list ... we could talk
about the most urgent issues there, or you could just point out
your own.
But also, Terrastore is still in its early stages, and it would really
need a larger community and more buzz :) So, if you try it and like
it, it would be great if you could also write something about it and
share your experiences.

> Thanks in advance for any help the group can offer and I want to
> express how much I appreciate your work and what you are sharing.

Thanks to you for your interest and very nice attitude :)
Hope I was able to help with your doubts, feel free to get back with
more questions.
Cheers!

Sergio B.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Mike Kelp

Nov 8, 2011, 10:44:01 AM
to terrastore-discussions
Thanks for the thorough response. Very helpful.

I've thoroughly checked out and played with MongoDB mostly, but am
finding that it has issues with the way it does exclusive locking on
writes, the replication is not fully mature yet, and the database
cannot be scaled live easily. That said, it is a great project, just not
ideal for my use case. I look forward to seeing it mature and having
the right task for it in the future.

CouchDB is the project I was most excited about as an Apache project,
but it scales only by replication, which for even a remotely large
dataset with heavy writes is not a great move either.

I've seen a little, but am very unfamiliar with HBase, however Riak
looks like it may be perfect as well given it does seem to scale very
well. Though it shouldn't affect us much, my main concern was the way
it handled inserts / updates to the same object, making sibling
documents and not easily just giving you back the latest version. I
think I may just need to install it and vet it much more thoroughly as
I'm sure that is configurable. Can I ask your thoughts on it so far?

I love the clustering setup and simple structure of terrastore, which
is what intrigued me and I knew it was based on terracotta (figuring
it was for the indexes and cache lookups), but didn't realize entirely
hehe. I'm realizing the key set of features I need are scaling by
partitioning or balance (not replication), fast writes, simple query
structure (but doesn't have to be totally mature yet because most of
my searching will be done via elastic search, so the key value store
is mostly just a store for the detailed records).

You seem from other posts very knowledgeable on the disk IO side of
development. Is there perhaps a library or simpler distributed
key/lookup system, rather than a full db implementation, that I should
be looking at instead? I just feel there is too much room for error to
reimplement a new flat storage system (not to mention reinventing the
wheel), you know?

Mike.

Sergio Bossa

Nov 8, 2011, 11:17:40 AM
to terrastore-...@googlegroups.com
On Tue, Nov 8, 2011 at 4:44 PM, Mike Kelp <mike...@gmail.com> wrote:

> I've thoroughly checked out and played with MongoDB mostly, but am
> finding that it has issue with the way it does exclusive locking on
> writes, the replication is not fully mature yet, and the database can
> not be scaled live easily. That said, it is a great project while not
> ideal for my use case. I look forward to seeing it mature and having
> the right task for it in the future.
> CouchDB is the project I was most excited about as an Apache project,
> but it scales only by replication, which for even a remotely large
> dataset with heavy writes is not a great move either.

Yep, I agree Mongo and Couch may not be that good for your use case as of now.

> I've seen a little, but am very unfamiliar with HBase, however Riak
> looks like it may be perfect as well given it does seem to scale very
> well. Though it shouldn't affect us much, my main concern was the way
> it handled inserts / updates to the same object, making sibling
> documents and not easily just giving you back the latest version. I
> think I may just need to install it and vet it much more thoroughly as
> I'm sure that is configurable. Can I ask your thoughts on it so far?

I don't have any real-world experience with Riak, but I know it works
damn well, and I'm pretty sure there must be a way to just ask for the
latest document version.
That said, there may be some "cons" (as there's no silver bullet) ...
1) If you're a JVM shop, lack of experience with Erlang and its VM may
be an issue, and you'd need to invest in it.
2) Riak is an eventually consistent store, so you either need to set
R+W>N and get consistent reads/writes, so that you can basically
ignore vector clocks, or take vector clocks into account; in the
latter case, I'd discourage you from just "asking for the latest
document version".
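
As a rough illustration of why R+W>N matters, the sketch below compares the rule with a brute-force enumeration of replica sets. It is generic quorum arithmetic with made-up function names, not Riak client code, and it assumes N replicas with reads touching R of them and writes touching W.

```python
from itertools import combinations


def quorums_overlap(n, r, w):
    """The rule of thumb: read and write quorums must overlap, i.e. R + W > N."""
    return r + w > n


def overlap_by_enumeration(n, r, w):
    """Brute force: does every set of R replicas share a member with every set of W?"""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


print(quorums_overlap(3, 2, 2), overlap_by_enumeration(3, 2, 2))  # True True
print(quorums_overlap(3, 1, 1), overlap_by_enumeration(3, 1, 1))  # False False
```

When the quorums overlap, every read contacts at least one replica that saw the latest successful write; when they don't, a read can miss it entirely.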

> I love the clustering setup and simple structure of terrastore, which
> is what intrigued me and I knew it was based on terracotta (figuring
> it was for the indexes and cache lookups), but didn't realize entirely
> hehe.

Well, it is based on Terracotta for cluster management and data
replication/storage ... is that a problem/concern for you?

> I'm realizing the key set of features I need are scaling by
> partitioning or balance (not replication), fast writes, simple query
> structure (but doesn't have to be totally mature yet because most of
> my searching will be done via elastic search, so the key value store
> is mostly just a store for the detailed records).

Terrastore doesn't shine in write scalability: I mean, performance is
around a few thousand writes per second (at least on the hardware I
tested), but scalability doesn't stay linear as you add servers
because it depends on the active master, so you'd need to use an
ensemble with multiple active masters, and the problem with the
ensemble is that you can't add/remove clusters as of now.

For everything else, I see Terrastore as a good fit, so I'd say: try
Terrastore with your expected numbers (in the end, you may not need
unbounded write scalability), and see how it works.

> Is there perhaps a library or simpler distributed key/
> lookup system, rather than full db implementation I should be looking
> at instead. I just feel there is too much room for error to
> reimplement a new flat storage system (not to mention reinventing the
> wheel), you know?

Do you mean an embedded/local database, or a distributed one
satisfying your requirements above?

Mike Kelp

Nov 13, 2011, 12:34:38 PM
to terrastore-discussions
Thanks so much for your excellent replies Sergio.

I've been looking around a bit and thought I would come back to
respond and give you an update.

Firstly, Terrastore is a wicked cool project and I may have other
plans for it in the future. There is no concern for using the
Terracotta backend at all and I'm impressed with how innovatively you
have used it in Terrastore. The catch is applying what Terracotta is
best at to my use case, which doesn't benefit as much from using memory
as the main storage area. I'm essentially going to use the KV database
as a glorified disk writer for massive amounts of small analytics data,
while using Elastic Search to fully index the most recent data (say,
the last 60 days), rolling older data out over time, and to maintain
aggregate reports over the whole period data is collected. That way
indexes of key data are aggregated over time and real-time querying
can happen against the recent data. Beyond this, I would store ALL
data for a much longer period of time in the store, allowing us to
reindex that data later, do a Map/Reduce on occasion to pull sets
into Elastic Search, or simply grab data as needed (which we can
wait for). Does this make the use case a bit more clear?
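
A minimal sketch of that split, assuming a 60-day window: every record goes to the long-term store, and only recent ones are also indexed. The names `route`, `expired` and the "store"/"index" labels are hypothetical placeholders, not calls into Elastic Search or any KV client.

```python
from datetime import datetime, timedelta

INDEX_WINDOW = timedelta(days=60)  # the "say, 60 days" window from the text


def route(event_time, now):
    """Where a record should go: always the store, plus the index if it is recent."""
    targets = ["store"]
    if now - event_time <= INDEX_WINDOW:
        targets.append("index")
    return targets


def expired(indexed_times, now):
    """Timestamps that have aged out of the window and can be rolled out of the index."""
    return [t for t in indexed_times if now - t > INDEX_WINDOW]


now = datetime(2011, 11, 13)
print(route(datetime(2011, 11, 1), now))  # ['store', 'index']
print(route(datetime(2011, 6, 1), now))   # ['store']
print(len(expired([datetime(2011, 6, 1), datetime(2011, 11, 1)], now)))  # 1
```

Because the store keeps everything, anything rolled out of the index can be reindexed from it later.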

On the other hand, where I am really excited about Terrastore is my
history working with CDN and unique caching situations. I have found
many places where a CDN is ideal except for the fact that you need
some extra layer of security or just enough dynamism in the responses
to the client that the CDN cannot handle it. Think tokenized
authentication for a caching network being done with the best
combination of speed and lightweight integration with encryption
similar to Amazon's timed Urls but with integrated user auth and even
IP bound once first accessed. These are some places where Terrastore
and other Key-Value DBs will have an amazing amount of power. On top
of that, the language and application servers we use for web
applications (ColdFusion / Railo, which are built on Java) are adding,
or have already added, the ability to swap the storage engine for the
Session scope (and possibly others in the future, such as the
Application and Server scopes) dynamically. An application could then
become scalable immediately, without changing a line of code, even if
it wasn't written with clustering in mind. That comes up with some
older applications and honestly is many times more efficient than
complex clustering solutions. Ok, getting myself a
little excited haha, sorry for going off there. :-)

The comment regarding a simple long term storage for the data kind of
highlights what I'm looking for in the last question. I was very
impressed with how Riak split the backend storage away from the rest
of the solution, highlighting that its key feature is the management
of the cluster and nodes while optimizing performance (the things the
Akamai guys are really good at). I really hope this means solutions
like terrastore and others will be implemented (or I can implement) as
storage solutions for the right application too. Now if only they had
done the same thing with the search mechanism :-) Alas, I noticed that
they created Bitcask and also mention a few others that I would use
directly if it came down to dumbing down the data store even more.

After all that, I wanted to give you an update that I checked further
into Riak and have created a small Riak cluster on my local machine
and have been quite impressed with the documentation and libraries as well
as the quality. I'm already pulling data, pushing it through RabbitMQ
and getting it into the database (though, for some reason, they didn't
allow the Protocol Buffer client to get generated keys on the server
rather than creating them yourself). It is really neat, but I am
having to deeply consider what the integration with Elastic Search
will be like. I'm really happy that you saw the power in making an
adapter for search mechanisms rather than tightly coupling the two
together. I'm going to have to figure out where data will go first and
how it will be passed effectively to the other now.

Regardless, I am going to be playing with Terrastore a bit more on
some instances in our virtual environment so I can get a feel for it
more. We actually have a small infrastructure currently but a pretty
significant amount of memory per core (we got lucky when VMWare
modified their licensing scheme and worked out best value for most
memory / CPU) so perhaps the ideal use case is likely in our medium
term haha. Finding Terrastore made me realize how wide-ranging the
possibilities are and look at the availability of resources a bit
differently regarding document storage and the potential of the CPU /
disk / memory resources we have available to us.

Thanks again for all your time and I'll definitely be spreading the
word about Terrastore, starting with a user group I attend. Will give
you a copy of anything if I get a demo going for it or have the
opportunity to do something more than just talk.

Cheers to you my friend, if you are ever in the Dallas, TX area I'll
buy you a beverage of your choice....hell, I'll do that anyway, just
let me know where to send it :-)

Mike.




Sergio Bossa

Nov 14, 2011, 12:11:16 PM
to terrastore-...@googlegroups.com
Hi Mike,

my answers in-line :)

> Firstly, Terrastore is a wicked cool project and I may have other
> plans for it in the future. There is no concern for using the
> Terracotta backend at all and I'm impressed with how innovatively you
> have used it in Terrastore.

Thanks for the kind words :)

> I'm essentially going to use the KV database for a
> glorified disk writer for massive amounts of small analytics data,
> while using Elastic Search to fully index the last set of data from X
> time (say, 60 days), rolling it out over time, and to maintain
> aggregate reports over the whole time data is collected, so that
> indexes of key data are aggregated over time and real time querying
> can happen against the recent data. Beyond this, I would store ALL
> data for a much longer period of time in the Store, allowing us to
> reindex that data later or do a Map/Reduce on occasion to pull sets
> into elastic search or for data that we simply need to grab as needed
> (which we can wait for). Does this make the use case a bit more clear?

Yep.

If you want to go with Terrastore, given you're planning to host "old"
data too, you should really set up an ensemble: by doing so, you'll be
able to leverage multiple Terracotta active masters; unfortunately,
you'd have to do some upfront capacity planning, as the ensemble
cannot be dynamically expanded (that is, you cannot add more
clusters/masters), but I hope to find the time to fix this in the near
future.

Otherwise, you could go with Riak, as already mentioned, or even
ElasticSearch on its own, and store *all* your data in a single place.
In the latter case ElasticSearch would store in its own Lucene
directories both your data and related indexes: I'm somewhat
uncomfortable with this setup, as Lucene directories shouldn't be used
as datastores, but Shay Banon, ElasticSearch author, encourages that,
and he probably knows better than me, so give it a try too.

> On the other hand, where I am really excited about Terrastore is my
> history working with CDN and unique caching situations.

> [CUT]


> Ok, getting myself a
> little excited haha, sorry for going off there. :-)

Yeah ... sounds cool :)

> After all that, I wanted to give you an update that I checked further
> into Riak and have created a small Riak cluster on my local machine
> and been quite impressed with the documentation and libraries as well
> as the quality. I'm already pulling data, pushing it through RabbitMQ
> and getting it into the database (though, for some reason, they didn't
> allow the Protocol Buffer client to get generated keys on the server
> rather than creating them yourself). It is really neat, but I am
> having to deeply consider what the integration with Elastic Search
> will be like. I'm really happy that you saw the power in making an
> adapter for search mechanisms rather than tightly coupling the two
> together. I'm going to have to figure out where data will go first and
> how it will be passed effectively to the other now.

Yep, Riak is a great project backed by great people.
I agree with the Riak + ElasticSearch solution, certainly a great combo.

> Regardless, I am going to be playing with Terrastore a bit more on
> some instances in our virtual environment so I can get a feel for it
> more.

Cool ... do not hesitate to come back with questions ;)

> Thanks again for all your time and I'll definitely be spreading the
> word about Terrastore, starting with a user group I attend. Will give
> you a copy of anything if I get a demo going for it or have the
> opportunity to do something more than just talk.

That would be great, and much appreciated :)

> Cheers to you my friend, if you are ever in the Dallas, TX area I'll
> buy you a beverage of your choice....hell, I'll do that anyway, just
> let me know where to send it :-)

That was a pleasure, no worries ... we're passionate about our jobs,
and we do it with pleasure :)

Hope to read from you again,
Cheers!

Sergio B.
