Using Terrastore as IDE workspace data backend

Marcel Bruch

Sep 26, 2011, 12:14:59 AM

to terrastore-discussions

Hi,

I'm looking for some advice on whether Terrastore might be a good
choice for my problem.
It's not a common setup for Terrastore, hence my question. Any
comments on this are greatly appreciated.

= Short description of the challenge =

We've built an Eclipse-based tool that analyzes Java source files and
stores its analysis results in additional files. The workspace
potentially has hundreds of projects, and each project may have up to a
few thousand files. Say there are 200 projects and 1,000 Java
source files per project in a single workspace. Then there will be
200 * 1,000 = 200,000 files.

On a full workspace build, all these 200k files have to be compiled
(by the IDE) and analyzed (by our tool) at once and the analysis
results have to be dumped to disk rather fast.
But the most common use case is that a single file is changed several
times per minute, and thus gets analyzed and dumped frequently.

At the moment, the analysis results are dumped to disk as plain JSON
files: one JSON file for each Java class (the class name is the key).
Each JSON file is around 5 to 100 KB in size; some files grow up to
several megabytes (<10 MB).

= Question =

We would like to replace the simple file system approach with a more
sophisticated one, and I wonder whether Terrastore may be a
suitable backend for this use case, especially given that master and
server have to run on a developer machine...

What's your suggestion? Is Terrastore capable of quickly loading and
storing our data, even if 200k files (nodes + their sub-nodes) have to
be updated in a very short time?

Thanks for your suggestions. If you need more details on what
operations are performed or what the data looks like, I would be glad
to take your questions.

Marcel

--
Eclipse Code Recommenders:
w www.eclipse.org/recommenders
tw www.twitter.com/marcelbruch
g+ www.gplus.to/marcelbruch

Sergio Bossa

Sep 26, 2011, 4:39:20 AM

to terrastore-...@googlegroups.com

Hi Marcel,

thank you for taking Terrastore into consideration.

Unfortunately, I think Terrastore, like any other distributed NoSQL
database, wouldn't be a good fit for your use case, which seems to
require an embedded database.
Also, I think even a fast embedded database would hardly improve your
performance, since you're IO-bound: whether you store things as
plain files or database entries, IO will always be your bottleneck.
So, I'd suggest instead one (or a combination) of the following approaches:
1) Employ asynchronous writes: that is, do not dump files to the
filesystem immediately and one by one, but enqueue them and continue
processing; this may weaken durability, but that may not be a
problem for you since code analysis could be redone in case of
failures, and in case you can't tolerate data loss, you can implement
a simple redo log.
2) Try saving file diffs, rather than the whole file every time it gets updated.
3) Implement the workspace as a log-structured storage system, that
is, always write your data sequentially on disk and keep a kind of
"file pointer" to each data entry: IO (for rotating disks) is
dominated by seek time, so writing everything sequentially with no
seeks will probably improve performance a lot; you can take a look at
HawtJournal for that (https://github.com/sbtourist/hawtjournal).
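
To make approaches 1 and 3 a bit more concrete, here is a minimal
sketch in plain Java that combines them. It is not HawtJournal's API
and not anything Terrastore provides; the class name AnalysisLog and
its methods are made up for illustration. Callers enqueue a write and
continue, a single writer thread appends every record at the end of
one file, and an in-memory index keeps the "file pointer" (offset and
length) of the latest record per class name. It assumes the index can
be rebuilt by re-running the analysis, so it never calls force() and
does not recover the index on restart.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical append-only store for per-class analysis results. */
final class AnalysisLog implements AutoCloseable {

    /** "File pointer": where one record lives inside the log file. */
    private static final class Pointer {
        final long offset;
        final int length;
        Pointer(long offset, int length) { this.offset = offset; this.length = length; }
    }

    private final FileChannel channel;
    private final Map<String, Pointer> index = new ConcurrentHashMap<>();
    // One writer thread: callers never block on IO, and appends stay sequential.
    private final ExecutorService writer = Executors.newSingleThreadExecutor();
    private long end; // next append position; only used by the writer thread

    AnalysisLog(Path file) {
        try {
            channel = FileChannel.open(file, StandardOpenOption.CREATE,
                    StandardOpenOption.READ, StandardOpenOption.WRITE);
            end = channel.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Enqueues a write and returns immediately; the future completes after the append. */
    CompletableFuture<Void> put(String className, String json) {
        byte[] bytes = json.getBytes(StandardCharsets.UTF_8);
        return CompletableFuture.runAsync(() -> append(className, bytes), writer);
    }

    private void append(String className, byte[] bytes) {
        try {
            ByteBuffer buf = ByteBuffer.wrap(bytes);
            long offset = end;
            long pos = offset;
            while (buf.hasRemaining()) {
                pos += channel.write(buf, pos);
            }
            end = pos;
            // Publish the pointer only after the bytes are written; older records become garbage.
            index.put(className, new Pointer(offset, bytes.length));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the latest JSON written for the class, or null if none is known. */
    String get(String className) {
        Pointer p = index.get(className);
        if (p == null) {
            return null;
        }
        try {
            ByteBuffer buf = ByteBuffer.allocate(p.length);
            long pos = p.offset;
            while (buf.hasRemaining()) {
                int n = channel.read(buf, pos);
                if (n < 0) {
                    throw new IOException("unexpected end of log");
                }
                pos += n;
            }
            return new String(buf.array(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Drains pending writes (waiting up to 10 seconds), then closes the file. */
    @Override
    public void close() {
        writer.shutdown();
        try {
            writer.awaitTermination(10, TimeUnit.SECONDS);
            channel.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

On a full build you would call put() for each analyzed class and only
wait on the returned futures at the end; for the frequent single-file
case the caller does not need to wait at all. A real implementation
would also need compaction of superseded records and a way to rebuild
the index from the log, which is the kind of thing a journal library
handles for you.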

Hope that helps.
Cheers,

Sergio B.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Marcel Bruch

Sep 26, 2011, 5:10:01 AM

to terrastore-...@googlegroups.com

Thanks, that's what I thought. However, we'll consider Terrastore for the server-side backend :)

Best,
Marcel
