Thank you for taking Terrastore into consideration.
Unfortunately, I think Terrastore, like any other distributed NoSQL
database, wouldn't be a good fit for your use case, which seems to
require an embedded database.
Also, I think even a fast embedded database would hardly improve your
performance, since you're IO-bound: whether you store things as
plain files or database entries, IO will always be your bottleneck.
So, I'd suggest instead one (or a combination) of the following approaches:
1) Employ asynchronous writes: that is, do not dump files to the
filesystem immediately and one by one, but enqueue them and continue
processing. This weakens durability, which may not be a problem for
you since code analysis could be redone in case of failure; if you
can't tolerate data loss, implement a simple redo log.
2) Try saving file diffs, rather than the whole file every time it gets updated.
3) Implement the workspace as a log-structured storage system: that
is, always write your data sequentially on disk and keep a kind of
"file pointer" to each data entry. IO on rotating disks is dominated
by seek time, so writing everything sequentially with no seeks will
probably improve performance a lot; you can take a look at
HawtJournal for that (https://github.com/sbtourist/hawtjournal).
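To make point 3 a bit more concrete, here is a minimal Java sketch of
the idea. It is not HawtJournal's API: the class name AppendOnlyStore
and its put/get methods are hypothetical, made up for illustration
only. Every write is appended at the end of a single log file, and an
in-memory map keeps the "file pointer" (offset and length) of the
latest version of each entry.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical append-only store, for illustration only (not HawtJournal's API).
public class AppendOnlyStore implements AutoCloseable {
    private final RandomAccessFile log;
    // key -> {payload offset, payload length}: the "file pointer" for each entry
    private final Map<String, long[]> index = new HashMap<>();

    public AppendOnlyStore(Path file) throws IOException {
        this.log = new RandomAccessFile(file.toFile(), "rw");
    }

    // Always append at the end of the log; older versions of the same key
    // stay in the file as garbage and are simply no longer referenced.
    public synchronized void put(String key, byte[] data) throws IOException {
        long offset = log.length();
        log.seek(offset);
        log.writeInt(data.length); // 4-byte length prefix
        log.write(data);           // payload
        index.put(key, new long[] {offset + Integer.BYTES, data.length});
    }

    // Follow the pointer to read back the latest version, or null if unknown.
    public synchronized byte[] get(String key) throws IOException {
        long[] pointer = index.get(key);
        if (pointer == null) {
            return null;
        }
        byte[] data = new byte[(int) pointer[1]];
        log.seek(pointer[0]);
        log.readFully(data);
        return data;
    }

    @Override
    public synchronized void close() throws IOException {
        log.close();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("workspace", ".log");
        try (AppendOnlyStore store = new AppendOnlyStore(file)) {
            store.put("Foo.java", "class Foo {}".getBytes(StandardCharsets.UTF_8));
            store.put("Foo.java", "class Foo { int x; }".getBytes(StandardCharsets.UTF_8));
            // Prints the latest version: class Foo { int x; }
            System.out.println(new String(store.get("Foo.java"), StandardCharsets.UTF_8));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

This sketch leaves out everything a real journal needs: the index
lives only in memory, so it would have to be rebuilt by scanning the
log after a restart, and stale versions are never compacted away. It
would also combine naturally with point 1 if put() were called from a
single background thread draining a queue.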
Hope that helps.
Cheers,
Sergio B.
--
Sergio Bossa
http://www.linkedin.com/in/sergiob