On Mon, Jun 4, 2012 at 9:04 AM, Tero Paananen <teropaana...@gmail.com> wrote:
> https://github.com/gorbachev/neo4j-batchinserterprovider-contrib
>
> Michael Hunger asked me to open source the BatchInserterIndexProvider
> implementations I experimented with while implementing the batch import
> process for the application I've been working on.
>
> Our graph data model has quite a few unique nodes, so the data import
> process had to do a LOT of index lookups. The Lucene-based index provider
> was slow, and performance started degrading once the indexes grew larger.
> This is probably mostly due to an undersized server (I had too little
> memory on the server).
>
> As a replacement I tried Redis and Memcached first. While they were
> extremely quick, they failed because I simply didn't have a server that
> could hold the entire index in memory, as required by Redis and Memcached.
> YMMV.
>
> The MongoDB BatchInserterIndexProvider, however, gave me good, constant
> performance. It wasn't as fast as Redis/Memcached, but it didn't degrade
> the way the Lucene-based one did.
>
> So I was using these to speed up the lookups for unique nodes during the
> batch import. I'm still using Lucene indexes with Neo4j.
>
> In the batch import process I was essentially adding properties to two
> indexes, one using Mongo and one using Lucene. I was then doing lookups
> using only the Mongo index:
>
> BatchInserter inserter = new BatchInserterImpl("/data/graph.db");
> BatchInserterIndexProvider indexProvider =
>     new LuceneBatchInserterIndexProvider(inserter);
> BatchInserterIndexProvider lookupIndexProvider =
>     new MongoBatchInserterIndexProvider();
>
> BatchInserterIndex nodeIndex = indexProvider.nodeIndex(...);
> BatchInserterIndex lookupIndex = lookupIndexProvider.nodeIndex(...);
>
> // Look up only in the Mongo index; write to both indexes.
> if (!lookupIndex.get("property", "value").hasNext()) {
>     long node = inserter.createNode(...);
>     lookupIndex.add(node, ...);
>     nodeIndex.add(node, ...);
> } else {
>     // ... node already exists: update or ignore
> }
>
> Given the amount of data we had (90M nodes, 240M relationships, and
> growing), the time savings from faster index lookups were definitely
> worth it.
>
> I don't have benchmarking numbers, because they would depend heavily on
> your particular use case. For my circumstances, the MongoDB-based index
> was a good solution.
>
> If you have any questions, let me know. There are basic unit tests
> included in the project, but it is entirely possible there are bugs left
> in the code that I didn't cover during my use of these classes.
>
> Please feel free to fork the GitHub project and make whatever changes you
> need.
>
> -TPP
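
For anyone who wants to try the pattern without a Neo4j setup, here is a minimal sketch of the lookup-then-create logic quoted above. It is only an illustration under assumptions: plain in-memory maps stand in for the Mongo lookup index and the Lucene index, a counter stands in for inserter.createNode(...), and the names UniqueNodeImport, getOrCreate and nodeCount are made up here. None of them come from Neo4j or the contrib project.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical stand-alone sketch of "look up in the fast index,
 * create the node only on a miss, then write to both indexes".
 */
public class UniqueNodeImport {
    // Stand-in for the fast lookup index used only during the import.
    private final Map<List<String>, Long> lookupIndex = new HashMap<>();
    // Stand-in for the Lucene index that stays with the database.
    private final Map<List<String>, Long> nodeIndex = new HashMap<>();
    // Stand-in for the ids a batch inserter would hand out.
    private long nextNodeId = 0;

    /** Returns the node id for this key/value pair, creating the node on first sight. */
    public long getOrCreate(String key, String value) {
        List<String> entry = List.of(key, value);
        Long existing = lookupIndex.get(entry);
        if (existing != null) {
            return existing; // node already exists: update or ignore
        }
        long node = nextNodeId++; // stand-in for inserter.createNode(...)
        lookupIndex.put(entry, node);
        nodeIndex.put(entry, node);
        return node;
    }

    /** Number of nodes created so far. */
    public long nodeCount() {
        return nextNodeId;
    }

    public static void main(String[] args) {
        UniqueNodeImport importer = new UniqueNodeImport();
        long alice = importer.getOrCreate("name", "alice");
        long bob = importer.getOrCreate("name", "bob");
        long aliceAgain = importer.getOrCreate("name", "alice");
        // prints "0 1 0 nodes=2"
        System.out.println(alice + " " + bob + " " + aliceAgain
                + " nodes=" + importer.nodeCount());
    }
}
```

Only lookupIndex is ever read, which mirrors the setup in the quoted message: the second index is write-only during the import and exists so the finished graph still has its regular index.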