BatchInserterIndexProviders for Memcached, Redis and MongoDB

Tero Paananen

Jun 4, 2012, 12:04:07 PM
to Neo4j
https://github.com/gorbachev/neo4j-batchinserterprovider-contrib

Michael Hunger asked me to open source the BatchInserterIndexProvider
implementations I experimented with while implementing the batch import
process for the application I've been working on.

Our graph data model has quite a few unique nodes, so the data import
process had to do a LOT of index lookups. The Lucene-based index provider
was slow, and performance started degrading once the indexes grew larger.
This was probably mostly due to an undersized server (it had too little
memory).

As a replacement I tried Redis and Memcached first. While they were
extremely quick, they failed because I simply didn't have a server that
could hold the entire index in memory, as Redis and Memcached require.
YMMV.

The MongoDB BatchInserterIndexProvider, however, gave me good, constant
performance. It wasn't as fast as Redis/Memcached, but it didn't degrade
the way the Lucene-based one did.

So I was using these to speed up the lookups for unique nodes during the
batch import. I'm still using Lucene indexes with Neo4j.

In the batch import process I was essentially adding properties to two
indexes, one using Mongo and one using Lucene. I was then doing lookups
using only the Mongo index:

BatchInserter inserter = new BatchInserterImpl("/data/graph.db");

// Lucene index: the one Neo4j keeps using after the import.
BatchInserterIndexProvider indexProvider =
    new LuceneBatchInserterIndexProvider(inserter);
// Mongo index: only used for uniqueness lookups during the import.
BatchInserterIndexProvider lookupIndexProvider =
    new MongoBatchInserterIndexProvider();

BatchInserterIndex nodeIndex = indexProvider.nodeIndex(...);
BatchInserterIndex lookupIndex = lookupIndexProvider.nodeIndex(...);

if (!lookupIndex.get("property", "value").hasNext()) {
    // Not seen before: create the node and add it to both indexes.
    long node = inserter.createNode(...);
    lookupIndex.add(node, ...);
    nodeIndex.add(node, ...);
} else {
    // ... node already exists ... update or ignore
}
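
For anyone who wants to play with the pattern above without a Neo4j or
MongoDB install, here is a minimal, self-contained sketch. It is not the
code from the contrib project: InMemoryIndex, UniqueNodeImporter and
getOrCreate are hypothetical names made up for illustration, plain
HashMaps stand in for both indexes, and a counter stands in for
inserter.createNode(...).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a BatchInserterIndex. A HashMap replaces
// MongoDB/Lucene so the sketch runs without any external service.
final class InMemoryIndex {
    private final Map<String, Long> entries = new HashMap<>();

    // Returns the node id stored for key/value, or null if there is none.
    Long get(String key, String value) {
        return entries.get(key + '\0' + value);
    }

    void add(long nodeId, String key, String value) {
        entries.put(key + '\0' + value, nodeId);
    }

    int size() {
        return entries.size();
    }
}

// Hypothetical helper showing the "look up in the fast index, write to
// both indexes" flow for unique nodes.
final class UniqueNodeImporter {
    // Plays the role of the fast lookup index (MongoDB in the post).
    private final InMemoryIndex lookupIndex = new InMemoryIndex();
    // Plays the role of the Lucene index that Neo4j uses afterwards.
    private final InMemoryIndex nodeIndex = new InMemoryIndex();
    // Assumption: a counter stands in for ids from inserter.createNode(...).
    private long nextNodeId = 0;

    long getOrCreate(String key, String value) {
        Long existing = lookupIndex.get(key, value);
        if (existing != null) {
            return existing; // node already exists: reuse its id
        }
        long node = nextNodeId++;
        lookupIndex.add(node, key, value);
        nodeIndex.add(node, key, value);
        return node;
    }

    long nodesCreated() {
        return nextNodeId;
    }

    int nodeIndexSize() {
        return nodeIndex.size();
    }
}
```

Calling getOrCreate("name", "alice") twice returns the same id and creates
only one node, while every new node ends up in both indexes.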

Given the amount of data we had (90M nodes, 240M relationships, and
growing), the time savings with faster index lookups were definitely
worth it.

I don't have benchmarking numbers, because they would depend heavily on
your particular use case. For my circumstances, the MongoDB-based index
was a good solution.

If you have any questions, let me know. There are basic unit tests
included in the project, but it is entirely possible there are bugs left
in the code that I didn't cover during my use of these classes.

Please feel free to fork the GitHub project and make whatever changes you
need.

-TPP

Peter Neubauer

Jun 4, 2012, 7:03:25 PM
to ne...@googlegroups.com
That is very cool, Tero,
thank you for that contribution! I would love to take the next lab day
and try out the MongoDB or Memcached index for inserting spatial data.
That will be fun!

Cheers,

/peter neubauer

G:  neubauer.peter
S:  peter.neubauer
P:  +46 704 106975
L:   http://www.linkedin.com/in/neubauer
T:   @peterneubauer

If you can write, you can code - @coderdojomalmo
If you can sketch, you can use a graph database - @neo4j