Hello all,
I'm new to Neo4j and Tinkerpop (and Java, for that matter), but I'm
working on a project to pull together music information from around
the web, and I'm testing various backends to use as a triplestore.
As a rough benchmark, I'm trying to upload all of DBpedia (starting
with the smaller "ontology-reconciled" version) to see what kind of
performance I can get. I've heard Neo4j + Tinkerpop recommended for
its ability to handle several billion triples, and I actually did
manage to upload the whole of DBpedia using this blog post
(http://blog.acaro.org/entry/dbpedia4neo) and its underlying code as a base.
So I'm pretty sure that Neo4j is easily capable of handling a dataset
of this size.
However, that post is a bit stale and it uses an older version of
Tinkerpop which produces stores that are incompatible with the current
Neo4j standalone server (which I'd like to use to help explore the
dataset during development). So I upgraded to the latest version of
Tinkerpop, but the blog post uses the old TransactionalGraphHelper and
CommitManager functionality which seem to have been deprecated in
favor of Neo4jBatchGraph.
I rejiggered the code to use the BatchGraph (and it does go faster for
reasonable numbers of triples), but I'm now noticing that my uploader
consistently stalls out at around 1bln (out of ~1.5bln) triples
uploaded. It doesn't throw an error; processor utilization just drops
to 65% (down from 100%) and the loader makes no further progress (I've
left it alone for many hours to confirm that it's not just being slow).
If I then try to terminate the process, I'm told "termination failed".
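
Since the stall is silent, one way to see where the loader is stuck is
to dump the stack of every thread once the triple counter stops moving.
Below is a minimal sketch using only the JDK's management API. The
class name StallWatchdog and its methods are made up for illustration
and are not part of Neo4j or Tinkerpop; the loader loop would have to
call recordProgress() itself.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper (not a Neo4j/Tinkerpop class): tracks loader
// progress and can produce a stack dump of all live threads.
class StallWatchdog {
    private final AtomicLong triplesLoaded = new AtomicLong();
    private long lastSeenCount = -1;

    /** Call from the loader loop after each triple or batch. */
    void recordProgress(long n) {
        triplesLoaded.addAndGet(n);
    }

    /** True if the counter has not moved since the previous check. */
    synchronized boolean stalledSinceLastCheck() {
        long now = triplesLoaded.get();
        boolean stalled = (now == lastSeenCount);
        lastSeenCount = now;
        return stalled;
    }

    /** Stack traces of all live threads, plus a count of deadlocked ones. */
    static String dumpThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Both calls return null when no deadlock is detected.
        long[] deadlocked = mx.isSynchronizerUsageSupported()
                ? mx.findDeadlockedThreads()
                : mx.findMonitorDeadlockedThreads();
        StringBuilder sb = new StringBuilder();
        sb.append("deadlocked threads: ")
          .append(deadlocked == null ? 0 : deadlocked.length)
          .append('\n');
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }
}
```

A background thread could call stalledSinceLastCheck() every few
minutes and print dumpThreads() when it returns true. Running the
JDK's jstack tool against the stalled process ID should give similar
information without any code changes.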
I've tried a number of workarounds, such as splitting the upload file
into many files, but whether I try it as 1, 30, or 70 files, the
process stalls out after around 1bln triples. I've also tried closing
the database and reopening it after every x triples, but as it's
implemented, the Neo4jBatchGraph doesn't seem to support being closed
and reopened. When one reopens an existing one, it seems to try to
remove some statements (I can't quite figure out which ones), which
leads it to try to do an outward traversal from some node (obviously
I'm still trying to figure out exactly how this all works...), and
throws an UnsupportedOperationException.
Whether I try closing and reopening the connection, the SAIL, the
GraphSail, or the Neo4jBatchGraph object, I get various errors (which
I've now spent several hours trying unsuccessfully to understand, but
I can detail what I've learned so far if helpful).
So...has anyone else had this problem, and does anyone have an idea
for a workaround? Or another good solution for uploading 1bln+ triples?
As I said, I'm a n00b so I'm probably just doing something stupid. You
can see my code here:
(https://github.com/rogueleaderr/dbpedia_project/blob/master/src/main/java/com/hypejet/dbpediaproject/inserter/DBpediaLoader.java)
But at this point, my options seem to be:
1) Re-implement the TransactionalGraphHelper/CommitManager
2) Continue trying to wrap my head around the deep guts of the
Neo4jBatchGraph object so that I can tweak it to either a) let itself
be reopened or b) not stall in the first place (though it's really
hard to diagnose why it's stalling at all since there is no error
message and it takes 3+ hours of continuous uploading to even get to
the stall point).
3) Roll back to the older version of Tinkerpop, and accept that my DB
is going to be incompatible with any new releases.
(1) and (2) are a significant stretch given my minimal skill level and
(3) just sounds likely to cause bigger problems down the road. So any
help / advice would be much appreciated!
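
For what it's worth, my understanding of option (1) is that the core
of it is just a counter that commits every N operations. Here is a
rough sketch of that idea under that assumption. PeriodicCommitter is
a name I made up, and the commit action is a placeholder Runnable: I'm
not sure which commit call the current Tinkerpop API expects, so the
sketch deliberately doesn't name one.

```java
// Hypothetical helper (not the original CommitManager): runs a
// caller-supplied commit action once every batchSize operations.
class PeriodicCommitter {
    private final long batchSize;
    private final Runnable commitAction; // placeholder for the real commit call
    private long pending = 0;
    private long commits = 0;

    PeriodicCommitter(long batchSize, Runnable commitAction) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.commitAction = commitAction;
    }

    /** Call once after each triple is added. */
    void operationDone() {
        pending++;
        if (pending >= batchSize) {
            flush();
        }
    }

    /** Commits any uncommitted work; call once more when loading ends. */
    void flush() {
        if (pending == 0) {
            return; // nothing to commit
        }
        commitAction.run();
        pending = 0;
        commits++;
    }

    long commitCount() {
        return commits;
    }
}
```

The loader would call operationDone() after each statement and flush()
once at the very end so the final partial batch isn't lost.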
Thanks,
George
@rogueleaderr