problems uploading 1bln+ triples

George London

Dec 29, 2011, 4:22:55 PM
to Gremlin-users
Hello all,

I'm new to Neo4j and Tinkerpop (and Java, for that matter), but I'm
working on a project to pull together music information from around
the web, and I'm testing various backends to use as triplestore.

As a rough benchmark, I'm trying to upload all of DBpedia (starting
with the smaller "ontology-reconciled" version) to see what kind of
performance I can get. I've heard Neo4j + Tinkerpop recommended for
its ability to handle several billion triples, and I actually did
manage to upload the whole of DBpedia using this blog post
(http://blog.acaro.org/entry/dbpedia4neo) and its underlying code as a base.
So I'm pretty sure that Neo4j is easily capable of handling a dataset
of this size.

However, that post is a bit stale and it uses an older version of
Tinkerpop which produces stores that are incompatible with the current
Neo4j standalone server (which I'd like to use to help explore the
dataset during development). So I upgraded to the latest version of
Tinkerpop, but the blog post uses the old TransactionalGraphHelper and
CommitManager functionality which seem to have been deprecated in
favor of Neo4jBatchGraph.

I rejiggered the code to use the BatchGraph (and it does go faster for
reasonable numbers of triples), but I'm now noticing that my uploader
consistently stalls out at around 1bln (out of ~1.5bln) triples
uploaded. It doesn't throw an error... processor utilization just drops
to 65% (down from 100%) and it simply does nothing (I've left it alone
for many hours to check that it's not just being slow). It then tells
me "termination failed" if I try to terminate the process.

I've tried a number of workarounds, such as splitting the upload file
into many files, but whether I try it as a 1 or 30 or 70 files, the
process stalls out after around 1bln triples. I've also tried closing
the database and reopening after x # of triples, but as it's
implemented, the Neo4jBatchGraph structure doesn't seem possible to
close and reopen. When one reopens an existing one, it seems to try to
remove some statements (I can't quite figure out which ones), which
leads it to try to do an outward traversal from some node (obviously
I'm still trying to figure out exactly how this all works...), and
throws an UnsupportedOperationException.

Whether I try closing and reopening the connection, the SAIL, the
GraphSail, or the Neo4jBatchGraph object, I get various errors (which
I've now spent several hours trying unsuccessfully to understand, but
I can detail what I've learned so far if helpful).

So...has anyone else had this problem and have an idea for a
workaround? Or another good solution for uploading 1bln+ triples?

As I said, I'm a n00b so I'm probably just doing something stupid. You
can see my code here:
https://github.com/rogueleaderr/dbpedia_project/blob/master/src/main/java/com/hypejet/dbpediaproject/inserter/DBpediaLoader.java


But at this point, my options seem to be:

1) Re-implement the TransactionalGraphHelper/CommitManager

2) Continue trying to wrap my head around the deep guts of the
Neo4jBatchGraph object so that I can tweak it to either a) let itself
be reopened or b) not stall in the first place (though it's really
hard to diagnose why it's stalling at all since there is no error
message and it takes 3+ hours of continuous uploading to even get to
the stall point).

3) Roll back to the older version of Tinkerpop, and accept that my DB
is going to be incompatible with any new releases.

(1) or (2) are a significant stretch given my minimal skill level and
(3) just sounds likely to cause bigger problems down the road. So any
help / advice would be much appreciated!
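For reference, option (1) can be quite small if all that is needed is "commit every N writes". The sketch below is only an illustration of that idea, not the old TransactionalGraphHelper/CommitManager code: the class name PeriodicCommitter and its methods are hypothetical, and the commit action is passed in as a plain Runnable, so it makes no assumption about which Blueprints or Neo4j call actually performs the commit.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper (not part of Blueprints or Neo4j): counts writes
 * and runs a caller-supplied commit action every `batchSize` writes.
 */
class PeriodicCommitter {
    private final int batchSize;
    private final Runnable commitAction;
    private long pending = 0;   // writes since the last commit
    private long commits = 0;   // number of commits performed so far

    PeriodicCommitter(int batchSize, Runnable commitAction) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.commitAction = commitAction;
    }

    /** Call once after each triple has been written. */
    void recordWrite() {
        pending++;
        if (pending >= batchSize) {
            flush();
        }
    }

    /** Commits any pending writes; call once more when the load ends. */
    void flush() {
        if (pending == 0) {
            return; // nothing to commit
        }
        commitAction.run();
        commits++;
        pending = 0;
    }

    long commitCount() {
        return commits;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        // Placeholder commit action; a real loader would commit its
        // transaction or connection here (which call to use is an assumption).
        PeriodicCommitter committer = new PeriodicCommitter(1000, () -> log.add("commit"));
        for (int i = 0; i < 2500; i++) {
            // a real loader would add one triple to the store here
            committer.recordWrite();
        }
        committer.flush(); // commit the final partial batch
        System.out.println(committer.commitCount() + " commits");
    }
}
```

Keeping the commit action as a callback means the same counter could wrap whichever layer turns out to be safe to commit on (the connection, the Sail, or the graph), which is exactly the part that is unclear above.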

Thanks,

George
@rogueleaderr

Peter Neubauer

Dec 30, 2011, 2:33:19 AM
to gremli...@googlegroups.com

George,
I have a newer fork of the dbpedia project on github under peterneubauer. Otherwise, I think I remember the batch graph not being fully compliant to be used under the sail stack, but please contact me off list so we can track this down.

Cheers

/peter

Sent from a device with crappy keyboard and autocorrect

Marcin Cieślik

Jan 1, 2012, 11:57:45 AM
to gremli...@googlegroups.com
Hi,

I have also tried uploading large RDF data sets to Neo4j via Gremlin / Blueprints, but without success. I tried the simplest solution, i.e. a SailGraph(GraphSail(Neo4jGraph)) graph (with graph.setMaxBufferSize(100000)) and its loadRDF method. On a low-memory (4 GB) laptop I get semi-random exceptions deep within Neo4j; on a large high-memory server I get no exceptions, but the process grinds (see below: it does not stop completely, it just gets progressively slower). I get constant 100% CPU usage and almost no IO.

Surprisingly, I was previously able to load the same dataset manually by parsing RDF/XML and adding triples via 'addEdge' (in that case the process was very slow at the beginning and then accelerated; also, the Lucene indices blew up to ~10x the size of the data). I am also able to load the data quickly (limited by the IO) with a NativeStoreSailGraph (watch out for the no-commit-on-shutdown bug). In my case the OpenRDF NativeStore is good enough, and this is what I have settled on for now.

Yours,
Marcin


 PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
9934 marcin    20   0 9790m 2.1g 518m S 103.0  1.7 238:43.25 java

These are ~33 MB chunks of RDF/XML; the times are in seconds. When using the NativeStoreSailGraph, each chunk loads in ~3-4 s.

string9-00.xml   2516.991
string9-01.xml   6350.653
string9-02.xml  10349.973
string9-03.xml
string9-04.xml   7383.232
string9-05.xml  13416.607
string9-06.xml  26746.148
string9-07.xml  30810.077
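Per-chunk timings like the ones above can be collected with a small harness around whatever loading call is being tested. The following is a minimal sketch under stated assumptions: the class ChunkTimer and its method timeChunks are hypothetical names, and the loader is passed in as a callback, so the sketch does not assume any particular Blueprints or OpenRDF method signature.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical timing harness; not part of any library mentioned here. */
class ChunkTimer {

    /**
     * Runs `loader` once per chunk name and returns the elapsed wall-clock
     * seconds for each chunk, in the order the chunks were given.
     */
    static Map<String, Double> timeChunks(List<String> chunks, Consumer<String> loader) {
        Map<String, Double> seconds = new LinkedHashMap<>();
        for (String chunk : chunks) {
            long start = System.nanoTime();
            loader.accept(chunk);
            seconds.put(chunk, (System.nanoTime() - start) / 1e9);
        }
        return seconds;
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("string9-00.xml", "string9-01.xml");
        // Placeholder loader that does nothing; a real one would open the
        // file and pass it to the store's RDF loading method.
        Map<String, Double> result = timeChunks(chunks, name -> { });
        result.forEach((name, secs) -> System.out.printf("%s %.3f%n", name, secs));
    }
}
```

Running the same harness against two different stores with identical chunks would give directly comparable numbers, which may help narrow down where the time goes.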

Peter Neubauer

Jan 1, 2012, 12:19:39 PM
to gremli...@googlegroups.com

Mmh,
Sounds like the RDF layers need to be profiled. Has anyone got time to track down where the time is spent?

Cheers

/peter

Sent from a device with crappy keyboard and autocorrect
