DBpedia ingest taking 15+ days

Ruben Verborgh

Nov 18, 2013, 7:31:35 AM
to cumulus...@googlegroups.com
Hi CumulusRDF developers,

I wanted to give CumulusRDF a try so I started ingesting DBpedia.
My configuration might not be ideal, but it was just a test: one server with Cassandra (nothing else on it).

However, the ingest process had already been running for more than 15 days (1 process @ 100% CPU time),
so I assume that I'm doing something wrong.

Any insights into what might have happened?
And how long should a DBpedia ingest normally take?
(Please don't say "16 days" because I killed the process :-)

Best,

Ruben

Andreas Wagner

Nov 18, 2013, 5:38:39 PM
to cumulus...@googlegroups.com
Hi Ruben

15 days is way too long ;) Could you clarify a couple of things:

* which version are you using?
* what is your current config?
* do you have any logs and/or errors in your catalina.out?

In general, I'd recommend using the Loader-CLI:

java -Xmx ... -cp ... edu.kit.aifb.cumulus.cli.Load

Keep the number of loading threads in mind: something like
2-3 parallel threads on your VM should be fine (if you don't run
anything else critical on it ;)). See also [1].

As a side note: we are currently putting a new version 1.0.0 online. My
student has already done some initial work on the wiki documentation,
Javadocs, Maven repository etc. I'm doing some last fixes as we speak,
and a new stable version should be online this week ;)

HTH
Andreas

[1] http://code.google.com/p/cumulusrdf/wiki/CLI

Ruben Verborgh

Nov 19, 2013, 1:51:35 PM
to cumulus...@googlegroups.com, andreas.jo...@googlemail.com
Hi Andreas,

Thanks for your help!

I was using the latest version I found one month ago, and using the Loader-CLI.
I'll wait for version 1.0.0 and try again; I'll let you know how it goes!

Ruben

Andreas Wagner

Nov 21, 2013, 9:21:12 PM
to Ruben Verborgh, cumulus...@googlegroups.com
Hi Ruben,

I just deployed the new v1.0.0. It contains a lot of improvements
compared to v0.6.0.

Concerning your performance issues:
* Use the Loader CLI [1]
* Increase the threads parameter (default: 1) depending on your #CPUs,
e.g., 2-3
* Increase batch-size parameter (default: 100) depending on your memory,
e.g., 1000 - 10000
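
To illustrate why the threads and batch-size settings above matter, here is a minimal Java sketch. It is not CumulusRDF code: the class BatchLoadSketch, the CountingStore stand-in and all method names are hypothetical, and the "store" only counts requests instead of talking to Cassandra. It shows that grouping triples into larger batches reduces the number of write requests, and that a fixed pool of worker threads can issue those requests in parallel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: not CumulusRDF code. All names here are hypothetical.
class BatchLoadSketch {

    // Stand-in for a triple store; it only counts the requests it receives.
    static final class CountingStore {
        final AtomicInteger writes = new AtomicInteger();
        final AtomicInteger triples = new AtomicInteger();

        void writeBatch(List<String> batch) {
            writes.incrementAndGet();        // one request per batch
            triples.addAndGet(batch.size()); // triples carried by that request
        }
    }

    // Splits the input into consecutive chunks of at most batchSize items.
    static List<List<String>> partition(List<String> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }

    // Writes all batches on a fixed pool of worker threads and
    // returns the number of write requests the store received.
    static int load(List<String> triples, int threads, int batchSize,
                    CountingStore store) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (List<String> batch : partition(triples, batchSize)) {
                pending.add(pool.submit(() -> store.writeBatch(batch)));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for completion and surface any failure
            }
        } finally {
            pool.shutdown();
        }
        return store.writes.get();
    }
}
```

In this sketch, calling load with 1,000 triples and a batch size of 100 issues 10 write requests, while a batch size of 1,000 issues a single one. Whether a real ingest speeds up accordingly depends on the store, the network and available memory, so treat this as a picture of the idea rather than a benchmark.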

Please also watch your catalina.out logs. CumulusRDF will log
performance statistics, i.e., #triples inserted, which may be helpful ...

Andrea Gazzarini

May 25, 2014, 3:55:36 PM
to cumulus...@googlegroups.com
Hi Ruben,
if you're still there ;) can I ask you for some information? Specifically, I'd need:

- OS / RAM / CPU
- Which DBpedia dumps were you trying to load? A link would be great.

I would like to run a similar test using the upcoming version (1.1.0).

Many thanks in advance

Andrea