Performance issues

Gromgull

Aug 25, 2009, 9:14:57 AM
to OpenAnzo
Hi OpenAnzo'ers,

We've been doing some testing of RDF databases lately, and we like all
the features, but we ran into some performance issues. I would appreciate
it a lot if you could comment on the setup below and tell me whether
there are any simple optimisation hacks I could apply.

We use version 3.1 with 64-bit Java, running against a PostgreSQL DB,
and give the server 6 GB of heap.

We loaded a dataset based on yago [1] - this is a 1.2 GB .n3 file (it's
very close to N-Triples, though; no fancy N3 features) containing about
19M triples. I can make the data available on request.

This took somewhere between 3 and 5 days to load :)
I started it Wednesday last week, and it finished at some point between
me going home on Friday and Monday morning. In comparison, Virtuoso loads
the same data in 22 minutes. I expected Open Anzo to be slower, but not
by that much. LeeF helped me out on IRC and confirmed that this is atypical.

I loaded it with the command-line client, i.e.:

./anzo import -g urn:yago -v yago.n3

Now - after it was loaded we tried a range of queries [2], all of them
using SELECT DISTINCT and a LIMIT where we expected thousands of results.
I timed them (results in [3]), again using the command-line client, i.e.:

./anzo query -y [-k] 'select blah'
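
To give a rough idea of the query shape: the actual queries are in the
spreadsheet linked below [2]; the following is only an illustrative
example of the SELECT DISTINCT + LIMIT pattern against the urn:yago
graph, not one of the benchmarked queries, and the predicates are made
up for the example.

# Illustrative query shape only - not taken from the real query set [2]
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?x ?label
FROM <urn:yago>
WHERE {
  ?x rdf:type ?type .
  ?x rdfs:label ?label .
}
LIMIT 1000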

Most of them are reasonably quick, i.e. a few hundred milliseconds.
Running with or without the cache (-k) makes a bit of difference, but
not much. The first run after the server is restarted is often MUCH
slower, but this is not so important.
What is worse is that some queries fall through a hole in the query
optimiser and suddenly take 5-6 seconds, or even 5 minutes.

Now the questions :)

1. Is it unreasonable to expect Open Anzo to deal with 19M triples?
(LeeF already answered "no, it isn't")
2. Are there any server or database configuration optimisations that
would speed this up?
3. Currently we loaded everything into one graph and queried that. How
will query and load times be affected when we have more graphs and query
across all of them, or once we start using revisioning or other fun
features?


Thanks for any help!
Cheers,

- Gunnar Grimnes

[1] http://www.mpi-inf.mpg.de/yago-naga/yago/
[2] Queries: http://spreadsheets.google.com/pub?key=tgutqBcFvLeyKMoQUgaIZeA&single=true&gid=1&output=html
[3] Results: http://spreadsheets.google.com/pub?key=tgutqBcFvLeyKMoQUgaIZeA&single=true&gid=0&output=html

Lee Feigenbaum

Aug 25, 2009, 11:25:01 AM
to open...@googlegroups.com
Hi Gunnar,

My apologies for potentially misleading you on IRC; I seem to have been
somewhat mistaken. It turns out we haven't put a lot of effort into
optimizing bulk load (or query, for that matter) on Postgres. I've
personally used Open Anzo on Postgres with somewhere in the neighborhood
of 10 million triples without too much trouble, but I can't recall
offhand how long it took me to load the DB.

The other supported DBs (Oracle, DB2, and MS SQL Server) should perform
much better. If it's an option for you, you might try out the free
version of DB2: http://www-01.ibm.com/software/data/db2/express/

best,
Lee
--
Lee Feigenbaum
VP Technology & Standards, Cambridge Semantics Inc
l...@cambridgesemantics.com
1-617-553-1060