Operating on large RDF files (> 1GB)


Benedikt Tröster

Nov 26, 2013, 11:09:37 AM
Hello!

I'm trying to extract data from an extremely large RDF file (~9GB). Unsurprisingly, parsing takes ages: I just ran my code on my i7 machine with 6GB RAM for about 50 minutes, and it still hasn't finished parsing.

Here is my code: http://pastebin.com/LHveiKqc

It works on a small portion of the RDF file (extracted sample data) that was given to me. Is there any way I could speed up the process?
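
The pastebin itself isn't archived in this thread; a rough reconstruction of what such code presumably looks like (namespace, paths and the exact property are assumptions; the property name is taken from an example later in the thread):

    from rdflib import Graph, Namespace

    # assumed namespace of the GND data set
    GND = Namespace("http://d-nb.info/standards/elementset/gnd#")

    g = Graph()
    # parsing the full ~9GB RDF/XML dump into memory is the slow part
    g.parse("C:/Users/benedikt/Documents/gnd/gnd.rdf", format="xml")

    for s, o in g.subject_objects(GND.variantNameForThePerson):
        print(("%s ; %s" % (s, o)).encode("utf-8"))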

//edit:

Okay, I got an error now: MemoryError - I guess that means my RAM is not sufficient?

//edit2:
Using sleepycat (g = Graph('sleepycat')) leads to:
Traceback (most recent call last):
  File "C:\Users\benedikt\Dropbox\dynamic\eclipse_workspace\rdf2jrcnames\main.py", line 8, in <module>
    g.parse(file=open("C:/Users/benedikt/Documents/gnd/gnd.rdf"))
  File "C:\Users\benedikt\Dropbox\dynamic\eclipse_workspace\rdf2jrcnames\rdflib\graph.py", line 1026, in parse
    parser.parse(source, self, **args)
  File "C:\Users\benedikt\Dropbox\dynamic\eclipse_workspace\rdf2jrcnames\rdflib\plugins\parsers\rdfxml.py", line 570, in parse
    self._parser.parse(source)
  File "C:\kivy1.6.0\Python\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "C:\kivy1.6.0\Python\lib\xml\sax\xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "C:\kivy1.6.0\Python\lib\xml\sax\expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
  File "C:\kivy1.6.0\Python\lib\xml\sax\expatreader.py", line 360, in start_namespace_decl
    self._cont_handler.startPrefixMapping(prefix, uri)
  File "C:\Users\benedikt\Dropbox\dynamic\eclipse_workspace\rdf2jrcnames\rdflib\plugins\parsers\rdfxml.py", line 122, in startPrefixMapping
    self.store.bind(prefix, namespace or "", override=False)
  File "C:\Users\benedikt\Dropbox\dynamic\eclipse_workspace\rdf2jrcnames\rdflib\graph.py", line 902, in bind
    return self.namespace_manager.bind(
  File "C:\Users\benedikt\Dropbox\dynamic\eclipse_workspace\rdf2jrcnames\rdflib\graph.py", line 323, in _get_namespace_manager
    self.__namespace_manager = NamespaceManager(self)
  File "C:\Users\benedikt\Dropbox\dynamic\eclipse_workspace\rdf2jrcnames\rdflib\namespace.py", line 282, in __init__
    self.bind("xml", u"http://www.w3.org/XML/1998/namespace")
  File "C:\Users\benedikt\Dropbox\dynamic\eclipse_workspace\rdf2jrcnames\rdflib\namespace.py", line 353, in bind
    bound_namespace = self.store.namespace(prefix)
  File "C:\Users\benedikt\Dropbox\dynamic\eclipse_workspace\rdf2jrcnames\rdflib\plugins\sleepycat.py", line 441, in namespace
    ns = self.__namespace.get(prefix, None)
AttributeError: 'Sleepycat' object has no attribute '_Sleepycat__namespace'
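
(For what it's worth: a persistent store like Sleepycat has to be opened before you parse into it, and skipping graph.open() would explain the missing __namespace attribute. A minimal sketch, with a made-up store directory:)

    from rdflib import Graph

    g = Graph("Sleepycat")
    # create/open the on-disk store first; the store path is hypothetical
    g.open("C:/Users/benedikt/gnd-store", create=True)
    g.parse("C:/Users/benedikt/Documents/gnd/gnd.rdf", format="xml")
    g.close()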

Gunnar Aastrand Grimnes

Nov 26, 2013, 12:38:31 PM
to rdfli...@googlegroups.com
Hi Benedikt, 

Python and RDFLib really aren't made for this sort of data processing. Parsing will be the least of your problems: holding the graph in memory will also take much more space than the RDF/XML file takes on disk.

I see a few options: 

* Use a streaming parser/converter like Jena's riot tool to read in the RDF/XML and write out a line-based format like N-Quads/N-Triples. Then you can use grep to find the parts of the file you want, ending up with a smaller RDF file you can process with rdflib (see the sketch after this list).

* Load the file into an on-disk database like Jena TDB, then use Jena Fuseki to set up a SPARQL endpoint and use SPARQLWrapper or RDFLib's SPARQLStore to get out the data you want.

* Use the rdflib streaming parsers (I have a secret git branch on my computer) to stream-process your data ;)
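
A sketch of the first option in Python, after the conversion step (the conversion itself runs outside Python; the property name and file paths are assumptions):

    # assumes the dump was first converted outside Python, e.g. with
    # Jena's riot: riot --output=ntriples gnd.rdf > gnd.nt
    from rdflib import Graph

    wanted = "variantNameForThePerson"  # assumed property of interest

    # crude textual pre-filter, equivalent to grep: N-Triples puts one
    # statement per line, so plain substring matching works
    with open("gnd.nt") as src, open("gnd-subset.nt", "w") as dst:
        for line in src:
            if wanted in line:
                dst.write(line)

    g = Graph()
    g.parse("gnd-subset.nt", format="nt")  # now small enough for rdflib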

Good luck!

- Gunnar






--
http://gromgull.net

Benedikt Tröster

Nov 26, 2013, 4:46:19 PM
Wow, that sounds like a lot of extra work... I just need this for a small task in a university project. Maybe I'll try to split the file into several files or find another solution...

//edit:
About that streaming parser: that piece of software is not publicly available, do I understand correctly?

Osma Suominen

Nov 27, 2013, 3:17:19 AM
to rdfli...@googlegroups.com
Hi Benedikt!

You could also try the librdf bindings for Python. The parser (Raptor, a C library) is at least an order of magnitude faster than the one in rdflib and uses much less memory (though rdflib 4.0 is much better than earlier versions in this respect; it has my set-based in-memory store). I use librdf when I really need speed, though its API is in my opinion not as Pythonic as rdflib's. I use rdflib whenever I can.
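
Raptor can also stream statements instead of building a full model first; a rough sketch with the classic Redland Python bindings (the "import RDF" API; untested, file path assumed):

    import RDF

    parser = RDF.Parser(name="rdfxml")  # Raptor's C parser
    uri = "file:///C:/Users/benedikt/Documents/gnd/gnd.rdf"
    # parse_as_stream yields statements one at a time rather than
    # loading the whole file into memory
    for st in parser.parse_as_stream(uri):
        if "variantNameForThePerson" in str(st.predicate):
            print("%s ; %s" % (st.subject, st.object))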

I've also made a cheat sheet for myself and some coworkers summarizing
the most important operations in both rdflib and librdf, maybe you can
use it too:
http://www.seco.tkk.fi/u/oisuomin/rdflib-librdf-cheat-sheet.odt

However, 9GB sounds really big. I doubt you will be able to load the
whole thing even with librdf. So splitting into smaller parts, maybe as
N-triples, is still a good idea.

-Osma



--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi

Graham Klyne

Nov 27, 2013, 4:41:29 AM
to rdfli...@googlegroups.com
FWIW, I've also noticed similar performance problems parsing RDF: about 28 MB of RDF/XML into an in-memory graph. Memory wasn't a problem, but it did take about 2 minutes to parse and load, IIRC. I think that was on the order of a million triples (I didn't count). I was running on a 64-bit Mac OS system, with similar results on 64-bit Linux.

(If you want to test in-memory performance using similar code, there's a utility here that might help: https://github.com/gklyne/asqc)

#g


Niklas Lindström

Nov 27, 2013, 4:59:17 AM
to rdfli...@googlegroups.com
Just wishfully thinking out loud here...

It certainly sounds like we could do with a stream-based solution (like SAX, or perhaps better, like iterparse in ElementTree [1] and lxml [2]). That would be a great addition to RDFLib, ideally as an optional feature of the existing parsers. I reckon many of the parsers can emit triples as they go already (anything that doesn't consume the whole input into memory before parsing should be able to).

With such a feature, Benedikt's code could look something like:

    for s, p, o in g.iterparse(file=open("Documents/gnd/gnd.rdf")):
        if p == GND.variantNameForThePerson:
            print(("%s ; %s" % (s, o)).encode("utf-8"))

Cheers,
Niklas





Gunnar Aastrand Grimnes

Nov 27, 2013, 5:02:14 AM
to rdfli...@googlegroups.com
As I said, I started work on this ages ago; it's in a branch on my computer at home. I didn't go as far as rewriting the XML parser, but it already uses SAX, so it should be possible to make it truly streaming.

I'll see if I can get it rebased on master, and I'll push it so people can have a look.

- Gunnar



Marc-Antoine Parent

Nov 27, 2013, 8:27:00 AM
to rdfli...@googlegroups.com
Hello!
I also recommend translating the RDF/XML to N-Triples with rapper (the command-line tool from Redland's Raptor library), without even going through Python.
Then you can easily grep the relevant triples before feeding them to rdflib.
Also, would rdflib know how to read a stream of N-Triples? That should be much easier than streaming XML (see the sketch below).
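
A rough sketch of what that could look like with rdflib's low-level N-Triples parser (module path as in rdflib 4.x; the sink class is made up and the whole thing untested):

    from rdflib.plugins.parsers.ntriples import NTriplesParser

    class FilterSink(object):
        # the parser calls triple() once per parsed statement
        def triple(self, s, p, o):
            if "variantNameForThePerson" in str(p):
                print("%s ; %s" % (s, o))

    with open("gnd.nt", "rb") as f:
        NTriplesParser(FilterSink()).parse(f)
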
Cheers,
Marc-Antoine


pekka.v...@gmail.com

Feb 4, 2014, 7:09:03 AM
to rdfli...@googlegroups.com
I ran into slow loading problems with a relatively small data set (20 MiB) and solved them by simply using Python's pickle module to serialize the dataset to disk after the first load.

This dropped the loading time from 60 seconds to only four seconds. I had to migrate to Python 3.2 for the pickle module to work with rdflib Graph objects.
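
A minimal sketch of that approach (file names assumed):

    import os
    import pickle
    from rdflib import Graph

    CACHE = "dataset.pickle"  # hypothetical cache file

    if os.path.exists(CACHE):
        with open(CACHE, "rb") as f:
            g = pickle.load(f)  # fast path after the first run
    else:
        g = Graph()
        g.parse("dataset.rdf", format="xml")  # slow first load
        with open(CACHE, "wb") as f:
            pickle.dump(g, f, protocol=pickle.HIGHEST_PROTOCOL)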

- Pekka Väänänen
