Traceback (most recent call last):
  File "C:\Users\benedikt\Dropbox\dynamic\eclipse_workspace\rdf2jrcnames\main.py", line 8, in <module>
    g.parse(file=open("C:/Users/benedikt/Documents/gnd/gnd.rdf"))
  File "C:\Users\benedikt\Dropbox\dynamic\eclipse_workspace\rdf2jrcnames\rdflib\graph.py", line 1026, in parse
    parser.parse(source, self, **args)
  File "C:\Users\benedikt\Dropbox\dynamic\eclipse_workspace\rdf2jrcnames\rdflib\plugins\parsers\rdfxml.py", line 570, in parse
    self._parser.parse(source)
  File "C:\kivy1.6.0\Python\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "C:\kivy1.6.0\Python\lib\xml\sax\xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "C:\kivy1.6.0\Python\lib\xml\sax\expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
  File "C:\kivy1.6.0\Python\lib\xml\sax\expatreader.py", line 360, in start_namespace_decl
    self._cont_handler.startPrefixMapping(prefix, uri)
  File "C:\Users\benedikt\Dropbox\dynamic\eclipse_workspace\rdf2jrcnames\rdflib\plugins\parsers\rdfxml.py", line 122, in startPrefixMapping
    self.store.bind(prefix, namespace or "", override=False)
  File "C:\Users\benedikt\Dropbox\dynamic\eclipse_workspace\rdf2jrcnames\rdflib\graph.py", line 902, in bind
    return self.namespace_manager.bind(
  File "C:\Users\benedikt\Dropbox\dynamic\eclipse_workspace\rdf2jrcnames\rdflib\graph.py", line 323, in _get_namespace_manager
    self.__namespace_manager = NamespaceManager(self)
  File "C:\Users\benedikt\Dropbox\dynamic\eclipse_workspace\rdf2jrcnames\rdflib\namespace.py", line 282, in __init__
    self.bind("xml", u"http://www.w3.org/XML/1998/namespace")
  File "C:\Users\benedikt\Dropbox\dynamic\eclipse_workspace\rdf2jrcnames\rdflib\namespace.py", line 353, in bind
    bound_namespace = self.store.namespace(prefix)
  File "C:\Users\benedikt\Dropbox\dynamic\eclipse_workspace\rdf2jrcnames\rdflib\plugins\sleepycat.py", line 441, in namespace
    ns = self.__namespace.get(prefix, None)
AttributeError: 'Sleepycat' object has no attribute '_Sleepycat__namespace'
Hello!
I'm trying to extract data from an extremely large RDF file (~9 GB). Unsurprisingly, the parsing takes ages (I just tested my code on my i7 machine with 6 GB of RAM for ~50 minutes and it had not even finished parsing).
Here is my code:
It works on a small portion of the RDF file (extracted sample data) that was given to me. Is there any way I could speed up the process?
--
http://github.com/RDFLib
---
You received this message because you are subscribed to the Google Groups "rdflib-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rdflib-dev+...@googlegroups.com.
To post to this group, send email to rdfli...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/rdflib-dev/15a6d367-308c-4c54-a1ec-ee6ad78cb1c0%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
Hi Benedikt!
You could also try the librdf bindings for Python. The parser (raptor, a C library) is at least an order of magnitude faster than the one in rdflib and uses much less memory (though rdflib 4.0 is much better than earlier versions in this respect, since it has my set-based in-memory store). I use librdf when I really need speed, though its API is in my opinion not as Pythonic as rdflib's. I use rdflib whenever I can.
I've also made a cheat sheet for myself and some coworkers summarizing the most important operations in both rdflib and librdf, maybe you can use it too:
http://www.seco.tkk.fi/u/oisuomin/rdflib-librdf-cheat-sheet.odt
However, 9GB sounds really big. I doubt you will be able to load the whole thing even with librdf. So splitting into smaller parts, maybe as N-triples, is still a good idea.
-Osma
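The splitting idea can be sketched in plain Python once the dump has been converted to N-Triples (for example with a streaming converter). Because N-Triples puts one triple per line, the file can be filtered line by line without ever building a graph. The function names and any URIs passed in are hypothetical, and the substring match is a grep-like shortcut under that assumption, not real N-Triples parsing.

```python
def filter_ntriples(lines, needle):
    """Yield the N-Triples lines that contain `needle` (e.g. a predicate URI).

    This is a plain substring test, like grep: it does not parse the triple,
    so `needle` may match in the subject, predicate or object position.
    """
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and comment lines.
        if not stripped or stripped.startswith("#"):
            continue
        if needle in stripped:
            yield line


def write_subset(src_path, dst_path, needle):
    """Copy matching lines from src_path to dst_path; return how many were kept.

    Only one line is held in memory at a time, so the size of the input
    file does not matter.
    """
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in filter_ntriples(src, needle):
            dst.write(line)
            kept += 1
    return kept
```

The resulting subset file is itself valid N-Triples, so it can then be loaded with rdflib as usual if it is small enough.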
26.11.2013 21:29, Benedikt Tröster wrote:
Wow, that sounds like a lot of extra work... I just need this for a small
task in a university project. Maybe I'll try to split the file into
several files or find another solution...
On Tuesday, November 26, 2013 6:38:31 PM UTC+1, Gunnar Aastrand Grimnes
wrote:
Hi Benedikt,
Python and RDFLib really aren't made for this sort of
data processing - parsing will be the least of your problems;
holding the graph in memory will also take much more space than the
RDF/XML file takes on disk.
I see a few options:
* Use a streaming parser/converter like Jena's riot tool to read in
the RDF/XML and write out a line-based format like N-Quads/N-Triples.
Then you can use grep to find the parts of the file you want, ending
up with a smaller RDF file you can process with rdflib.
* Load the file into an on-disk database like Jena TDB, then use
Jena Fuseki to set up a SPARQL endpoint and use
SPARQLWrapper or the RDFLib SPARQLStore to get out the data you want.
* Use the rdflib streaming parsers I have in a secret git branch on my
computer to stream-process your data ;)
Good luck!
- Gunnar
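To illustrate what stream-processing means here, the following is a minimal sketch using only the standard library's xml.etree.ElementTree.iterparse. It is not the unpublished rdflib streaming parser mentioned above and it is not a real RDF parser: it assumes the dump is a flat rdf:RDF root whose children each carry an rdf:about attribute and have literal-valued property elements, and it ignores rdf:resource, blank nodes, xml:lang and datatypes. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def iter_top_level_nodes(source):
    """Yield (subject_uri, element) for each direct child of the root element.

    `source` is a filename or a binary file object. Each child is dropped
    from memory once the consumer asks for the next one, so memory use
    stays roughly constant regardless of file size.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    depth = 0  # nesting depth below the root
    for event, elem in context:
        if event == "start":
            depth += 1
        else:
            depth -= 1
            if depth == 0:  # a direct child of the root has just been closed
                yield elem.get("{%s}about" % RDF_NS), elem
                root.clear()  # discard the children processed so far


def extract_literals(source, prop_tag):
    """Yield (subject_uri, text) for every `prop_tag` child with text content.

    `prop_tag` uses ElementTree's "{namespace-uri}localname" notation.
    """
    for subject, node in iter_top_level_nodes(source):
        for child in node.findall(prop_tag):
            if child.text and child.text.strip():
                yield subject, child.text.strip()
```

Whether this shortcut is good enough depends on how regular the RDF/XML in the dump is; for anything beyond simple literal extraction, converting to N-Triples first is the safer route.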
On 26 November 2013 16:47, Benedikt Tröster <ben...@gmail.com
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi