big Turtle file open


Hala Gamal

Apr 16, 2013, 1:09:45 PM
to rdfli...@googlegroups.com

I have a big .ttl file which I'm trying to parse using rdflib with this code:
>>> import rdflib
>>> g=rdflib.Graph ()
>>> r=g.parse ('semanticquran.ttl',format='n3')
but I get this error:
Traceback (most recent call last):
  File "<pyshell#11>", line 1, in <module>
    r=g.parse ('semanticquran.ttl',format='n3')
  File "rdflib\graph.py", line 757, in parse
    parser.parse(source, self, **args)
  File "rdflib\plugins\parsers\notation3.py", line 2250, in parse
    p.loadStream(source.getByteStream())
  File "rdflib\plugins\parsers\notation3.py", line 891, in loadStream
    return self.loadBuf(stream.read())   # Not ideal
MemoryError

Any help?

Gunnar Aastrand Grimnes

Apr 20, 2013, 7:49:09 AM
to rdfli...@googlegroups.com
How big is the file?

The basic, but unhelpful advice is "get more memory" :)

I've parsed quite big stuff with RDFLib without any problem - you
normally just need patience.

- Gunnar
> --
> http://github.com/RDFLib
> ---
> You received this message because you are subscribed to the Google Groups
> "rdflib-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to rdflib-dev+...@googlegroups.com.
> To post to this group, send email to rdfli...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/rdflib-dev/-/xPAIsZHIjvYJ.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>



--
http://gromgull.net

Hala Gamal

Apr 20, 2013, 8:49:48 AM
to rdfli...@googlegroups.com
But the problem is that I get errors when I run this command:
r=g.parse ('semanticquran.ttl',format='n3')
but when I take only part of my dataset, it is parsed normally.
So where is the solution?

Graham Higgins

Apr 20, 2013, 7:28:16 PM
to rdfli...@googlegroups.com
On Sat, 2013-04-20 at 05:49 -0700, Hala Gamal wrote:

> so where is the solution?

Gunnar did provide the solution in his response:

> ... "get more memory"


The exception raised is MemoryError ...

> File "rdflib\plugins\parsers\notation3.py", line 891, in loadStream
> return self.loadBuf(stream.read()) # Not ideal
> MemoryError

the Python documentation of the MemoryError exception states: "Raised
when an operation runs out of memory".

http://docs.python.org/2/library/exceptions.html#exceptions.MemoryError
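
To make the failure mode concrete, here is a minimal sketch (standard library only, not rdflib code) of the difference between slurping a stream with read(), as the loadStream line in the traceback does, and consuming it in bounded chunks. The helper names read_all and iter_chunks and the example.org triples are made up for illustration. This does not make the N3 parser stream; it only shows why a single read() needs memory proportional to the file size.

```python
import io


def read_all(stream):
    """Return the whole stream as one string, like a bare stream.read()."""
    return stream.read()


def iter_chunks(stream, chunk_size=64 * 1024):
    """Yield the stream piece by piece, holding one chunk at a time."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Hypothetical test data: 1000 identical N-Triples-style lines.
line = "<http://example.org/s> <http://example.org/p> <http://example.org/o> .\n"
data = line * 1000

whole = read_all(io.StringIO(data))
chunk_sizes = [len(c) for c in iter_chunks(io.StringIO(data), chunk_size=4096)]
largest_chunk = max(chunk_sizes)
total = sum(chunk_sizes)
```

With read_all the full text is resident at once; with iter_chunks the largest piece in memory is capped by chunk_size, while the total amount read is the same.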


Gunnar asked:

> How big is the file?

There's one on GitHub apparently, that's 738MB of Turtle.

https://github.com/kurzum/mlode2012/blob/master/data/semanticquran.ttl.gz


This semanticoverflow answer gives brief specs for one developer's
setup:

http://answers.semanticweb.com/questions/15214/what-kind-of-hardware-do-you-use-for-semantic-work/15245


HTH

--
Graham Higgins

http://bel-epa.com/gjh/

Sergio Fernández

Apr 21, 2013, 1:22:04 PM
to rdfli...@googlegroups.com
On 21 April 2013 01:28, Graham Higgins <gjhi...@gmail.com> wrote:
> This semanticoverflow answer gives brief specs for one developer's
> setup:
>
> http://answers.semanticweb.com/questions/15214/what-kind-of-hardware-do-you-use-for-semantic-work/15245

A bit oversized, isn't it?
 
__      ___ _   _          
\ \    / (_) |_(_)___ _ _  
 \ \/\/ /| | / / / -_) '_|  Sergio Fernández
  \_/\_/ |_|_\_\_\___|_|    http://www.wikier.org/

Gerhard Weis

Apr 21, 2013, 4:44:58 PM
to rdfli...@googlegroups.com

Might it help to use the BDB storage backend and a fair bit of patience?

Osma Suominen

Apr 22, 2013, 6:33:36 AM
to rdfli...@googlegroups.com
Hi!

I've also processed many large files with rdflib. The memory store can
unfortunately use a lot of memory, especially on a 64-bit system.

I once wrote a different memory store implementation, based on Python
sets, that, at least in my tests, uses less than half the memory
compared to the stock rdflib memory store. The code is here (as part
of the Skosify project that makes use of it):

http://code.google.com/p/skosify/source/browse/trunk/setstore.py

At the time I wrote it (more than 2 years ago) it passed all the tests
in rdflib when I put the code in place of the original memory store. I
haven't checked lately. I wrote about it on the rdflib-dev mailing
list back then, but there was little interest:
https://groups.google.com/forum/?fromgroups=#!topic/rdflib-dev/ANeDD3l1LLk

If there's more interest now, I can try to prepare the code for
inclusion in rdflib, either as a replacement or as an alternative for
the current memory backend.

-Osma

Quoting Gunnar Aastrand Grimnes <grom...@gmail.com>:
Osma Suominen
Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi

Gunnar Aastrand Grimnes

Apr 22, 2013, 6:41:16 AM
to rdfli...@googlegroups.com
Hi Osma!

On 22 April 2013 12:33, Osma Suominen <osma.s...@helsinki.fi> wrote:
> If there's more interest now, I can try to prepare the code for inclusion in
> rdflib, either as a replacement or as an alternative for the current memory
> backend.

That would be great!

The current store keeps complete indexes for all combinations of spo
queries, the theory being that we trade higher memory requirements for
faster query speed.

It may be that for most queries the set-intersection in
http://code.google.com/p/skosify/source/browse/trunk/setstore.py#170
is just as fast.

I see you already optimised the "(s,p,o) in graph" where all s,p,o are
bound a bit.

It would be very interesting to see results for
test/store_performance.py for this store.

Even if querying is slower, it would be nice to offer the choice of
speed vs. memory.
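
The trade-off discussed here can be sketched with a toy store. This is a simplified illustration, not the code in setstore.py or rdflib's IOMemory: the class name ToySetStore and its methods are hypothetical, and it has no contexts and no removal. It keeps one term-to-set index per triple position and answers a pattern by intersecting the sets for the bound positions, instead of keeping a dedicated index for every combination of s, p and o.

```python
from collections import defaultdict


class ToySetStore:
    """Toy triple store: one term -> set-of-triples index per position."""

    def __init__(self):
        self.by_s = defaultdict(set)
        self.by_p = defaultdict(set)
        self.by_o = defaultdict(set)
        self.all = set()

    def add(self, triple):
        s, p, o = triple
        self.all.add(triple)
        self.by_s[s].add(triple)
        self.by_p[p].add(triple)
        self.by_o[o].add(triple)

    def triples(self, s=None, p=None, o=None):
        """Return the triples matching the pattern; None is a wildcard."""
        candidates = []
        for term, index in ((s, self.by_s), (p, self.by_p), (o, self.by_o)):
            if term is not None:
                # .get() avoids creating empty entries for unknown terms.
                candidates.append(index.get(term, set()))
        if not candidates:
            return set(self.all)
        # Start from the smallest candidate set to keep intersections cheap.
        candidates.sort(key=len)
        result = set(candidates[0])
        for other in candidates[1:]:
            result &= other
        return result


store = ToySetStore()
store.add(("alice", "knows", "bob"))
store.add(("alice", "knows", "carol"))
store.add(("bob", "knows", "carol"))
alice_knows = store.triples(s="alice", p="knows")
```

A pattern with two bound terms costs one set intersection here, where a fully indexed store would do a direct lookup; that is the speed side of the trade, paid for with fewer index entries per triple.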

Cheers,

- Gunnar



--
http://gromgull.net

Osma Suominen

Apr 22, 2013, 7:10:40 AM
to rdfli...@googlegroups.com
Hi Gunnar!

>> If there's more interest now, I can try to prepare the code for inclusion in
>> rdflib, either as a replacement or as an alternative for the current memory
>> backend.
>
> That would be great!

OK, I'll give it a shot.

> The current store keeps complete indexes for all combinations of spo
> queries, the theory being that we trade higher memory requirements for
> faster query speed.
>
> It may be that for most queries the set-intersection in
> http://code.google.com/p/skosify/source/browse/trunk/setstore.py#170
> is just as fast.
>
> I see you already optimised the "(s,p,o) in graph" where all s,p,o are
> bound a bit.

In my own tests back in the day, if there was any difference at all,
my store was a bit faster (between 5% and 20% AFAIR).

> It would be very interesting to see results for
> test/store_performance.py for this store.

I tried to run it (first with a vanilla clone of rdflib, not yet with
my store code), but I'm not sure the script is in working shape. First
I had to change the line
store = "Memory"
to
store = "IOMemory"

in order to even run the script. Then (when running "python
test/store_performace.py" - is this the right way to run it as it's
written as a unit test?) I get this output:

--cut--
IOMemory
input: 0.000213 random: 0.000171 .
.default
input: 0.000211 random: 0.000164 .
.
----------------------------------------------------------------------
Ran 2 tests in 0.851s

OK
--cut--

It seems to me that the script is fetching and parsing
http://eikeon.com for test data, but AFAICT there's not a lot of
triples to be had there (currently 2 it seems). Maybe as a result, the
timing values are ridiculously low, and they sometimes change a lot in
subsequent runs.

(there's also a typo in the filename of the script)

> Even if querying is slower, it would be nice to offer the choice of
> speed vs. memory.

Right.

Osma Suominen

Apr 22, 2013, 7:31:52 AM
to rdfli...@googlegroups.com

Quoting Osma Suominen <osma.s...@helsinki.fi>:

> Hi Gunnar!
>
>>> If there's more interest now, I can try to prepare the code for
>>> inclusion in
>>> rdflib, either as a replacement or as an alternative for the current memory
>>> backend.
>>
>> That would be great!
>
> OK, I'll give it a shot.

Attached is a patch that replaces the current IOMemory implementation
in rdflib with my set-based one. run_tests.py shows that the same unit
tests passed as before.

I'm willing to run benchmarks (I'm interested in the results too!) if
anyone has ideas how to do it. The store_performace.py [sic!] script
only seems to check the performance of adding triples, and even that
doesn't work very well currently (see my previous message).

I haven't tried this with Python 3.

-Osma
setstore.diff

Gunnar Aastrand Grimnes

Apr 22, 2013, 7:32:42 AM
to rdfli...@googlegroups.com
On 22 April 2013 13:10, Osma Suominen <osma.s...@helsinki.fi> wrote:
>
>> It would be very interesting to see results for
>> test/store_performance.py for this store.
>
>
> I tried to run it (first with a vanilla clone of rdflib, not yet with my
> store code), but I'm not sure the script is in working shape. First I had to
> change the line
> store = "Memory"
> to
> store = "IOMemory"


[...snip .. ]

My bad, I was looking for the scripts Graham used for
http://rdfextras.readthedocs.org/en/latest/store/performance.html

Seems the store_load_and_query_performance.py is no longer to be found.

Graham?

Gunnar Aastrand Grimnes

Apr 22, 2013, 7:33:46 AM
to rdfli...@googlegroups.com
If you clone rdflib on github, apply your patch and file a
pull-request, travis will auto run the tests for all python versions
we support!

- Gunnar



--
http://gromgull.net

Graham Higgins

Apr 22, 2013, 8:16:25 AM
to rdfli...@googlegroups.com
On Mon, 2013-04-22 at 13:32 +0200, Gunnar Aastrand Grimnes wrote:


> My bad, I was looking for the scripts Graham used for
> http://rdfextras.readthedocs.org/en/latest/store/performance.html
>
> Seems the store_load_and_query_performance.py is no longer to be found.

Hurrah for version control:

https://github.com/RDFLib/rdfextras/blob/d8702b44663cd609e2266fb4dbcc609aa5b485b3/test/test_store/test_store_performance.py

Osma Suominen

Apr 22, 2013, 8:30:32 AM
to rdfli...@googlegroups.com

Quoting Gunnar Aastrand Grimnes <grom...@gmail.com>:

> If you clone rdflib on github, apply your patch and file a
> pull-request, travis will auto run the tests for all python versions
> we support!

Wow, that's pretty cool!

I did that, the pull request is here:
https://github.com/RDFLib/rdflib/pull/268

Got the Travis results:
https://travis-ci.org/RDFLib/rdflib/builds/6534276

Good news: my code seems to work also with Python 3.2 and 3.3.

There was a problem with Python 2.5 in the initial version. I fixed
that in a subsequent commit. Now the code passes all tests with all
Python versions (2.5, 2.6, 2.7, 3.2, 3.3).

Osma Suominen

Apr 22, 2013, 8:35:15 AM
to rdfli...@googlegroups.com
Hi Graham!

Quoting Graham Higgins <gjhi...@gmail.com>:
I don't think this is the right one either. It will just test load
times with input files of up to 50k triples. The interesting part
would be testing query times.

-Osma

Osma Suominen

Apr 22, 2013, 9:26:27 AM
to rdfli...@googlegroups.com

Quoting Osma Suominen <osma.s...@helsinki.fi>:

> I don't think this is the right one either. It will just test load
> times with input files of up to 50k triples. The interesting part
> would be testing query times.

I found this one:
http://code.google.com/p/rdfextras/source/browse/test/test_store/store_load_and_query_performance.py

I had to hack it a bit to make it work. I used the 50k triple file
from the sp2b data, downloaded from the same old Google Code repo. The
script I used is attached.

Results of a typical run with original rdflib IOMemory:

--cut--
default
std file input: 1.54
std query: 0.535
----------------------------------------------------------------------
Ran 1 test in 9.902s
--cut--

Results of a typical run with the new set-based IOMemory:

--cut--
default
std file input: 1.18
std query: 0.543
----------------------------------------------------------------------
Ran 1 test in 8.933s
--cut--

Conclusion: load times are about 25% faster while query times are
slightly slower but the difference is pretty small.

I also determined peak memory usage by inserting a raw_input() call
into the tearDown method and checking the RSS value reported by ps for
the python process when it stopped for input. I know that's probably
not the best way, but I was lazy... The memory usage with original
rdflib was 360MB, and 93MB for the new code (verified this 3 times,
always same result). So there was a reduction of almost 75% in memory
usage. The RSS figure probably includes all the Python interpreter
code etc, so with bigger input data the difference is likely even
larger.
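
As an alternative to pausing the process and reading RSS from ps, the peak can be read from inside Python. This is a minimal sketch under stated assumptions: it uses the standard resource module, which is only available on Unix-like systems, and the helper name peak_rss_mb and the payload list are hypothetical stand-ins (the list takes the place of loading a graph). It was not part of the benchmark script used here.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process in MB (best effort)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024.0 * 1024.0 if sys.platform == "darwin" else 1024.0
    return peak / divisor


before = peak_rss_mb()
payload = [str(i) for i in range(200000)]  # stand-in for loading a graph
after = peak_rss_mb()
print("peak RSS before: %.1f MB, after: %.1f MB" % (before, after))
```

Like the ps figure, this includes the interpreter itself, so it is best used to compare two runs of the same script rather than as an absolute size of the graph.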

I'm using Ubuntu 12.04 amd64, Core i5-2400 3.1GHz CPU, 8GB RAM, Python 2.7.3.
store_load_and_query_performance.py

Osma Suominen

Apr 22, 2013, 9:57:36 AM
to rdfli...@googlegroups.com

Quoting Osma Suominen <osma.s...@helsinki.fi>:

> Conclusion: load times are about 25% faster while query times are
> slightly slower but the difference is pretty small.

Did yet another test, this time with the 2k triples sp2b data file and
Q4 from the sp2b benchmark set:
http://dbis.informatik.uni-freiburg.de/index.php?project=SP2B/queries.php#q4

Original rdflib:

std file input: 0.0629
std query: 33.7

Set-based IOMemory:

std file input: 0.0474
std query: 38.6

So this complex query runs about 15% slower with the new memory store.

Still, I'd think that the decreased memory usage would make up for
this in many cases.
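
For reference, the percentages quoted in these messages can be recomputed from the raw figures. The sketch below only redoes the arithmetic on numbers already reported in this thread; the function name pct_change and the dictionary labels are made up for illustration.

```python
def pct_change(old, new):
    """Relative change from old to new in percent; negative is a reduction."""
    return (new - old) / float(old) * 100.0


# (original IOMemory, set-based store), copied from the runs in this thread.
measurements = {
    "50k triples, load time (s)": (1.54, 1.18),
    "50k triples, query time (s)": (0.535, 0.543),
    "50k triples, peak RSS (MB)": (360, 93),
    "2k triples, load time (s)": (0.0629, 0.0474),
    "2k triples, Q4 query time (s)": (33.7, 38.6),
}

changes = {}
for name, (old, new) in measurements.items():
    changes[name] = pct_change(old, new)

for name in sorted(changes):
    print("%-32s %+6.1f%%" % (name, changes[name]))
```

This gives roughly 23% less load time, 1.5% more query time and 74% less memory on the 50k file, and about 14.5% more query time for Q4 on the 2k file, which matches the rounded figures above.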

Any ideas what to do next? My pull request is pending, if you want to
take this into rdflib as a replacement for the current IOMemory. If
not, could it be included as a separate plugin and how?

Gunnar Aastrand Grimnes

Apr 22, 2013, 10:02:44 AM
to rdfli...@googlegroups.com
I would also like to test it with more than 50k triples, which isn't
really that much. I've just made a new repository to collect some
larger data files plus your fixed test script; I'll let you know shortly.

If we include both the old and the new store, I can do it for you, or
you can make another pull request. It's essentially just a matter of
putting the new file next to the old one, making up a plugin name, and
adding it to rdflib/plugin.py



--
http://gromgull.net

Gunnar Aastrand Grimnes

Apr 22, 2013, 12:49:10 PM
to rdfli...@googlegroups.com
I've committed larger files + some new testing code to:

https://github.com/RDFLib/rdflib-benchmark

The tests take a while, I'll let them run for some stores over night!

- Gunnar
--
http://gromgull.net

Gunnar Aastrand Grimnes

Apr 22, 2013, 2:45:30 PM
to rdfli...@googlegroups.com
I quickly ran this for new/old memory store + sleepycat
and the results look good for the new store:

even at 1M triples, query time is almost the same.

See attached.
--
http://gromgull.net
rdflib setmemorystore preliminary.ods

Osma Suominen

Apr 23, 2013, 2:37:03 AM
to rdfli...@googlegroups.com
On 21/04/13 02:28, Graham Higgins wrote:

>> How big is the file?
>
> There's one on GitHub apparently, that's 738MB of Turtle.
> https://github.com/kurzum/mlode2012/blob/master/data/semanticquran.ttl.gz

Getting back on topic to the thread I hijacked earlier:

I was able to load this file (738MB Turtle, 15.7M triples) into the new
set-based memory store on my system (i5-...@3.1GHz, 8GB RAM, Ubuntu
12.04 amd64):

--cut--
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from rdflib import *
>>> g = Graph()
>>> g.parse(open('semanticquran.ttl'), format='n3')
<Graph identifier=N627a515e642a4abeb59fec19f8c82030 (<class
'rdflib.graph.Graph'>)>
>>> len(g)
15741591
>>>
--cut--

Memory usage after loading is about 4.8GB and loading took around 23
minutes, as seen in top:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3885 oisuomin  20   0 4917m 4.8g 1584 S    0 62.0 22:51.55  python


Peak memory usage during loading was over 7GB. My guess is that the N3
parser first reads the whole file into a Unicode string, because memory
usage quickly grew to around 3GB in the first few seconds and then grew
much slower during the following minutes. This memory was released back
to the OS when parsing was complete.

So with the set-based storage, this kind of data set is still possible
to load into memory if you have at least 8GB RAM. With the original
IOMemory store, you'd probably need at least 32GB.
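
As a rough back-of-the-envelope check on these numbers, the per-triple cost can be derived from the figures in the session and the top output. The sketch assumes that "4.8g" means 4.8 GiB, that the file is 738 MiB, and that the TIME+ column (CPU time) is close to the load time; the variable names are made up for illustration.

```python
triples = 15741591                  # len(g) from the session above
resident_bytes = 4.8 * 1024 ** 3    # RES column in top, read as 4.8 GiB
file_bytes = 738.0 * 1024 ** 2      # Turtle file size, read as 738 MiB
cpu_seconds = 22 * 60 + 51.55       # TIME+ column in top (22:51.55)

bytes_per_triple_in_memory = resident_bytes / triples
bytes_per_triple_on_disk = file_bytes / triples
triples_per_second = triples / cpu_seconds

print("in memory: %.0f bytes/triple" % bytes_per_triple_in_memory)
print("on disk:   %.0f bytes/triple" % bytes_per_triple_on_disk)
print("load rate: %.0f triples/s" % triples_per_second)
```

Under those assumptions the set-based store needs a bit over 300 bytes per triple in memory, several times the roughly 50 bytes per triple of the Turtle text, and loading runs at somewhat over 11,000 triples per second.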

-Osma

Osma Suominen

Apr 23, 2013, 2:46:53 AM
to rdfli...@googlegroups.com
Hi Gunnar!

On 22/04/13 21:45, Gunnar Aastrand Grimnes wrote:
> I quickly ran this for new/old memory store + sleepycat
> and the results look good for the new store:
>
> even at 1M triples, query time is almost the same.
>
> See attached.

The benchmark (on github) looks really good! Looks like my code hit a
nerve ;)

I'm glad to hear that query times are good for the new store. Still, I
think this will depend a lot on the query and I'd like to see results
for several different queries.

If there are lots of triples() lookups with exactly 2 bound parts, the
store might be slower than the original IOMemory. Also if you use lots
of contexts with overlapping data that is probably going to be slow, but
I don't think that's typically done with rdflib.

-Osma

Gunnar Aastrand Grimnes

Apr 23, 2013, 3:53:53 AM
to rdfli...@googlegroups.com
I always wanted to optimize rdflib-sparql a bit, but never found the
time, so I had some of the benchmarking code lying around.

Running all queries overnight ran into some problems... by mistake I
still had rdfextras sparql lying around, and this handily used up my
16GB of memory, even for 16000 triples.

So I ran it again with rdflib-sparql, now only for 8000 triples, but
all queries:

http://nbviewer.ipython.org/5441618

Interestingly it seems that even for sleepycat (on a SSD drive
though), the store almost doesn't matter for query execution time.

I wonder if the benchmarking code is broken :)

- Gunnar

Osma Suominen

Apr 23, 2013, 4:00:34 AM
to rdfli...@googlegroups.com
On 23/04/13 10:53, Gunnar Aastrand Grimnes wrote:
> http://nbviewer.ipython.org/5441618
>
> Interestingly it seems that even for sleepycat (on a SSD drive
> though), the store almost doesn't matter for query execution time.
>
> I wonder if the benchmarking code is broken :)

I think so. At least for Q4 I saw a difference of about 15% yesterday:
https://groups.google.com/d/msg/rdflib-dev/EZqiUs7qTUc/e47L4TW2hioJ

I think the problem is here in sp2b.py:

for _ in range(ITERATIONS):
    list(data.query(q))

The 2nd line probably should be list(g.query(q))
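
One way to make this kind of mix-up harder is to pass the graph under test into the timing function explicitly. The following is only a sketch, not the code in sp2b.py: CountingGraph is a hypothetical stand-in for an rdflib Graph that just counts query() calls, and time_queries is a made-up helper name.

```python
import time


class CountingGraph:
    """Hypothetical stand-in for an rdflib Graph; only counts query() calls."""

    def __init__(self):
        self.calls = 0

    def query(self, q):
        self.calls += 1
        return iter([("row",)])


def time_queries(graph, q, iterations):
    """Average seconds per run of graph.query(q) on the graph passed in."""
    start = time.perf_counter()
    for _ in range(iterations):
        list(graph.query(q))  # list() forces the results to be consumed
    return (time.perf_counter() - start) / iterations


g = CountingGraph()     # the store under test
data = CountingGraph()  # some other graph that must not be queried
seconds = time_queries(g, "SELECT * WHERE { ?s ?p ?o }", iterations=5)
```

Because the function only sees the graph it is given, a leftover module-level name such as data cannot be benchmarked by accident.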

-Osma

Gunnar Aastrand Grimnes

Apr 23, 2013, 4:05:11 AM
to rdfli...@googlegroups.com

Well spotted, thanks !


Gunnar Aastrand Grimnes

Apr 23, 2013, 6:00:14 AM
to rdfli...@googlegroups.com
Updated:

http://nbviewer.ipython.org/5441618

Now there is a slight difference, but really not by much. I'll rerun
the varying number of triples thing as well.

> If there are lots of triples() lookups with exactly 2 bound parts, the store might be slower than the original IOMemory. Also if you use lots of contexts with overlapping data that is probably going to be slow, but I don't think that's typically done with rdflib.

This is pretty much the only optimisation the rdflib-sparql engine
does, it will always do the triple patterns with the most bound
terms first, i.e. even for queries with several variables in a
pattern, it is likely they will be filled in before going to the
store.
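
The ordering idea can be sketched as follows. This is a toy illustration of "most bound terms first", not the rdflib-sparql implementation: the function names, the convention that variables are strings starting with "?", and the example patterns are all assumptions made for the sketch.

```python
def is_variable(term):
    """Treat strings starting with '?' as variables (sketch convention)."""
    return isinstance(term, str) and term.startswith("?")


def bound_count(pattern, bound_vars):
    """Number of terms that are constants or already-bound variables."""
    return sum(1 for t in pattern if not is_variable(t) or t in bound_vars)


def order_patterns(patterns):
    """Greedily pick the pattern with the most bound terms, then treat
    its variables as bound when scoring the remaining patterns."""
    remaining = list(patterns)
    bound_vars = set()
    ordered = []
    while remaining:
        best = max(remaining, key=lambda p: bound_count(p, bound_vars))
        remaining.remove(best)
        ordered.append(best)
        bound_vars.update(t for t in best if is_variable(t))
    return ordered


patterns = [
    ("?x", "?p", "?y"),
    ("?x", "rdf:type", "foaf:Person"),
    ("?x", "foaf:name", "?name"),
]
ordered = order_patterns(patterns)
```

Here the rdf:type pattern runs first with two constants, which binds ?x, so the foaf:name pattern then reaches the store with two bound terms and the fully open pattern with one.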

- Gunnar

Gunnar Aastrand Grimnes

Apr 23, 2013, 2:20:13 PM
to rdfli...@googlegroups.com
So for increasing number of triples:

q02:
http://infogr.am/query-time-for-q02sparql/

(https://raw.github.com/RDFLib/rdflib-benchmark/master/sp2b/queries/q02.sparql)

q08:
http://infogr.am/2d00a2aaee6c-4001/

(https://raw.github.com/RDFLib/rdflib-benchmark/master/sp2b/queries/q08.sparql)

So q02 is pretty simple and shows almost no difference. q08 is more
complicated, but Osma's set-based memory store is only a tiny bit
slower.

I guess I'll be happy to replace the current IOMemory store - although
I would do it in the reunification branch (which will become the next
release, 4.0)

- Gunnar
--
http://gromgull.net

Osma Suominen

Apr 24, 2013, 2:31:54 AM
to rdfli...@googlegroups.com
On 23.04.2013 21:20, Gunnar Aastrand Grimnes wrote:
> I guess I'll be happy to replace the current IOMemory store - although
> I would do it in the reunification branch (which will become the next
> release, 4.0)

Awesome! Thanks for taking the time to benchmark everything!


If anyone wants to use the new memory-efficient store in the meantime,
there are two options:

1. install the setstore branch from my rdflib fork on github:
https://github.com/osma/rdflib/tree/setstore

This will install the new store globally.

- or -

2. copy the setstore.py script into your project from Skosify:
http://code.google.com/p/skosify/source/browse/trunk/setstore.py

Then you can just use it like this:

from rdflib import Graph
from setstore import IOMemory
g = Graph(IOMemory())