Memory issues with SPARQL


d.c....@gmail.com

Apr 1, 2013, 7:20:38 PM
to rdfli...@googlegroups.com
Hello all, I've been running into memory problems when executing SPARQL queries against my database. I'm working with a set of about 10,000 triples, and I've noticed that running a SPARQL query sometimes causes the Python interpreter to use quite a bit of memory, which is never released once the query has finished. Moreover, I tried using Heapy to check where this memory is going, and the tool doesn't even report it. Here's a specific example. My rdflib.Graph is named 'al'; I check the heap size, run one of the problematic queries, and then check the heap size again.

>>> from guppy import hpy
>>> hpy().heap()
Partition of a set of 159556 objects. Total size = 37435816 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 76137 48 26520792 71 26520792 71 dict (no owner)
1 37959 24 3213264 9 29734056 79 str
2 18560 12 1555400 4 31289456 84 tuple
3 240 0 660864 2 31950320 85 dict of module
4 5029 3 643712 2 32594032 87 types.CodeType
5 4934 3 592080 2 33186112 89 function
6 686 0 565144 2 33751256 90 unicode
7 574 0 544336 1 34295592 92 dict of type
8 1984 1 529112 1 34824704 93 rdflib.term.URIRef
9 575 0 516240 1 35340944 94 type
<222 more rows. Type e.g. '_.more' to view.>
>>> al.query(query_pop[46])
<rdfextras.sparql.query.SPARQLQueryResult object at 0x2ba0ad0>
>>> hpy().heap()
Partition of a set of 175175 objects. Total size = 41181944 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 76504 44 27129280 66 27129280 66 dict (no owner)
1 43524 25 3769712 9 30898992 75 str
2 21885 12 1841888 4 32740880 80 tuple
3 5902 3 755456 2 33496336 81 types.CodeType
4 432 0 733824 2 34230160 83 dict of pyparsing.Suppress
5 262 0 732304 2 34962464 85 dict of module
6 5849 3 701880 2 35664344 87 function
7 706 0 643120 2 36307464 88 dict of type
8 707 0 636368 2 36943832 90 type
9 696 0 567728 1 37511560 91 unicode
<258 more rows. Type e.g. '_.more' to view.>
>>> len(al)
9135
>>> len(al.query(query_pop[46]))
430
>>>


The first time I checked the memory usage, Heapy reported about 37 MB in use and my system reported about 52 MB in use by Python. The second time, Heapy reported about 41 MB in use and my system reported about 308 MB! Even if I delete all variables, modules, and classes in the global namespace and manually run the garbage collector, this memory isn't released.

Does anyone have any idea why this occurs? Unfortunately I can't share the data I'm using on this project, and I haven't been able to replicate the problem with the small public examples I've tried. But if anyone has any suggestions or directions to investigate, I'd certainly appreciate it. Thank you!
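
A minimal sketch of the inside-versus-outside comparison described above, written for Python 3 and using only the standard library. It is not from the thread: `tracemalloc` stands in for Heapy as the Python-level view, `resource.getrusage` stands in for the system monitor, and the list of strings is a hypothetical stand-in for a query result. Note that `ru_maxrss` is the peak resident set size (kilobytes on Linux, bytes on macOS), so it can only show that the process grew, not that it shrank again.

```python
import resource      # Unix-only
import tracemalloc   # Python 3.4+


def peak_rss():
    """Peak resident set size of this process, as seen by the OS."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


tracemalloc.start()
before_rss = peak_rss()

# Hypothetical stand-in for a large query result.
data = [str(i) * 10 for i in range(200000)]
traced_now, _ = tracemalloc.get_traced_memory()

# Drop the only reference; the Python-level figure falls right away.
del data
traced_after, _ = tracemalloc.get_traced_memory()
after_rss = peak_rss()
tracemalloc.stop()

print("Python-level bytes while held: %d" % traced_now)
print("Python-level bytes after del:  %d" % traced_after)
print("OS-level peak RSS before/after: %d / %d" % (before_rss, after_rss))
```

A gap between the two views after the `del` is the same symptom reported in this message: the interpreter no longer counts the objects, while the OS-level figure has not gone down.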

Gunnar Aastrand Grimnes

Apr 2, 2013, 3:13:40 AM
to rdfli...@googlegroups.com
Hi d.c.grady,

You don't say which version of rdflib you use; I see you are using
rdfextras for SPARQL support, though.

The "new" SPARQL implementation for rdflib is actually in
rdflib-sparql (pip install
https://github.com/RDFLib/rdflib-sparql/archive/master.zip) - I am
hoping to only have to support this one.

Now, I cannot make any guarantees about this particular memory issue:
the new implementation passes all tests, but I've not really
stress-tested it.

It would be great if you could give it a try!

- Gunnar
> --
> You received this message because you are subscribed to the Google Groups "rdflib-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to rdflib-dev+...@googlegroups.com.
> To post to this group, send email to rdfli...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msg/rdflib-dev/-/dOJNlGTVZQEJ.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>



--
http://gromgull.net

d.c....@gmail.com

Apr 2, 2013, 5:38:52 PM
to rdfli...@googlegroups.com
On Tuesday, April 2, 2013 12:13:40 AM UTC-7, Gunnar Aastrand Grimnes wrote:

> You don't say which version of rdflib you use [...]

Sorry for forgetting that; I was running Python 2.7.3, RDFLib 3.4.0, rdfextras 0.4, and Ubuntu 12.10.

> The "new" SPARQL implementation for rdflib is actually in rdflib-sparql

Ah ha, definitely missed that. I removed rdfextras and installed rdflib-sparql 0.1 (from PyPI), and now rdflib.Graph().query is behaving much more nicely. Thank you for pointing that out!

Do you have any idea what was causing the problem in the older SPARQL implementation? I was pretty confused by the fact that the memory profiler couldn't see where all the memory was going.

-Daniel

d.c....@gmail.com

Apr 11, 2013, 12:11:19 PM
to rdfli...@googlegroups.com, d.c....@gmail.com
Hi again. Although switching to rdflib-sparql alleviated the issue I was experiencing to some extent, I eventually realized I was still running into a similar problem, and this time I can actually provide a complete example. The problem is this: when I run a SPARQL query, Python allocates some amount of memory that never gets deallocated later. Here's an example, which I tested under Python 2.7.3, rdflib 3.4.0, rdflib_sparql 0.1, and Ubuntu 12.10.

import rdflib
import random
import string

def random_string(n):
    return ''.join(random.choice(string.letters) for x in xrange(n))

def random_graph():
    g = rdflib.Graph()
    for i in xrange(500):
        g.add((
            rdflib.term.URIRef('http://foo.baz/People#' + random_string(10)),
            rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
            rdflib.term.URIRef('http://foo.baz/Person')
        ))
    return g

if __name__ == '__main__':
    g = random_graph()
    r = g.query("SELECT * WHERE {?x1 ?x2 <http://foo.baz/Person> . ?x3 ?x4 <http://foo.baz/Person> . }")
    r = []

This query asks for all pairs of links in the random graph, so with 500 triples there will be 500 x 500 = 250,000 results. After running this, the Python process is taking up more than 300 MB of memory. The Heapy memory profiler sees only about 20 MB of objects on the heap.

What I would expect to happen in this situation is that a lot of memory would be allocated to run the query, but that once I eliminated any references to the results of the query, all this memory would be released back to the operating system. This is what happens if I build a very long list of random strings, for example.

Is this the expected behavior for rdflib-sparql? If it is, is there any way for me to tell rdflib to release this memory (aside from isolating all the queries in a subprocess)? Or am I completely misunderstanding what's happening here? I'd very much appreciate any comments or suggestions; thank you!

-Daniel
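
The subprocess workaround mentioned above could be sketched roughly as follows, in Python 3 with only the standard library. This is an illustration under assumptions, not something taken from the thread: `CHILD_SCRIPT` and `run_isolated` are hypothetical names, and the cross-product list is a stand-in for the real `g.query(...)` call. The idea is that whatever the child allocates goes back to the OS when the child exits, and only a small summary crosses the pipe.

```python
import json
import subprocess
import sys

# Code run in the child process. The list comprehension is a stand-in for
# an expensive query; only a small JSON summary is written to stdout.
CHILD_SCRIPT = r"""
import json, sys
n = int(sys.argv[1])
pairs = [(a, b) for a in range(n) for b in range(n)]
json.dump({"count": len(pairs)}, sys.stdout)
"""


def run_isolated(n):
    """Run the memory-hungry step in a separate interpreter and return its summary."""
    out = subprocess.check_output([sys.executable, "-c", CHILD_SCRIPT, str(n)])
    return json.loads(out.decode("utf-8"))


summary = run_isolated(200)
print(summary)
```

The trade-off is that results have to be serialized to get back to the parent, so this only helps when the parent needs a reduced form of the result rather than every row.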

Gunnar Aastrand Grimnes

Apr 12, 2013, 2:57:02 AM
to rdfli...@googlegroups.com, d.c....@gmail.com
> Is this the expected behavior for rdflib-sparql?

No :)

But I've not come across it myself (I've mainly had short-lived
processes that run SPARQL queries).

> If it is, is there any way for me to tell rdflib to release this memory (aside from isolating all the queries in a subprocess)? Or am I completely misunderstanding what's happening here? I'd very much appreciate any comments or suggestions; thank you!

I guess there is some temp structure that does not get garbage
collected for some reason - I will investigate!

- Gunnar






--
http://gromgull.net

Gunnar Aastrand Grimnes

Apr 12, 2013, 5:33:03 AM
to rdfli...@googlegroups.com, d.c....@gmail.com
Hmm - actually, I don't think there is a problem here.

Putting a loop around the query and running it over and over again
forever doesn't lead to an infinitely growing Python process. It bobs
up and down a bit, but stays at roughly 300 MB ...

Also, checking for uncollectable reference cycles with the gc module
shows nothing.

I guess even though the Python objects are freed, Python is just slow
at returning the memory to the OS?

Gunnar
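
A check along the lines described above might look like the following sketch (Python 3, standard library only). It is an assumption about the general technique, not the exact check that was run: `Node` and `make_cycle` are hypothetical, and the small reference cycle stands in for whatever temporary structures a query leaves behind.

```python
import gc


class Node(object):
    """Tiny object used to build a reference cycle."""

    def __init__(self):
        self.other = None


def make_cycle():
    # Two objects that refer to each other become unreachable on return.
    a, b = Node(), Node()
    a.other, b.other = b, a


gc.collect()   # start from a clean state
gc.disable()   # keep automatic collection from interfering with the count
make_cycle()
collected = gc.collect()           # number of unreachable objects found
uncollectable = list(gc.garbage)   # objects the collector could not free
gc.enable()

print("collected: %d, uncollectable: %d" % (collected, len(uncollectable)))
```

An empty `gc.garbage` after a full collection means the collector was able to free everything it found, which matches the "shows nothing" result reported here.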

d.c....@gmail.com

Apr 15, 2013, 6:52:49 PM
to rdfli...@googlegroups.com, d.c....@gmail.com
On Friday, April 12, 2013 2:33:03 AM UTC-7, Gunnar Aastrand Grimnes wrote:
> Putting a loop around the query and doing it over and over again
> forever doesnt lead to an infinitely growing python process. It bobs
> up and down a bit, but stays at roughly 300mb ...
>
> Also, checking for uncollectable reference cycles with the gc module
> shows nothing.

That's a good point. But I'm still curious to know what's going on. For instance, try doing this:

from guppy import hpy


g = random_graph()
r = g.query("SELECT * WHERE {?x1 ?x2 <http://foo.baz/Person> . ?x3 ?x4 <http://foo.baz/Person> . }")

hpy().heap()

On my system, the first line of the report shows "Total size = 175417160 bytes", that is, the inside-Python memory profiler thinks Python is using about 175 MB of memory, and you can see entries near the top of the list ("dict of 0x1665e20" in my case) that are presumably related to RDFLib objects. The outside-Python system monitor thinks Python is using about 330 MB of memory. Now, do

del r
hpy().heap()

The inside-Python profiler now reports about 17 MB of memory in use (so ten times less), and the outside-Python monitor reports about 330 MB (the same as before). So, although memory usage may not be growing in a loop, it appears to me that something strange is still going on. The Python interpreter thinks that it *has* released this memory, the garbage collector doesn't know about it, but it's still not released back to the OS.

> I guess even though the python objects are freed, python is just slow
> at returning the memory to the OS?

I thought that memory would be released back to the OS very quickly - if I build a large list of strings and then delete it, the memory is released immediately. For example,

l = []
for i in xrange(500000):
    l.append(random_string(20))

OS reports 45 MB in use, but after running 'del l' this immediately drops to 9 MB.

I'm really too much of a Python neophyte to know whether this is all normal somehow, with the interpreter just optimizing something or doing things behind the scenes that I'm unaware of. And I'm sorry to keep pushing on this issue. I'm still trying to understand what's going on here because, for the problem I'm working on, I would like to be able to process SPARQL queries that return very many results, but I do not need all of those results at once. As it stands, though, even writing something like

for r in g.query( ... ):
    pass

uses as much memory as storing all the query results in a list at once. If there's any way that I can reduce the memory requirements for g.query() it would be a big help. Thanks for any comments!

-Daniel

Gunnar Aastrand Grimnes

Apr 22, 2013, 6:24:34 AM
to rdfli...@googlegroups.com, Daniel Grady
In general, maybe this explains a bit:

http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm

For rdflib-sparql, ideally, the query evaluation would all be done
with iterators. I started implementing it like this, but there were
so many places where you have to build up the whole result list in
memory that I figured I'd save myself some trouble and just pass the
whole list around everywhere for the initial version. So far, I've yet
to improve on this (and don't hold your breath :)

I made a ticket, though: https://github.com/RDFLib/rdflib/issues/267

- Gunnar
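
To illustrate the list-versus-iterator distinction in general terms, here is a small Python 3 sketch. It does not show how rdflib-sparql evaluates queries: `pairs_as_list` and `pairs_as_iterator` are hypothetical helpers, and the cross product of 500 made-up subject URIs stands in for the 250,000-row result of the example query in this thread.

```python
import itertools


def pairs_as_list(subjects):
    """Build every pair up front; memory grows with len(subjects) ** 2."""
    return [(a, b) for a in subjects for b in subjects]


def pairs_as_iterator(subjects):
    """Yield pairs one at a time; only the input list stays in memory."""
    return itertools.product(subjects, repeat=2)


# Made-up subjects standing in for the 500 random people in the example graph.
subjects = ["http://foo.baz/People#%d" % i for i in range(500)]

# Consuming the iterator touches all 250,000 pairs without storing them.
count = sum(1 for _ in pairs_as_iterator(subjects))
print(count)
```

With the iterator form, a loop that only inspects each row never holds more than one pair at a time, which is the behaviour asked for earlier in the thread.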