>>> from guppy import hpy
>>> hpy().heap()
Partition of a set of 159556 objects. Total size = 37435816 bytes.
 Index  Count   %     Size   %  Cumulative   % Kind (class / dict of class)
     0  76137  48 26520792  71    26520792  71 dict (no owner)
     1  37959  24  3213264   9    29734056  79 str
     2  18560  12  1555400   4    31289456  84 tuple
     3    240   0   660864   2    31950320  85 dict of module
     4   5029   3   643712   2    32594032  87 types.CodeType
     5   4934   3   592080   2    33186112  89 function
     6    686   0   565144   2    33751256  90 unicode
     7    574   0   544336   1    34295592  92 dict of type
     8   1984   1   529112   1    34824704  93 rdflib.term.URIRef
     9    575   0   516240   1    35340944  94 type
<222 more rows. Type e.g. '_.more' to view.>
>>> al.query(query_pop[46])
<rdfextras.sparql.query.SPARQLQueryResult object at 0x2ba0ad0>
>>> hpy().heap()
Partition of a set of 175175 objects. Total size = 41181944 bytes.
 Index  Count   %     Size   %  Cumulative   % Kind (class / dict of class)
     0  76504  44 27129280  66    27129280  66 dict (no owner)
     1  43524  25  3769712   9    30898992  75 str
     2  21885  12  1841888   4    32740880  80 tuple
     3   5902   3   755456   2    33496336  81 types.CodeType
     4    432   0   733824   2    34230160  83 dict of pyparsing.Suppress
     5    262   0   732304   2    34962464  85 dict of module
     6   5849   3   701880   2    35664344  87 function
     7    706   0   643120   2    36307464  88 dict of type
     8    707   0   636368   2    36943832  90 type
     9    696   0   567728   1    37511560  91 unicode
<258 more rows. Type e.g. '_.more' to view.>
>>> len(al)
9135
>>> len(al.query(query_pop[46]))
430
>>>
The first time I checked memory usage, Heapy reported about 37 MB in use and my system reported about 52 MB in use by Python. The second time, after running the query, Heapy reported about 41 MB in use and my system reported about 308 MB! Even if I delete all variables, modules, and classes in the global namespace and manually run the garbage collector, this memory isn't released.
Does anyone have any idea why this occurs? Unfortunately I can't share the data I'm using on this project, and I haven't been able to replicate the problem with the small public examples I've tried. But if anyone has any suggestions or directions to investigate, I'd certainly appreciate it. Thank you!
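For anyone who wants to put the "system" number next to the Heapy number from inside the same session, here is a rough sketch. It assumes Linux (it reads /proc/self/statm, which matches the Ubuntu setup described in this thread); the helper names rss_bytes and measure are made up for illustration and are not part of guppy or rdflib.

```python
import os


def rss_bytes():
    """Current resident set size of this process in bytes (Linux only)."""
    # /proc/self/statm lists sizes in pages; the second field is resident pages.
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def measure(n=200000):
    """Return RSS before, during, and after holding a large list of strings."""
    before = rss_bytes()
    data = [str(i) * 20 for i in range(n)]
    during = rss_bytes()
    del data
    after = rss_bytes()
    return before, during, after


if __name__ == "__main__":
    for label, value in zip(("before", "during", "after"), measure()):
        print("%-6s %.1f MB" % (label, value / 1e6))
```

Calling rss_bytes() right before and after each hpy().heap() call would give both views of the same moment, which makes the gap between them easier to track.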
> You dont say which versions of rdflib you use [...]
Sorry for forgetting that; I was running Python 2.7.3, RDFLib 3.4.0, rdfextras 0.4, and Ubuntu 12.10.
> The "new" SPARQL implementation for rdflib is actually in rdflib-sparql
Ah ha, definitely missed that. I removed rdfextras and installed rdflib-sparql 0.1 (from PyPI), and now rdflib.Graph().query is behaving much more nicely. Thank you for pointing that out!
Do you have any idea what was causing the problem in the older SPARQL implementation? I was pretty confused by the fact that the memory profiler couldn't see where all the memory was going.
-Daniel
import rdflib
import random
import string

def random_string(n):
    return ''.join(random.choice(string.letters) for x in xrange(n))

def random_graph():
    # Build a graph of 500 randomly named resources, all typed as Person.
    g = rdflib.Graph()
    for i in xrange(500):
        g.add((
            rdflib.term.URIRef('http://foo.baz/People#' + random_string(10)),
            rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
            rdflib.term.URIRef('http://foo.baz/Person')
        ))
    return g

if __name__ == '__main__':
    g = random_graph()
    r = g.query("SELECT * WHERE {?x1 ?x2 <http://foo.baz/Person> . ?x3 ?x4 <http://foo.baz/Person> . }")
    r = []  # drop the only reference to the query result
This query asks for all pairs of links in the random graph, so there will be 250,000 results. After running this, the Python process is taking up more than 300 MB of memory. The Heapy memory profiler sees only about 20 MB of objects on the heap.
What I would expect to happen in this situation is that a lot of memory would be allocated to run the query, but that once I eliminated any references to the results of the query, all this memory would be released back to the operating system. This is what happens if I build a very long list of random strings, for example.
Is this the expected behavior for rdflib-sparql? If it is, is there any way for me to tell rdflib to release this memory (aside from isolating all the queries in a subprocess)? Or am I completely misunderstanding what's happening here? I'd very much appreciate any comments or suggestions; thank you!
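The subprocess workaround mentioned above could look roughly like the following. This is only a sketch under some assumptions: it is written for Python 3 on a POSIX system (it asks for the "fork" start method explicitly), the names run_isolated and count_pairs are hypothetical, and count_pairs merely stands in for a function that would run g.query(...) and return a small summary of the results. Whatever memory the child allocates goes back to the OS when the child exits.

```python
import multiprocessing


def _child(conn, func, args):
    # Runs in the child process: send back either the result or the error text.
    try:
        conn.send((True, func(*args)))
    except Exception as exc:
        conn.send((False, repr(exc)))
    finally:
        conn.close()


def run_isolated(func, *args):
    """Run func(*args) in a forked child and return its (picklable) result."""
    ctx = multiprocessing.get_context("fork")
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_child, args=(child_conn, func, args))
    proc.start()
    child_conn.close()  # the parent only reads
    ok, payload = parent_conn.recv()
    proc.join()
    if not ok:
        raise RuntimeError(payload)
    return payload


def count_pairs(n):
    # Stand-in for a memory-hungry query: build n*n pairs, return only a count.
    return len([(a, b) for a in range(n) for b in range(n)])


if __name__ == "__main__":
    print(run_isolated(count_pairs, 500))
```

The point of the design is that only the small return value crosses the pipe; returning the full result set would copy it into the parent and defeat the purpose.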
-Daniel
That's a good point. But I'm still curious to know what's going on. For instance, try doing this:
from guppy import hpy
g = random_graph()
r = g.query("SELECT * WHERE {?x1 ?x2 <http://foo.baz/Person> . ?x3 ?x4 <http://foo.baz/Person> . }")
hpy().heap()
On my system, the first line of the report shows "Total size = 175417160 bytes", that is, the inside-Python memory profiler thinks Python is using about 175 MB of memory, and you can see entries near the top of the list ("dict of 0x1665e20" in my case) that are presumably related to RDFLib objects. The outside-Python system monitor thinks Python is using about 330 MB of memory. Now, do
del r
hpy().heap()
The inside-Python profiler now reports about 17 MB of memory in use (so ten times less), and the outside-Python monitor reports about 330 MB (the same as before). So, although memory usage may not be growing in a loop, it appears to me that something strange is still going on. The Python interpreter thinks that it *has* released this memory, the garbage collector doesn't know about it, but it's still not released back to the OS.
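One diagnostic that might help narrow this down, offered as an experiment and not as a confirmed explanation: on glibc-based Linux systems the C library exposes malloc_trim, which asks the allocator to hand free heap memory back to the OS. The sketch below calls it through ctypes; try_malloc_trim is a made-up helper name, and the assumption that the missing memory sits in the C allocator's free lists is exactly that, an assumption.

```python
import ctypes


def try_malloc_trim():
    """Ask glibc to release free heap memory back to the OS.

    Returns 1 if some memory was released, 0 if none was,
    and None if malloc_trim is not available on this platform.
    """
    try:
        libc = ctypes.CDLL("libc.so.6")
        trim = libc.malloc_trim
    except (OSError, AttributeError):
        return None
    trim.argtypes = [ctypes.c_size_t]
    trim.restype = ctypes.c_int
    return trim(0)


if __name__ == "__main__":
    print(try_malloc_trim())
```

If the OS-level number drops after calling this following 'del r', that would suggest the memory was free but still held by the allocator; if it does not, the cause lies elsewhere.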
> I guess even though the python objects are freed, python is just slow
> at returning the memory to the OS?
I thought that memory would be released back to the OS very quickly - if I build a large list of strings and then delete it, the memory is released immediately. For example,
l = []
for i in xrange(500000):
    l.append(random_string(20))
The OS reports 45 MB in use, but after running 'del l' this immediately drops to 9 MB.
I'm really too much of a Python neophyte to understand if this is all normal somehow, and that the interpreter is just optimizing something or doing things behind the scenes that I'm unaware of. And I'm sorry to continue to push on this issue. I'm still trying to understand what's going on here because, for the problem I'm working on, I would like to be able to process SPARQL queries that return very many results, but I do not need all of those results at once. As it stands, though, even writing something like
for r in g.query( ... ):
    pass
uses as much memory as storing all the query results in a list at once. If there's any way that I can reduce the memory requirements for g.query() it would be a big help. Thanks for any comments!
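To make the distinction concrete, here is a small pure-Python illustration of the behavior being asked for, using the same 500 x 500 pairing as the test query. It says nothing about how rdflib evaluates queries internally; the function names are invented for the example, and itertools.product simply plays the role of a join that yields one row at a time.

```python
import itertools


def all_pairs_list(items):
    # Materializes every pair up front: memory grows with len(items) ** 2.
    return [(a, b) for a in items for b in items]


def all_pairs_lazy(items):
    # Yields pairs one at a time: only the input and the current pair are held.
    return itertools.product(items, repeat=2)


people = ["person%d" % i for i in range(500)]

count = 0
for pair in all_pairs_lazy(people):
    count += 1  # process each pair, then let it go

print(count)  # 250000
```

A loop over the lazy version never holds more than one result at a time, which is the memory profile a 'for r in g.query( ... )' loop would need in order to scale to very large result sets.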
-Daniel