[xappy-discuss] Massive memory leak


Dominic LoBue

Mar 9, 2010, 12:48:58 PM
to xappy-discuss
Hello,

I am seeing a massive memory leak in xappy, but I'm having trouble
pinning it down exactly.

Some background: I'm using xappy to index my email and store useful
headers for quick access.

In the program I'm developing, I open a new search connection, perform
a query, copy all the header information into a custom container
class, and then close the search connection. I have found that as I
keep performing these operations, my program uses more and more RAM
and never releases any of it.

Here's a really simple example that makes the problem obvious:
import pdb
import xappy
from overwatch import xapidx
from databasics import msg_factory

sconn = xappy.SearchConnection(xapidx)
r = sconn.search(sconn.query_all(), 0, 99999999,
                 checkatleast=-1, sortby='-sent')
r = map(msg_factory, r)
del r
del sconn
pdb.set_trace()


msg_factory is just a factory function that returns a named tuple
containing all the header information from the ProcessedDocument it is
given.

Running that script on my machine and checking `ps aux` once it drops
into pdb, I see that the script is using 128568k, or ~125 MB of RAM.
Now, correct me if I'm wrong, but since I've deleted all my objects,
shouldn't the only things still using memory be the Python interpreter
and everything I imported?

I'm using the latest xappy from trunk and xapian 1.0.17.

Any idea how to fix this?

Dominic

boult...@googlemail.com

Mar 9, 2010, 5:53:37 PM
to xappy-discuss

If you add a pdb.set_trace() before calling any xappy methods (e.g.,
just before the assignment to sconn), what do you see in terms of
memory usage at that point?

If I use "ps v" to display memory use, I see that my version is using
245 MB of virtual memory and 19 MB of resident memory at that point,
i.e., before doing a search. After a search, the virtual memory goes up
to 247 MB, with a slight increase in resident memory. Python just uses
a lot of memory on startup, particularly if you have a lot of stuff
installed, and doesn't necessarily release it back to the operating
system quickly after use.

If you put the search into a loop, just running the same search
repeatedly, and memory increases continuously, that would convince me
there's a memory leak here.
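
Something along these lines would do it (an untested sketch: it reuses
the xapidx index, query and msg_factory from your script, and reads
VmRSS from /proc, so it is Linux-only):

import xappy
from overwatch import xapidx
from databasics import msg_factory

def rss_kb():
    # Linux-only: current resident set size, in kB, from /proc
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])

for i in range(10):
    sconn = xappy.SearchConnection(xapidx)
    results = sconn.search(sconn.query_all(), 0, 99999999,
                           checkatleast=-1, sortby='-sent')
    results = map(msg_factory, results)
    del results
    sconn.close()
    del sconn
    print('iteration %d: RSS %d kB' % (i, rss_kb()))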

--
Richard

Kapil Thangavelu

Mar 9, 2010, 6:14:37 PM
to xappy-...@googlegroups.com
I'd try something like meliae from pdb to dump your reference counts; that should give you some insight into whether you're generating object cycles that are the source of the leaks.
http://jam-bazaar.blogspot.com/2009/11/memory-debugging-with-meliae.html
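
The pattern from that post is roughly (a sketch: take the dump at the
point where memory is high, then inspect it in a separate session):

# at the point where memory is high (e.g. at the pdb prompt):
from meliae import scanner
scanner.dump_all_objects('memory-dump.json')

# then, in a separate python session:
from meliae import loader
om = loader.load('memory-dump.json')
om.summarize()   # prints object counts and total sizes, grouped by type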




Dominic LoBue

Mar 9, 2010, 11:15:41 PM
to xappy-...@googlegroups.com

Thanks for the quick response.

To answer your first question, memory usage after opening a
SearchConnection but before searching is:
VSZ: 244960
RSS: 24312

After the query is run:
VSZ: 4151212
RSS: 25056

After mapping my msg_factory function against the search results:
VSZ: 4235424
RSS: 125288

After closing the connection to the database and deleting the results
object and search connection object:
VSZ: 329172
RSS: 124936

Alright, I tried the loop test you suggested and found that memory was
reused as it should be. I then modified the test to more accurately
reflect what's going on in my program, and I was able to recreate the
memory leak.

Results from loop test:
Loop1:
VSZ: 349624
RSS: 185196
Loop2:
VSZ: 508312
RSS: 289244
Loop3:
VSZ: 574992
RSS: 366412

Loop test code:
**********************************************************************************************
def get_threads(sconn, query):
    __threads = search(sconn, query, True)
    __threads = map(lambda x: x.data['thread'][0], __threads)
    return __threads

def get_members(sconn, tids):
    # compose the query
    query = map(lambda x: sconn.query_field('thread', x, sconn.OP_OR), tids)
    query = sconn.query_composite(sconn.OP_OR, query)

    results = search(sconn, query)
    results = map(msg_factory, results)
    return results

def search(sconn, query, collapse=False):
    __searchargs = (0, 99999999)
    __searchkwargs = {'checkatleast': -1, 'sortby': '-sent'}
    if collapse: __searchkwargs['collapse'] = 'thread'
    __results = sconn.search(query, *__searchargs, **__searchkwargs)
    return __results

cont = thread_container()
c = 0

while c < 5:
    sconn = xappy.SearchConnection(xapidx)
    qall = sconn.query_all()
    tids = get_threads(sconn, qall)
    to_join = get_members(sconn, tids)
    sconn.close()
    cont.thread(to_join)
    pdb.set_trace()
    c += 1
**********************************************************************************************


Those functions are built the way they are for a number of reasons:
- I couldn't figure out how to take a result set and collapse it on a field.
- Nor could I figure out how to expand a collapsed result set.
- I need to know if there are any new documents.
- This way allows me to avoid having to sort my results after the fact.
- I can build my conversation widget list incrementally.

If there's a better way of doing things, I'm open to suggestions. As
it is, other than Python eating tons of RAM, the way I've built it
works quite well.

--
Dominic LoBue

Dominic LoBue

Mar 9, 2010, 11:23:02 PM
to xappy-...@googlegroups.com

Kapil,

Funny you should mention meliae, because I've been using it and
objgraph to try to debug this problem.

What I found is that while `ps aux` reports Python using >120 MB
of RAM, meliae only reports Python using 44 MB.

From my testing I was able to rule out everything except xappy/xapian.
I figure either something in xapian is not being destructed by xappy,
or there's a memory leak in xapian or the Python SWIG xapian bindings.
Since the answer is out of my league, I thought I'd start with xappy
first and work my way down.

--
Dominic LoBue

Richard Boulton

Mar 10, 2010, 4:56:16 AM
to xappy-...@googlegroups.com
Your test code calls cont.thread(to_join) in the loop, which, from
your description, sounds like something that stores the results
somewhere. You've not provided the code for thread_container() and
msg_factory(), so I don't know what they're doing, but my guess would
be that the cont.thread(to_join) call is causing more data to be
stored each time, which is why the memory usage increases each time
you go around the loop. Try taking that call out and see how the
memory usage behaves.

It's unlikely that there's a memory leak in xapian core (the C++ code)
- that code is very thoroughly tested for this kind of thing.

It's plausible there's a leak in the xapian bindings, but you're not
doing anything unusual and I'd expect to have found it by now. The
same goes for xappy: I'm using it in production environments in
long-running servers, and not observing any memory-leak-type
behaviour. So, the most likely explanation is that something in your
code is causing data to be retained. Second most likely is that you're
doing something with xappy which my long-running servers don't do, but
I can't see offhand what that would be.

--
Richard

Dominic LoBue

Mar 10, 2010, 7:01:04 AM
to xappy-...@googlegroups.com

Richard,

I didn't think they were relevant, so I didn't include them before. To
avoid needlessly filling up my emails with code, you can view
msg_factory and thread_container here:
http://github.com/dlobue/achelois/blob/master/achelois/databasics.py
To save you some time though, the thread method of thread_container
filters out anything it already has.

I did as you suggested and took the cont.thread(to_join) out of the
loop. Here are the results:
 VSZ     RSS
332756  166708
426628  260720
426628  260856
492164  276712


Here's what the query is typically set to in the get_members function
if I'm getting everything (8990 messages, >2500 email threads):
http://gist.github.com/327791

--
Dominic LoBue

Richard Boulton

Mar 10, 2010, 7:29:27 AM
to xappy-...@googlegroups.com
On 10 March 2010 12:01, Dominic LoBue <dom....@gmail.com> wrote:
>   To save you some time though, the thread method for
> thread_container filters out anything it already has.

I've just looked at your code briefly: I don't think it does. It
seems to call a "join" method on each result, which merges the result
with existing results if the id is already found. (A style point
about your code: you don't need to prefix all the local variables you
use in functions with __ - in fact, you shouldn't do so. Variables
defined inside functions have local scope anyway.)

> I did as you suggested and took the cont.thread(to_join) out of the
> loop. Here are the results:
>  VSZ     RSS
> 332756 166708
> 426628 260720
> 426628 260856
> 492164 276712

That doesn't look like convincing evidence of a leak to me: the memory
only went up significantly after the first query. The fluctuation
after that could easily be to do with the garbage collector.

--
Richard

Dominic LoBue

Mar 10, 2010, 10:12:26 AM
to xappy-...@googlegroups.com
On Wed, Mar 10, 2010 at 4:29 AM, Richard Boulton <ric...@tartarus.org> wrote:
> On 10 March 2010 12:01, Dominic LoBue <dom....@gmail.com> wrote:
>>   To save you some time though, the thread method for
>> thread_container filters out anything it already has.
>
> I've just looked at your code briefly: I don't think it does.  It
> seems to call a "join" method on each result which merges the result
> with existing results if the id is already found.

Right. The thread_container puts all conversation objects into a list
(for sorting by the date of the most recently received email), and
adds a mapping of thread id to the conversation object to a
WeakValueDictionary for quick access.

When the thread_container is told to thread a conversation object into
itself, it first tries to merge the new conversation object into an
already existing conversation object with the same thread id. If one
isn't found, it's a new conversation and is simply appended.

If a conversation with the same thread id is found, I merge the new
conversation into the existing one. First I update the attributes
unique_terms and labels (which are sets) with the corresponding
attributes from the new conversation. Since they are sets, they won't
copy over anything they already have.

Lastly I do this:
def do_insort(x):
    insort_right(self.messages, x)
    self.muuids.extend(x.muuid)

map(do_insort,
    filter(lambda x: x.muuid[0] not in self.muuids,
           dispose.messages))

That probably should be multiple steps instead of one giant one, but
it gets the job done and makes sure that nothing is duplicated.
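
If I were to break it up into steps, it would probably look something
like this (untested):

new_msgs = [msg for msg in dispose.messages
            if msg.muuid[0] not in self.muuids]
for msg in new_msgs:
    insort_right(self.messages, msg)
    self.muuids.extend(msg.muuid)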

>(A style point
> about your code: you don't need to prefix all the local variables you
> use in functions with __ - in fact, you shouldn't do so. Variables
> defined inside functions have local scope anyway.)
>

Thanks for the tip!

>> I did as you suggested and took the cont.thread(to_join) out of the
>> loop. Here are the results:
>>  VSZ     RSS
>> 332756 166708
>> 426628 260720
>> 426628 260856
>> 492164 276712
>
> That doesn't look like convincing evidence of a leak to me: the memory
> only went up significantly after the first query.  The fluctuation
> after that could easily be to do with the garbage collector.

That's great, but it still leaves me with xapian eating up gobs of
RAM, which is a big concern since my program is a desktop application.

Is there any way to coax xapian into releasing the memory it allocated
for old search results/search connections/whatever?

--
Dominic LoBue

Richard Boulton

Mar 10, 2010, 10:29:28 AM
to xappy-...@googlegroups.com
On 10 March 2010 15:12, Dominic LoBue <dom....@gmail.com> wrote:
> That's great, but still leaves me with xapian eating up gobs of ram,
> which is a big concern since my program is a desktop application.
>
> Is there any way to coax xapian to release memory it allocated for old
> search results/searchconnects/whatever?

I suppose you could try telling the garbage collector to do a run:

import gc
gc.collect()

None of Xapian, the Xapian bindings, or Xappy caches previously
calculated results, so as long as your code no longer holds references
to them, they should be available for garbage collection.

Note that Python doesn't necessarily return memory to the operating
system just because the Python objects which were using it have been
freed: Python has its own memory allocator, which may well keep hold
of memory that is no longer in use by any Python objects in the
process.

One approach, if calling gc.collect() directly doesn't help, could be
to allocate fewer objects at a time: your code looks like it builds
various large collections of things; you could try changing it to
compute only those results needed for display, as in the sketch below.
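
For instance (an untested sketch; it reuses your msg_factory, and the
page size is an arbitrary choice):

PAGE_SIZE = 50

def fetch_page(sconn, query, page):
    # ask xappy for a single page of results instead of ranks 0-99999999
    start = page * PAGE_SIZE
    results = sconn.search(query, start, start + PAGE_SIZE,
                           checkatleast=-1, sortby='-sent')
    return map(msg_factory, results)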

--
Richard

Richard Boulton

Mar 10, 2010, 10:56:26 AM
to xappy-...@googlegroups.com
You might also find it interesting to look at the output of
len(gc.get_objects()) - this gives the number of objects known to the
garbage collector. I've just run various tests using this to check
whether objects are leaked by xappy when running a query in a loop,
and have not seen any case where the count rises on each pass through
the loop.
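
The tests were roughly of this shape (a sketch, adapted here to use
your index rather than my test databases):

import gc
import xappy
from overwatch import xapidx

sconn = xappy.SearchConnection(xapidx)
for i in range(10):
    results = list(sconn.search(sconn.query_all(), 0, 100))
    del results
    gc.collect()
    print('loop %d: %d objects' % (i, len(gc.get_objects())))
sconn.close()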

--
Richard

Dominic LoBue

Mar 12, 2010, 12:06:53 PM
to xappy-...@googlegroups.com

Richard,

I wanted to thank you for all your help in diagnosing this problem the
other day.

In case anybody is curious, I believe the cause of my massive memory
use is memory fragmentation.

I just read that when you assign a string to a new namespace it
creates a new object in memory instead of reusing the original one. So
when I manipulate all the results from xapian into my data containers
simultaneously, I'm also fragmenting my RAM all to hell.

No solution yet, but at least I (think I) know what the problem is now.

--
Dominic LoBue
