[reportlab-users] memory leak in reportlab xmllib.FastXMLParser


Mirko Dziadzka

Dec 17, 2013, 11:14:12 AM
to reportl...@lists2.reportlab.com
Hi

I’m not sure if this is the right list for a bug report; any pointers to another address are welcome.

Problem description
===============

The following program creates a memory leak.

As a result, using reportlab.platypus.paraparser.ParaParser creates a memory leak too.
As a result, wordaxe has a memory leak (my original problem).


# show memory leak
import gc
from reportlab.lib import xmllib

assert xmllib.sgmlop # check that we are using FastXMLParser

while True:
    parser = xmllib.XMLParser(verbose=0)
    parser.close()
    gc.collect()


How to reproduce
==============

Just start this program and watch the memory usage go up; see the sixth column (RSS) in the ps output below.

$ while sleep 10 ; do ps auxww | grep python | grep -v grep ; done
mirko 1023 100,0 0,3 2467680 25608 s000 R+ 5:10pm 1:00.44 python t.py
mirko 1023 100,0 0,3 2469728 27424 s000 R+ 5:10pm 1:10.47 python t.py
mirko 1023 99,3 0,3 2471520 29048 s000 R+ 5:10pm 1:20.50 python t.py
mirko 1023 100,0 0,4 2472288 30616 s000 R+ 5:10pm 1:30.54 python t.py


I tested this with reportlab-2.5 and reportlab-2.7 on CentOS 6 (64-bit) and Mac OS X 10.8, with Python 2.6 and Python 2.7.
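
If you prefer to watch the numbers from inside the process rather than via ps, the same loop can print its own peak RSS (a sketch; it uses the standard resource module, available on Linux and Mac OS X -- note that ru_maxrss is reported in kilobytes on Linux and in bytes on Mac OS X):

import gc, resource
from reportlab.lib import xmllib

assert xmllib.sgmlop # check that we are using FastXMLParser

i = 0
while True:
    parser = xmllib.XMLParser(verbose=0)
    parser.close()
    i += 1
    if i % 10000 == 0:
        gc.collect()
        # peak resident set size so far; it keeps growing while the leak is present
        print i, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss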

Analysis
=======

It seems that there is a cyclic reference between FastXMLParser and sgmlop, and parser.close() does not clean it up.

Using SlowXMLParser instead of XMLParser works fine.
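
For comparison, the workaround run through the same loop (a sketch; it assumes SlowXMLParser accepts the same verbose keyword as XMLParser -- if it does not, construct it without arguments):

import gc
from reportlab.lib import xmllib

while True:
    parser = xmllib.SlowXMLParser(verbose=0)  # pure-Python parser, no sgmlop
    parser.close()
    gc.collect()                              # memory stays flat here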



Robin Becker

Dec 17, 2013, 12:13:23 PM
to reportlab-users
I suspect we've had this problem for a long time in the sgmlop C extension
module. Originally the paraparser.py module used to create just one parser for
use by all the paragraph instances, but I think Andy recently changed that to
create one for each instance in a vain attempt to stop threading issues.

I'll take a look at the sgmlop.c code to see if I can spot this; that code is
very old and not ours originally. It certainly doesn't cooperate with GC at all.


On 17/12/2013 16:14, Mirko Dziadzka wrote:
> Hi
>
> I’m not sure if this is the right list for a bug report, any pointers to another address are welcome.
>
> Problem description
> ===============
>
> The following program creates a memory leak.
>
> As a result, using reportlab.platypus.paraparser.ParaParser creates a memory leak too.
> As a result, wordaxe has a memory leak (my original problem)
>
........

--
Robin Becker

Mirko Dziadzka

Dec 17, 2013, 12:35:55 PM
to reportlab-users

Using the slow parser (by disabling sgmlop) solved my immediate problem. Our paragraphs are short ;-)

After a second look at FastXMLParser, there seem to be two references to the sgmlop parser:

self.parser = sgmlop.XMLParser()
self.feed = self.parser.feed

and close() only sets self.parser to None; self.feed still holds a reference to the sgmlop parser (via the bound method).

Changing FastXMLParser.close() so that it also sets

self.feed = None

removes the memory leak in the simple example.
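
As a toy illustration of why that second reference matters (this is not ReportLab code; the two classes below just stand in for FastXMLParser and the sgmlop parser):

import weakref

class Inner(object):                    # stands in for the sgmlop parser
    def feed(self, data):
        pass

class Outer(object):                    # stands in for FastXMLParser
    def __init__(self):
        self.parser = Inner()
        self.feed = self.parser.feed    # second reference, via a bound method
    def close(self):
        self.parser = None              # what close() does before the fix

o = Outer()
watch = weakref.ref(o.parser)           # track the inner parser object
o.close()
print watch() is None                   # False: o.feed still keeps Inner alive
o.feed = None                           # the extra line proposed above
print watch() is None                   # True: the inner parser has been freed

In the real FastXMLParser the situation is worse, because the sgmlop C object also refers back to the handler, which gives the cycle Robin mentions.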

OTOH, I'm not sure whether paraparser calls close() on the xmllib parser in all cases.

Robin Becker

Dec 17, 2013, 12:53:38 PM
to reportlab-users
I find this fixes the issue:

> diff -r d7705185366c src/reportlab/lib/xmllib.py
> --- a/src/reportlab/lib/xmllib.py Tue Dec 17 13:58:46 2013 +0000
> +++ b/src/reportlab/lib/xmllib.py Tue Dec 17 17:52:50 2013 +0000
> @@ -524,6 +524,8 @@
> try:
> self.parser.close()
> finally:
> + self.feed = None
> + del self.parser
> self.parser = None
>
> # Interface -- translate references

I will check this into the trunk code tomorrow.


On 17/12/2013 17:35, Mirko Dziadzka wrote:
>
...........

Mirko Dziadzka

Dec 17, 2013, 2:35:01 PM
to reportlab-users
This solves the immediate problem. However, the problem reappears at higher levels.

All code using ParaParser must call close() to free the memory.

But even reportlab's internal code does not do this in every case.
The program below reconstructs the memory leak at a higher level.


import gc

from reportlab.platypus.paragraph import Paragraph
from reportlab.lib.styles import ParagraphStyle

normalStyle = ParagraphStyle('normal')

# construct input which can not be decoded as utf-8
# this will throw an exception in the parser
brokenInput = unichr(228).encode("iso-8859-1")

while True:
    try:
        p = Paragraph(brokenInput, normalStyle)
    except Exception, e:
        # will complain about invalid encoding ....
        # NOTE that I have NO WAY of cleaning up the memory here ...
        pass
    gc.collect()
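
For code that does own its parser, the close() contract above is the usual try/finally pattern (a sketch using the plain xmllib parser from the first message; code driving ParaParser directly would follow the same shape):

from reportlab.lib import xmllib

parser = xmllib.XMLParser(verbose=0)
try:
    parser.feed('<doc>some markup</doc>')
finally:
    parser.close()   # always runs, so the fixed close() can release the sgmlop references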

Robin Becker

Dec 17, 2013, 3:06:51 PM
to reportlab-users
I knew this would be the case; the problem is how to guarantee that
the parser gets collected properly in every case. I think this can be
done in Paragraph._setup, but there may be a way to use weak
references to ensure that the fast parser's __del__ gets called. A
solution in the fast parser would be preferable.
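
One possible shape for the weak-reference idea (a sketch only, not committed code: the register() call and the reset() method shown here are assumptions about how the fast parser hands itself to the sgmlop C object):

import weakref

class FastXMLParser:
    # ... all other methods unchanged ...

    def reset(self):
        self.parser = sgmlop.XMLParser()
        # hand sgmlop a weak proxy instead of self, so the C object holds no
        # strong reference back to this instance ...
        self.parser.register(weakref.proxy(self))   # assumed sgmlop API
        self.feed = self.parser.feed

    def __del__(self):
        # ... which lets __del__ run once the last strong reference is gone,
        # even if the caller never called close() explicitly
        self.close()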

Robin Becker

Dec 18, 2013, 6:27:09 AM
to reportlab-users
On 17/12/2013 19:35, Mirko Dziadzka wrote:
> This solves the immediate problem. However, the problem reappear on higher levels.
>
> All code using ParaParser must call close() to free the memory.
>
> But even reportlab internal code does not do this in every case.
> The program below reconstruct the memory leak on a higher level.
>
.........
Interestingly, with my fix in place the latest code doesn't fail with your
example; I guess the error happens too early. I have now tried to close off the
leak with changes to xmllib.py, paraparser.py & paragraph.py. I tested using the
scripts below. The case with frags was important, because originally we created
a parser in Paragraph._setup and then never used it if frags was present; a
sketch of the shape of that change follows the scripts.

####################################################
import gc, time
from reportlab.lib import xmllib

assert xmllib.sgmlop # check that we are using FastXMLParser

gc.set_debug(gc.DEBUG_LEAK)
while True:
    parser = xmllib.XMLParser(verbose=0)
    parser.close()
    gc.collect()
####################################################
import gc
from reportlab.platypus.paragraph import Paragraph
from reportlab.lib.styles import ParagraphStyle

normalStyle = ParagraphStyle('normal')

# malformed markup: the mismatched end tag
# will throw an exception in the parser
while True:
    try:
        p = Paragraph('<para> </a>', normalStyle)
    except Exception, e:
        # the parser complains about the bad markup ....
        # NOTE that there is NO WAY of cleaning up the memory here ...
        pass
    gc.collect()
####################################################
import gc
from reportlab.platypus.paragraph import Paragraph
from reportlab.lib.styles import ParagraphStyle

normalStyle = ParagraphStyle('normal')

# same malformed markup, but with frags supplied, so the parser
# created in Paragraph._setup is never actually used
while True:
    try:
        p = Paragraph('<para> </a>', normalStyle, frags=[(1,'aaa')])
    except Exception, e:
        # NOTE that there is NO WAY of cleaning up the memory here ...
        pass
    gc.collect()
####################################################
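
A sketch of the shape of the paragraph.py side of this (illustrative only; the parse() call and the helpers below are assumptions, not the committed code):

def _setup(self, text, style, bulletText, frags, cleanBlockQuotedText):
    if frags:
        # pre-built fragments supplied: no parser is needed at all
        self._applyFrags(style, bulletText, frags)                  # hypothetical helper
        return
    parser = ParaParser()
    try:
        parsed = parser.parse(cleanBlockQuotedText(text), style)    # assumed API
    finally:
        parser.close()   # release the sgmlop references even if parse() raised
    self._applyParsed(parsed, style, bulletText)                    # hypothetical helper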

Andy Robinson

Dec 18, 2013, 12:28:50 PM
to reportlab-users, Mirko Dziadzka
Hi Mirko (and everyone),

I suggest it's not worth spending too much time on this.

The last stage of our port to Python 3.3 compatibility is to change ParaParser
to work on something available in both Python 2.7 and 3.3, and to get rid of
sgmlop/xmllib forever. I think we will be there within two weeks.

The problem is that we need to parse a lot of little chunks of text, and the
available C-based parsers need some expensive setup (e.g. to set up all the
entities), then we loop over them in Python anyway. After various speed
experiments, I have concluded that there is no performance benefit to messing
around with expat/etree/lxml/pyRXP, so I'm currently trying to rewrite
paraparser.py using the html.parser in Python's standard library. This will
allow us to be fairly tolerant of poor markup, and to initialize a parser
object quickly. And we can get rid of sgmlop/xmllib forever. I would hope that
a parser in the standard library is leak-free; if not, at least it's Somebody
Else's Problem ;-)
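
As a rough illustration of that direction (this is not ReportLab's new paraparser, just a minimal sketch on top of the standard library parser, using HTMLParser in 2.7 and html.parser in 3.3):

try:
    from html.parser import HTMLParser      # Python 3.3
except ImportError:
    from HTMLParser import HTMLParser       # Python 2.7

class MiniParaParser(HTMLParser):
    # collects intra-paragraph markup events; cheap to create per paragraph
    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []
    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag, dict(attrs)))
    def handle_endtag(self, tag):
        self.events.append(('end', tag))
    def handle_data(self, data):
        self.events.append(('data', data))

p = MiniParaParser()
p.feed('<para>Hello <b>world</b></para>')
p.close()
print(p.events)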

Once we get this done, we hope to 'juggle branches' so that the
default code is running the new paraparser and work towards a release
in January or early February.

- Andy

--
Andy Robinson
Managing Director
ReportLab Europe Ltd.
Thornton House, Thornton Road, Wimbledon, London SW19 4NG, UK
Tel +44-20-8405-6420
