rfc3696 URL validation memory leak?

Osma Suominen

May 10, 2012, 1:53:20 AM
to lepl
Hi all

I'd like to use the LEPL rfc3696 module for URL/URI validation. But
when I added validation into my application (which processes RDF data
with many URLs, some of them broken), its memory usage jumped through
the roof. It seems to me that LEPL leaks a significant amount of
memory when validating URLs.

This simple test script, which validates 10000 generated URLs, uses
about 500 MB of memory on my system (Ubuntu 12.04 amd64, Python 2.7.3,
LEPL 5.1.1 installed via easy_install):

#!/usr/bin/env python
from lepl.apps.rfc3696 import HttpUrl
URLS = 10000
print "validating %d URLs" % URLS
validator = HttpUrl()
for i in xrange(URLS):
    url = "http://example.org/%d" % i
    validator(url)
print "done, press enter"
raw_input()

If I change the script to validate the same URL over and over, memory
usage goes back to normal. So maybe LEPL is storing (fragments of?)
the URLs somewhere. In this case I'm only interested in the validation
result (True/False), though. I would expect GC to reclaim any memory
after validation.
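
One way to put a number on this, rather than watching the process from outside, is to read the peak resident set size before and after the loop. This is only a minimal sketch, assuming a Unix system (the standard `resource` module is not available on Windows). The names `peak_rss` and `rss_growth` are made up for this illustration, and `validate` stands for any callable, such as the `HttpUrl()` validator above. Note that `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS.

```python
import resource


def peak_rss():
    """Peak resident set size of this process (kilobytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def rss_growth(validate, urls):
    """Run validate() over every URL and return how much peak RSS grew."""
    before = peak_rss()
    for url in urls:
        validate(url)
    return peak_rss() - before


if __name__ == "__main__":
    # Stand-in validator so the sketch runs without LEPL installed;
    # swap in HttpUrl() from lepl.apps.rfc3696 to reproduce the report.
    def fake_validate(url):
        return url.startswith("http://")

    urls = ["http://example.org/%d" % i for i in range(10000)]
    print("peak RSS grew by %d" % rss_growth(fake_validate, urls))
```

Because the figure is a peak, it never decreases, so it shows growth during validation but cannot show memory being handed back afterwards.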

I also tried moving the HttpUrl constructor inside the loop. The code
became a lot slower, taking minutes instead of seconds to run, but
memory usage is still high - in fact, even higher than in the first
run (I killed it at 5 minutes and more than 700 MB memory).

Am I doing something wrong?

Thanks,
Osma Suominen

andrew cooke

May 10, 2012, 3:46:48 AM
to le...@googlegroups.com

the constructor should be *outside* the loop.

i think you're right - lepl will cache data to improve speed on repeated
parses and that should be disabled for this library (that's also why repeating
a previous matcher consumes no more memory).
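
The effect described above can be illustrated with a toy cache. This is not LEPL's code, and LEPL's memoization works on matchers and input streams rather than on whole strings; `MemoizingValidator` and `looks_like_http` are hypothetical names used only to show why distinct inputs grow a cache while a repeated input does not.

```python
class MemoizingValidator(object):
    """Toy validator that keeps one cache entry per distinct input."""

    def __init__(self, check, memoize=True):
        self._check = check
        self._memoize = memoize
        self.cache = {}

    def __call__(self, text):
        if not self._memoize:
            # No caching: nothing is retained after the call returns.
            return self._check(text)
        if text not in self.cache:
            self.cache[text] = self._check(text)
        return self.cache[text]


def looks_like_http(url):
    # Deliberately trivial check; real validation is beside the point here.
    return url.startswith("http://")


# Same URL every time: the cache holds a single entry.
same = MemoizingValidator(looks_like_http)
for _ in range(1000):
    same("http://example.org/0")

# A new URL every time: the cache grows with each call.
distinct = MemoizingValidator(looks_like_http)
for i in range(1000):
    distinct("http://example.org/%d" % i)

# Caching switched off: the cache stays empty.
uncached = MemoizingValidator(looks_like_http, memoize=False)
for i in range(1000):
    uncached("http://example.org/%d" % i)
```

After the three loops the caches hold 1, 1000 and 0 entries respectively, which matches the pattern in the report: flat memory for a repeated URL, steady growth for generated ones.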

i'll test and do a new release this weekend, hopefully, but if you want to fix
things yourself and can access source, modify

matcher.config.compile_to_re()

to be

matcher.config.compile_to_re().no_memoize()

in _matcher_to_validator in lepl.apps.rfc3696

sorry about that + thanks for the report,
andrew

andrew cooke

May 13, 2012, 4:50:30 PM
to le...@googlegroups.com


OK, there's a new release (5.1.2) that disables memoization, making the RFC3696
package much more suitable for long-running processes.

Andrew