Hi all,
I'd like to use the LEPL rfc3696 module for URL/URI validation. But
when I added validation into my application (which processes RDF data
with many URLs, some of them broken), its memory usage jumped through
the roof. It seems to me that LEPL leaks a significant amount of
memory when validating URLs.
This simple test script, which validates 10000 generated URLs, uses
about 500 MB of memory on my system (Ubuntu 12.04 amd64, Python 2.7.3,
LEPL 5.1.1 installed via easy_install):

#!/usr/bin/env python
from lepl.apps.rfc3696 import HttpUrl

URLS = 10000
print "validating %d URLs" % URLS
validator = HttpUrl()
for i in xrange(URLS):
    url = "http://example.org/%d" % i
    validator(url)
print "done, press enter"
raw_input()

If I change the script to validate the same URL over and over, memory
usage goes back to normal. So maybe LEPL is storing (fragments of?)
the URLs somewhere. In this case I'm only interested in the validation
result (True/False), though. I would expect GC to reclaim any memory
after validation.
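
To make the comparison between distinct and repeated URLs easier to reproduce, here is a rough sketch of a measurement helper. It is not part of LEPL: the names peak_rss_kb, rss_growth_kb, distinct_urls and repeated_urls are hypothetical, the validator is passed in as a plain callable, and the numbers assume Linux, where ru_maxrss is reported in kilobytes. It should run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

import gc
import resource


def peak_rss_kb():
    # Peak resident set size of this process (kilobytes on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def rss_growth_kb(validate, urls):
    # Hypothetical helper: how much the peak RSS grew while validating urls.
    gc.collect()
    before = peak_rss_kb()
    for url in urls:
        validate(url)
    gc.collect()
    return peak_rss_kb() - before


def distinct_urls(n):
    # n different URLs, generated as in the test script.
    return ["http://example.org/%d" % i for i in range(n)]


def repeated_urls(n):
    # The same URL n times.
    return ["http://example.org/0"] * n


if __name__ == "__main__":
    # Stand-in validator so the sketch runs without LEPL installed;
    # replace it with HttpUrl() from lepl.apps.rfc3696 for the real test.
    validate = lambda url: url.startswith("http://")
    print("repeated:", rss_growth_kb(validate, repeated_urls(10000)), "KB")
    print("distinct:", rss_growth_kb(validate, distinct_urls(10000)), "KB")
```

Because ru_maxrss is a peak value that never decreases, the repeated-URL case has to run first (or each case in its own process); otherwise the growth from the distinct-URL run hides whatever happens afterwards.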
I also tried moving the HttpUrl constructor inside the loop. The code
became a lot slower, taking minutes instead of seconds to run, and
memory usage was still high; in fact, even higher than in the first
run (I killed it after 5 minutes, at more than 700 MB of memory).
Am I doing something wrong?
Thanks,
Osma Suominen