Python parsing taking too much CPU load and increasing memory usage


Yogesh Kamble

Apr 5, 2014, 5:15:07 AM
to pytho...@googlegroups.com
Hi,
    I have around 1700 XML files which I parse every day.
    My program does the following:
    1) Create a generator object over the list of XML files, xml_file_generator.
    2) while True:
           try:
               xml_file = xml_file_generator.next()
               with open(xml_file, "r") as f:
                   xml_data = f.read()
               soup = BeautifulSoup(xml_data)
               tag_list = soup.findAll("tr")
               for tag in tag_list:
                   # Get contents from the tag object
                   # Add the content to the MySQL database using a Django model
                   # (before inserting it also checks, based on a uniqueness constraint,
                   # whether the data is already present; if so, it skips the insert)
           except StopIteration:
               break

    
  The program above has now been running for about 1.5 days, and its memory usage keeps growing. I cannot figure out why it is loading the CPU and memory so heavily: I am using a generator and parsing the XML files one by one, yet after some time the program becomes very slow. How can I optimize the code? I tried ElementTree instead of BeautifulSoup, but the result is the same.

At some point while the program is running, the system also starts the "kswapd0" process. I have no clue what is going on.


Thin Rhino

Apr 5, 2014, 4:30:56 PM
to Python Pune

When you do a file.read(), it reads the entire file into memory. If you do not do a file.close(), the data you have read from the file will still stick around in memory.

I suggest you try putting a file.close() after reading the file.

Something like


while True:
    try:
        xml_file = xml_file_generator.next()
        with open(xml_file, "r") as f:
            xml_data = f.read()
        soup = BeautifulSoup(xml_data)
        tag_list = soup.findAll("tr")
        for tag in tag_list:
            # Get contents from tag object
            # Add content to MySQL
        f.close()

Cheers
Thinrhino


Venkatesh Halli

Apr 5, 2014, 9:29:51 PM
to pytho...@googlegroups.com


On 06-Apr-2014 2:00 AM, "Thin Rhino" <thin...@gmail.com> wrote:
>
> I suggest you try putting a file.close() after reading the file.
>

Using 'with' already ensures that the file is closed, even if an exception is raised, so the extra file.close() isn't needed.
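
A quick way to check that, in case of doubt (using some existing file, "sample.xml" here, purely for illustration):

f = None
try:
    with open("sample.xml") as f:
        raise ValueError("simulated failure inside the with block")
except ValueError:
    pass
print(f.closed)  # prints True: the file was closed despite the exception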

steve

Apr 6, 2014, 1:24:39 AM
to pytho...@googlegroups.com
Firstly, why are you doing the 'while(True)+try/except'? That seems highly
suspect. If your xml_file_generator is indeed a generator object, and the intent
is simply to process a whole batch of files until the batch is exhausted, you
most likely don't need the 'while(True)...'; just iterate over the generator. Eg:

>>> import os
>>> l = (fl for fl in os.listdir('.'))
>>> l
<generator object <genexpr> at 0x7f1980cbf190>
>>> for i in l:
...     print i
...

Secondly, IIRC, you can pass a file handle to BeautifulSoup instead of a string.
This would avoid the need to read the entire file contents into memory in one go,
and it would also close the file when the object is reassigned (...also, IMHO, it
feels more pythonic). So do a:

with open(xml_file) as fl:
    soup = BeautifulSoup(fl)
    ...
    ...

Note also that you might want to put the other stuff 'inside' the 'with'
block ...or create your own context manager to clean up (ie: either call
gc.collect() or use soup.decompose() [1]).
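
Putting the two points together, a rough sketch of what the whole loop could look like (xml_file_generator and handle_tag below are just placeholders for your own file list and database-insert code):

import glob
from bs4 import BeautifulSoup   # BeautifulSoup 4; the BeautifulSoup 3 import is different

xml_file_generator = (path for path in glob.iglob("*.xml"))  # or however you build the list

def handle_tag(tag):
    # placeholder for your "extract contents + insert via Django model" step
    pass

for xml_file in xml_file_generator:     # iterate the generator directly, no while/try needed
    with open(xml_file) as fl:
        soup = BeautifulSoup(fl)        # pass the open file handle straight to BeautifulSoup
        for tag in soup.findAll("tr"):
            handle_tag(tag)
    soup.decompose()                    # release the parse tree before the next file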


> At some point while the program is running, the system also starts the "kswapd0" process. I have no
> clue what is going on.
>
kswapd is simply the process on a Linux system that is responsible for swap
memory management. It is run periodically by the kernel. If you've noticed that
kswapd runs more frequently when your script runs, it is possible that your
script is consuming most of the memory on the system, which might indicate a
memory leak ...yeah, a memory leak in Python -- it happens ...specifically when
using something like MySQLdb:

https://www.google.com.sg/search?q=mysqldb+memory+leak
https://www.google.com.sg/search?q=mysqldb+memory+leak+cursor

Without more code for the MySQL processing bits, it is hard to say. I'd be wary
of pointing fingers there tho', unless you have eliminated everything else.

If you still don't know what's happening, profiling might help. Check out the
profile/cProfile modules and something like memory_profiler.
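
For example, something along these lines with the stdlib profiler (parse_all() here is just a stand-in for your top-level function; memory_profiler works in a similar spirit with its @profile decorator):

import cProfile
import pstats

def parse_all():
    pass   # stand-in for the real parse-and-insert loop

cProfile.run("parse_all()", "parse.prof")   # run once and dump the stats to a file
pstats.Stats("parse.prof").sort_stats("cumulative").print_stats(20)   # top 20 calls by cumulative time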

Finally, here's a good general-purpose link, not directly related but relevant:
http://www.huyng.com/posts/python-performance-analysis/

hth,
cheers,
- steve


[1]
http://stackoverflow.com/questions/11284643/python-high-memory-usage-with-beautifulsoup

Yogesh Kamble

Apr 9, 2014, 5:34:36 AM
to pytho...@googlegroups.com
Found the issue. I am using a Django model to insert the data into the database, and each time I query the database through the model, Django stores the query in its internal query list (in short, in system memory) as long as DEBUG is True. I am parsing large RSS files, each containing around 500 entries, and for each entry I use the Django model to insert data, so Django keeps accumulating queries and the memory keeps increasing.

Refer: https://docs.djangoproject.com/en/1.6/faq/models/#why-is-django-leaking-memory

I know my program is not a web application, but I am just importing the Django models into my standalone Python program.
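
For anyone who hits the same thing, the two fixes the FAQ points at, roughly sketched (insert_feed() below is just a placeholder for my per-file insert code):

# Fix 1: make sure DEBUG = False in the settings module the script imports,
# so Django never keeps the query log in the first place.

# Fix 2: if DEBUG has to stay True, clear the accumulated log as you go:
from django import db

def insert_feed(tags):
    for tag in tags:
        pass              # placeholder for the existing "save via Django model" step
    db.reset_queries()    # drop the per-connection query log after each file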

