Python's memory hogging


Vlad K.

unread,
May 5, 2012, 7:42:34 PM5/5/12
to pylons-...@googlegroups.com

Hi all.


As I understand it, Python won't release internally freed memory back
to the OS. So in my Pyramid app, which occasionally has to
process/construct rather largish XML files, memory consumption jumps to
several hundred MB and stays there for good, or until I recycle the
processes (uwsgi reload). This is pretty bad in my case because I'm
forced to have a rather large memory headroom on the server just to
handle those peaks which comprise less than 1% of total requests
throughout the day. Hell, it's way less than 1%. A single request for
such an XML file takes anywhere from 1 to 10 seconds on this
particular VPS "hardware", and there are maybe a few dozen such requests
throughout the day (out of several thousand daily requests). So what
should really be a transient memory consumption peak lasting up to 10
seconds becomes a permanent memory requirement (or until the uwsgi
processes are reloaded).

Now I know I can set max-requests for the uwsgi processes and it will
recycle automatically, but is there another solution? I find this
reluctance of Python to return memory to the OS very annoying, right next to
the GIL.



--

.oO V Oo.

Chris McDonough

unread,
May 5, 2012, 8:37:11 PM5/5/12
to pylons-...@googlegroups.com
Commenting on how Python does or doesn't return memory to the OS is
above my current pay grade. Because I don't understand it, and have no
competence to try to "fix" it, I have to work around it. I usually try
to do that by writing code that doesn't ask for hundreds of megabytes all in
one shot from the OS.

For example, maybe you can construct the XML in portions instead of
constructing a huge collection of strings that all reside in memory at once.
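A minimal sketch of that idea, using only the stdlib; the element names and the `rows` iterable are made up for illustration:

```python
from xml.sax.saxutils import escape

def xml_chunks(rows):
    """Yield the XML document piece by piece instead of building it all in RAM."""
    yield '<?xml version="1.0" encoding="UTF-8"?>\n<records>'
    for row in rows:
        # Each chunk is a small string; only one row's worth is alive at a time.
        yield '<record><name>%s</name></record>' % escape(row['name'])
    yield '</records>'

# A WSGI response can consume this iterable directly as the body;
# joining it here is only for demonstration.
document = ''.join(xml_chunks([{'name': 'a & b'}, {'name': 'c'}]))
```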

If there's a memory leak somewhere outside of my control, what I usually
do to work around it is to use supervisor + memmon to automatically
restart my processes when they use more than some amount of memory:

http://www.plope.com/Members/chrism/memmon_sample

This isn't immediately useful if your processes are spawned by uwsgi,
but might be useful if you choose to use a different frontend server
like Apache or nginx to proxy to a number of backend supervisor-managed
Pyramid processes.

- C

Vlad K.

unread,
May 5, 2012, 10:27:19 PM5/5/12
to pylons-...@googlegroups.com

I don't think it's a leak, because the consumed memory is constant, i.e.
the maximum requested at a certain point in the life of the process, and
it stays there for the same data set regardless of how often it is
requested.

I'm not constructing the XML with strings directly but using lxml (and
not the builtin xml.etree, because I use the much faster lxml with C
extensions for parsing and XSD validation as well), which is perhaps
overkill. Yeah, I can see several ways to chop that up into smaller
pieces, append stuff to a temp file on disk, and then return that with a
generator or even X-Sendfile, but I was hoping for a more
"straightforward" solution not requiring a rewrite.

I've got about 20 XML file formats to construct, each produced by its
own (pluggable) module because each defines its own format etc.
Switching to file-based partial generation would mean a massive
rewrite. I guess this is one of the situations where "premature
optimization is evil" bit me, because I _was_ concerned with performance
when I started building the system. Admittedly, at that point I wasn't
aware of Python's memory retention "policy".

I was hoping there's a way to safely kill a wsgi process from within it,
I could do that only when such largish XML files are requested, or
something else not obvious to me. Doesn't have to be uwsgi, though.
Another way would be to remove XML generation from the wsgi application
altogether into a separate callable and do some subprocess.Popen magick
with minimum rewrite required. Is that even wise/advisable under
(u)wsgi? Spawning external processes?


Thanks,

.oO V Oo.

Roberto De Ioris

unread,
May 6, 2012, 1:31:23 AM5/6/12
to pylons-...@googlegroups.com

>
>
> I was hoping there's a way to safely kill a wsgi process from within it,
> I could do that only when such largish XML files are requested, or
> something else not obvious to me. Doesn't have to be uwsgi, though.


But why move away from uWSGI if it already gives you all of the features
you need to work around your problem without installing other software?
(Remember, uWSGI is not about speed, as a lot of people still think; it is
about features.)

As options:

--memory-report (report memory usage in logs after each request)
--reload-on-rss <n> (automatically reload when a process consumes more than
<n> megs of RSS memory)
--evil-reload-on-rss (same as --reload-on-rss but asynchronously called by
the master [dangerous])
--never-swap (force the uWSGI stack to not use swap memory, triggering OOM
in case of memory problems)

From the app itself:

uwsgi.mem() -> returns the memory usage of the current process
uwsgi.reload() -> triggers a graceful reload
uwsgi.workers() -> returns a dictionary with workers' data (including memory
usage)
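A hedged sketch of how those hooks might be combined inside the app. The `uwsgi` module is only importable when running under uWSGI, so the decision logic is factored into a plain function; the 200 MB threshold is just an example value:

```python
def should_reload(rss_bytes, limit_bytes=200 * 1024 * 1024):
    """Return True when the worker's RSS exceeds the configured limit."""
    return rss_bytes > limit_bytes

def check_memory_and_reload():
    """Call at the end of a heavy request; a no-op outside uWSGI."""
    try:
        import uwsgi  # only available inside a uWSGI worker
    except ImportError:
        return
    rss = uwsgi.mem()[0]  # uwsgi.mem() returns (rss, vsz)
    if should_reload(rss):
        uwsgi.reload()  # graceful reload
```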


> Another way would be to remove XML generation from the wsgi application
> altogether into a separate callable and do some subprocess.Popen magick
> with minimum rewrite required.
> Is that even wise/advisable under
> (u)wsgi? Spawning external processes?
>

This is generally not good programming behaviour (if you do it during a
request), but for your specific case it could be a good hack. Remember to
add the --close-on-exec option to uWSGI to keep your subprocesses from
inheriting uWSGI's sockets.
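A minimal sketch of that hack; the generation script's name and arguments are hypothetical. On the Python side, subprocess can also close inherited descriptors itself via close_fds, which complements uWSGI's --close-on-exec:

```python
import subprocess
import sys

def generate_xml_out_of_process(script, dest_path):
    """Run the XML generation in a child process so its memory dies with it."""
    proc = subprocess.Popen(
        [sys.executable, script, dest_path],
        close_fds=True,          # don't leak the server's sockets to the child
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(err.decode())
    return out

# Stand-in child process that just prints; a real one would write the XML file.
output = subprocess.check_output(
    [sys.executable, '-c', "print('done')"], close_fds=True)
```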

--
Roberto De Ioris
http://unbit.it

Malthe Borch

unread,
May 6, 2012, 2:55:21 AM5/6/12
to pylons-...@googlegroups.com
On 6 May 2012 04:27, Vlad K. <vl...@haronmedia.com> wrote:
> I've got about 20 XML file formats to construct, each produced by its own
> (pluggable) module because each defines its own format etc... Switching to a
> file-based partial generation would mean a massive rewrite. I guess this is
> one of the situations where "preemptive optimization is evil" bit me because
> I _was_ concerned with performance when I started building the system.
> Admittedly at that point I wasn't aware of the Pythons memory retention
> "policy".

If you want to explore the issue a bit further, you can get another
data point and possibly a solution by compiling Python without its own
memory manager. In this case, release really should mean release, or
there's something wrong with your operating system :-).

\malthe

Vlad K.

unread,
May 6, 2012, 5:42:47 AM5/6/12
to pylons-...@googlegroups.com


On 05/06/2012 07:31 AM, Roberto De Ioris wrote:
> But why moving away from uWSGI if it already give you all of the features
> you need to bypass your problem without installing other softwares ?
> (remember, uWSGI is not about speed as lot of people still think, it is
> about features)
>
> As options:
>
> --memory-report (report memory usage in logs after each request)
> --reload-on-rss<n> (automatically reload when a process consume more than
> <n> megs of rss memory)
> --evil-reload-on-rss (same as --reload-on-rss but asynchronously called by
> the master [dangerous])
> --never-swap (force the uWSGI stack to not use swap memory, triggering OOM
> in case of memory problems)

This. Is. Awesome. That's why I love uwsgi. I did look through the docs
but I guess I totally missed these settings.

So I am assuming --reload-on-* would recycle a worker when idle? After a
request that triggered that much resource consumption? Am I assuming
correctly that evil-reload would not wait for the worker to finish the
request? The docs are not clear on this.



> From the app itself:
>
> uwsg.mem() -> returns memory usage of the current process
> uwsgi.reload() -> trigger a graceful reload
> uwsgi.workers() -> returns a dictionary with workers data (included memory
> usage)

Excellent!

> This is generally not a good programming behaviour (if you do it during a
> request), but for your specific case it could be a good hack. Remember to
> add --close-on-exec option to uWSGI avoiding your subprocesses to inherit
> uWSGI sockets.
>

Good advice, thanks.


.oO V Oo.





Vlad K.

unread,
May 6, 2012, 5:47:30 AM5/6/12
to pylons-...@googlegroups.com
This sounds like something I wouldn't want to do in production unless
I'm 100% sure I know what I'm doing, and I'm not :)

I didn't know it was possible, and it's definitely something I should toy
with outside of production to see how it works. Thanks for the suggestion.


.oO V Oo.


Roberto De Ioris

unread,
May 6, 2012, 6:33:16 AM5/6/12
to pylons-...@googlegroups.com

>
>
> On 05/06/2012 07:31 AM, Roberto De Ioris wrote:
>> But why moving away from uWSGI if it already give you all of the
>> features
>> you need to bypass your problem without installing other softwares ?
>> (remember, uWSGI is not about speed as lot of people still think, it is
>> about features)
>>
>> As options:
>>
>> --memory-report (report memory usage in logs after each request)
>> --reload-on-rss<n> (automatically reload when a process consume more
>> than
>> <n> megs of rss memory)
>> --evil-reload-on-rss (same as --reload-on-rss but asynchronously called
>> by
>> the master [dangerous])
>> --never-swap (force the uWSGI stack to not use swap memory, triggering
>> OOM
>> in case of memory problems)
>
> This. Is. Awesome. That's why I love uwsgi. I did look through the docs
> but I guess I totally missed these settings.
>
> So I am assuming --reload-on-* would recycle a worker when idle? After a
> request that triggered that much resource consumption?

The non-evil options trigger the check right after each request, while the
"evil" one can trigger the worker reload as soon as the master detects the
limit (so it could destroy the worker in the middle of a request).

Ben Bangert

unread,
May 6, 2012, 4:11:39 PM5/6/12
to pylons-...@googlegroups.com
On 5/5/12 4:42 PM, Vlad K. wrote:

> As I understand it Python won't release internally released memory back
> to OS.

This is actually no longer the case, I believe this behavior was fixed
in Python 2.6 or so. If you're curious about Python memory allocation,
there's also another interesting phenomenon in Python with memory
fragmentation. I found this presentation quite fascinating regarding how
Python handles memory:
http://revista.python.org.ar/2/en/html/memory-fragmentation.html

> So it happens in my Pyramid app that has to process/construct
> rather largish XML files occasionally, the memory consumption jumps to
> several hundred MB and stays there for good or until I recycle the
> processes (uwsgi reload).

Which version of Python? You might try running some of the memory
profiling tools, perhaps you still have a reference to the XML structure
you weren't aware of, and that's why it wasn't freed up.

Alternatively, if that's really bad, consider generating the XML with
generators so that you never have the entire thing in RAM at once.

> This is pretty bad in my case because I'm
> forced to have a rather large memory headroom on the server just to
> handle those peaks which comprise less than 1% of total requests
> throughout the day. Hell, it's way less than 1%. A single request for
> such XML file takes anywhere from 1 to up to 10 seconds on this
> particular VPS "hardware", and there are maybe a few dozen such requests
> throughout the day (out of several thousand daily requests). So what
> should really be a transient memory consumption peak lasting up to 10
> seconds becomes permanent memory requirement (or until the uwsgi
> processes are reloaded).
>
> Now I know I can set max-requests for the uwsgi processes and it will
> recycle automatically, but is there another solution? I find this
> reluctance of Python to return memory to OS very annoying, right next to
> the GIL.

There are a few interesting alternatives for generating large XML files
without the memory hit here:
http://stackoverflow.com/questions/3049188/generating-very-large-xml-files-in-python

--
Ben Bangert
(ben@ || http://) groovie.org


Graham Higgins

unread,
May 6, 2012, 8:58:25 PM5/6/12
to pylons-...@googlegroups.com
On Sun, 2012-05-06 at 01:42 +0200, Vlad K. wrote:

> As I understand it Python won't release internally released memory back
> to OS.

Glad to read that you have an acceptable solution.

The issue piqued my curiosity and I found this from June 2010: "we have
recently experienced problems with Python not giving back memory to the
OS in Linux. It reuses allocated memory internally, but never releases
free memory back to the OS":

http://pushingtheweb.com/2010/06/python-and-tcmalloc/

in which a comment points to a Python 3.3 (alpha 01) hg commit that
fixes the reported problem:

http://hg.python.org/cpython/rev/f8a697bc3ca8

The discussion on the issue ticket associated with the commit:

http://bugs.python.org/issue11849

contains further details, including an example of exactly how creating
large XML files in Python might cause the memory retention via lingering
refs --- just as Ben describes.

AIUI, the standard approach to handling large XML files is to use a
stream processor such as SAX.


finally:
self.curiosity.piqued = False

Cheers,

--
Graham Higgins

http://bel-epa.com/gjh/

John W. Shipman

unread,
May 6, 2012, 9:14:29 PM5/6/12
to pylons-...@googlegroups.com, John Shipman
On Mon, 7 May 2012, Graham Higgins wrote:

+--
| (deletia)
| The discussion on the issue ticket associated with the commit:
|
| http://bugs.python.org/issue11849
|
| contains further details, including an example of exactly how creating
^^^^^^^^
| large XML files in Python might cause the memory retention via lingering
| refs --- just as Ben describes.
|
| AIUI, the standard approach to handling large XML files is to use a
| stream processor such as SAX.
+--

Since when does SAX have an OUTPUT option?! I just looked at
www.saxproject.org, and it seems to be all about reading.

If somebody has a nice general solution to *generating* large XML
files, as opposed to reading them, I'd love to hear it. I have
an application that builds a large HTML table from a database.
I'm currently using lxml (which is an ElementTree variant), and
it's limited by the memory available on the server.

I suppose I could roll my own low-memory XML generator,
but I can't believe no one has done this already!

Best regards,
John Shipman (jo...@nmt.edu), Applications Specialist, NM Tech Computer Center,
Speare 146, Socorro, NM 87801, (575) 835-5735, http://www.nmt.edu/~john
``Let's go outside and commiserate with nature.'' --Dave Farber

Vlad K.

unread,
May 7, 2012, 9:42:07 AM5/7/12
to pylons-...@googlegroups.com

On 05/06/2012 10:11 PM, Ben Bangert wrote:
> I found this presentation quite fascinating regarding how
> Python handles memory:
> http://revista.python.org.ar/2/en/html/memory-fragmentation.html

Oh yeah, I've seen that presentation before, it is very interesting.



> Which version of Python? You might try running some of the memory
> profiling tools, perhaps you still have a reference to the XML structure
> you weren't aware of, and that's why it wasn't freed up.
>
> Alternatively, if that's really bad, consider generating the XML with
> generators so that you never have the entire thing in RAM at once.

It's Python 2.6.6, the default on CentOS 6.2, 32-bit. I'll check with a
profiler to see if anything lingers in the lxml namespace, because the
objects I'm creating directly live only within the view handler.

I'm not sure I can use generators directly on the XML, because the XML is
basically a huge list of dictionaries taken from the database, processed
and transformed into adequate XML data, and occasionally some exception
occurs that MUST prevent generation of the XML, i.e. return an HTTP error
response rather than 200 and broken XML. The only alternative is to
output to a temp file and then use a generator to feed that file back to
the client.

With that, I assume I can put a tempfile.TemporaryFile object in env and
expect it to be closed/destroyed AFTER the request is complete?
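That is how the WSGI spec (PEP 333) has it: the server calls close() on the app_iter when the response finishes, and a temporary file is deleted on close. A small stdlib sketch of that lifecycle, with NamedTemporaryFile standing in for TemporaryFile so the deletion is observable:

```python
import os
import tempfile

# NamedTemporaryFile(delete=True) behaves like TemporaryFile but exposes a path.
tmp = tempfile.NamedTemporaryFile(delete=True, suffix=".xml")
tmp.write(b"<records/>")
tmp.seek(0)
path = tmp.name
assert os.path.exists(path)

body = tmp.read()   # what the server would stream to the client
tmp.close()         # what a WSGI server does with app_iter at request end
# Closing removed the file: the temp data lives only as long as the request.
```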


> There's a few interesting alternatives to generating large XML files
> without the memory hit here:
> http://stackoverflow.com/questions/3049188/generating-very-large-xml-files-in-python
>

Interesting, thanks, and that's just a variant of the tempfile approach.


.oO V Oo.




Jonathan Vanasco

unread,
May 7, 2012, 11:37:22 AM5/7/12
to pylons-discuss
fwiw, I used to run into issues like this a lot under mod_perl. the
apache process would lay claim to all the memory it ever used until a
restart ( or max children is reached ).

I used a few workarounds when i needed large data processing :
- i called an external process and collected the results
- i used a semaphore in the database or memcached that the webapp
would poll for results
- if i needed the process to run on the same codebase, i'd run another
instance of apache on an alternate port and then proxy only the 'large
requests' to there
- eventually i would refactor the code to use a SOA setup and have a
dedicated daemon handle the large stuff.

Vlad K.

unread,
May 7, 2012, 11:50:47 AM5/7/12
to pylons-...@googlegroups.com

On 05/07/2012 05:37 PM, Jonathan Vanasco wrote:
> - eventually i would refactor the code to use a SOA setup and have a
> dedicated daemon handle the large stuff.


But doesn't that suffer from the same set of problems? If the (daemon)
process persists in memory, it doesn't matter if it's the wsgi app or a
SOA approach. Unless you put it on a different server, but it then
saturates memory there instead of the wsgi app server.


I think overall only a handful of solutions are acceptable:

1. build XML in smaller chunks to a tempfile, yield it back to client
2. subprocess.Popen a script external to the persisted wsgi app
3. reload process if it exceeds certain memory threshold, or
periodically after X requests


I also wonder what will happen when I get to PDF generation. I'll have
to generate similarly large PDFs with potentially thousands of pages
(prepress service). I have experience with ReportLab's tools, albeit for
much smaller sets and it works well. At least with that I am sure I can
produce smaller chunks and then concat them.





.oO V Oo.


Chris Lambacher

unread,
May 7, 2012, 1:21:35 PM5/7/12
to pylons-...@googlegroups.com
On Mon, May 7, 2012 at 11:50 AM, Vlad K. <vl...@haronmedia.com> wrote:
1. build XML in smaller chunks to a tempfile, yield it back to client

I have taken this approach in the past. Indeed, if you use tempfile.TemporaryFile (http://docs.python.org/library/tempfile.html#tempfile.TemporaryFile), this will work as you expect. In my particular case, I have the database generate the XML row-wise (i.e. if my structure is <records><record>...</record><record>...</record></records>, I have it generate a bunch of <record>...</record> strings as the rows are selected out of the database). You can get the same effect with ElementTree (not sure about lxml because it has extra constraints) by doing (with more error checking):

import tempfile
import xml.etree.ElementTree as ET

# `cursor` is a DB-API cursor with the rows already selected
fd = tempfile.TemporaryFile()
fd.write('<records>')
while True:
    rows = cursor.fetchmany(2000)
    if not rows:
        break

    records = []
    for row in rows:
        record = ET.Element('record')
        ET.SubElement(record, 'somechild').text = row.data
        records.append(ET.tostring(record))

    fd.write("".join(records))

fd.write('</records>')

note that the 2000 in the fetchmany should be modified to fit your memory and performance requirements.

I have also seen mention of several generation-only XML libraries, but I can't remember which ones; I also have not used any of them.

-Chris

--
Christopher Lambacher
ch...@kateandchris.net

Jonathan Vanasco

unread,
May 7, 2012, 6:42:43 PM5/7/12
to pylons-discuss


On May 7, 11:50 am, "Vlad K." <v...@haronmedia.com> wrote:
> On 05/07/2012 05:37 PM, Jonathan Vanasco wrote:
>
> > - eventually i would refactor the code to use a SOA setup and have a
> > dedicated daemon handle the large stuff.
>
> But doesn't that suffer from the same set of problems? If the (daemon)
> process persists in memory, it doesn't matter if it's the wsgi app or a
> SOA approach. Unless you put it on a different server, but it then
> saturates memory there instead of the wsgi app server.

Yes, but if you're at the size where it needs to be a SOA, you
pretty much need to have that daemon running nonstop. So you have a
single process that is 'eternally' allocated 256MB (or whatever) and
does all the grunt work, and you never run into an issue with
multiple app servers spiking up the memory usage.

Vlad K.

unread,
May 7, 2012, 6:48:59 PM5/7/12
to pylons-...@googlegroups.com

On 05/08/2012 12:42 AM, Jonathan Vanasco wrote:
> Yes. but if you're at the size where it needs to be a SOA , you
> pretty much need to have that daemon running nonstop. So you have a
> single process that is 'eternally' allocated 256MB (or whatever) and
> does all the grunt work - and you never run into an issue with
> multiple app servers spiking up the memory usage.
>



Right, and with that you basically keep the site up at all times even
though the little task spinner is spinning a bit longer during heavy load.

I'm considering something like this for another app. While not XML
related, it will produce rather largish PDFs on demand. I'm considering
Celery for that and put the grunts on separate machine(s).




.oO V Oo.


Graham Higgins

unread,
May 7, 2012, 7:50:20 PM5/7/12
to pylons-...@googlegroups.com, John Shipman
On Sun, 2012-05-06 at 19:14 -0600, John W. Shipman wrote:
> On Mon, 7 May 2012, Graham Higgins wrote:

> | AIUI, the standard approach to handling large XML files is to use a
> | stream processor such as SAX.

> Since when does SAX have an OUTPUT option?! I just looked at
> www.saxproject.org, and it seems to be all about reading.


Python's xml.sax library goes further

"Most users think of SAX as an XML input system, which is generally
correct; because, however, of some goodies in Python's SAX
implementation, you can also use it as an XML output tool."

from: "Using SAX for Proper XML Output"
http://www.xml.com/pub/a/2003/03/12/py-xml.html


specifically:



From the OP's first post:

"... Pyramid app that has to process/construct rather largish XML files"

and the second:

"I use much faster lxml with C extensions for parsing and xsd validation
as well"

Suggests that there's more than just generation going on.

Graham Higgins

unread,
May 7, 2012, 7:53:11 PM5/7/12
to pylons-...@googlegroups.com, John Shipman
On Tue, 2012-05-08 at 00:50 +0100, Graham Higgins wrote:
> specifically:

specifically:

http://docs.python.org/library/xml.sax.utils.html#xml.sax.saxutils.XMLGenerator
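For reference, a self-contained sketch of XMLGenerator used as an output tool, emitting elements one at a time so only the current element is in memory (the element names are illustrative):

```python
import io
from xml.sax.saxutils import XMLGenerator

out = io.StringIO()                # stands in for a real file or socket
gen = XMLGenerator(out, encoding='utf-8')
gen.startDocument()
gen.startElement('records', {})
for name in ['a', 'b']:            # imagine streaming rows from a database
    gen.startElement('record', {'name': name})
    gen.endElement('record')
gen.endElement('records')
gen.endDocument()

xml_text = out.getvalue()
```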

Vlad K.

unread,
May 15, 2012, 9:54:09 AM5/15/12
to pylons-...@googlegroups.com

Okay, followup on this problem.

I've now replaced one large lxml.etree root with chunked writing into a tempfile. The XML view basically does this (debug heapy output included):



hp = guppy.hpy()
print "================ BEFORE TEMPFILE"
print hp.heap()

xml_model = XMLModel()
xmlfile = tempfile.TemporaryFile()
xmlfile.write(xml_model.begin())        # write XML "doctype" and start root tag

for row in session.query(DataModel).filter(....).all():
    # turn a single row into its XML representation, using lxml.Element to
    # construct it and lxml.tostring(element) to return it as a string,
    # which is then written to the tempfile

    xmlfile.write(xml_model.process_row(row))

xmlfile.write(xml_model.end())          # write closing root tag

print "================ BEFORE ITERATOR RETURN"
print hp.heap()

# Return file as app iterator
xmlfile.seek(0)
return Response(app_iter=xmlfile, content_type="application/xml")




The problems I've encountered:

1. If I use SpooledTemporaryFile, it eats up memory that is apparently Python's and thus not returned to the OS, so I'll just use tmpfs to minimize disk writes on tens of thousands of rows.
2. The whole application apparently uses 121M of memory as shown by top, while guppy/heapy says 18M of heap is used. What follows is the output of uwsgi.log; I've restarted the app and there is no
request to it except this single request for a large XML "file". The resulting output "file" is 7M.

================ BEFORE TEMPFILE
Partition of a set of 192674 objects. Total size = 16970368 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  80163  42  6728540  40   6728540  40 str
     1  44155  23  1673300  10   8401840  50 tuple
     2   7068   4  1554336   9   9956176  59 dict (no owner)
     3  10675   6   725900   4  10682076  63 types.CodeType
     4   1539   1   687936   4  11370012  67 type
     5    475   0   654424   4  12024436  71 dict of module
     6  11259   6   630504   4  12654940  75 function
     7   1539   1   614808   4  13269748  78 dict of type
     8   3544   2   262048   2  13531796  80 list
     9   1817   1   235916   1  13767712  81 unicode
<593 more rows. Type e.g. '_.more' to view.>
================ BEFORE ITERATOR RETURN
Partition of a set of 194294 objects. Total size = 18709172 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  80558  41  6739812  36   6739812  36 str
     1  44243  23  1676316   9   8416128  45 tuple
     2   7214   4  1596080   9  10012208  54 dict (no owner)
     3      1   0  1573008   8  11585216  62 sqlalchemy.orm.identity.WeakInstanceDict
     4  10675   5   725900   4  12311116  66 types.CodeType
     5   1539   1   687936   4  12999052  69 type
     6    475   0   654424   3  13653476  73 dict of module
     7  11264   6   630784   3  14284260  76 function
     8   1539   1   614808   3  14899068  80 dict of type
     9   3679   2   268960   1  15168028  81 list
<628 more rows. Type e.g. '_.more' to view.>



The idle, just started application is reported by top to use:

VIRT: 36216
RES: 23m
SHR: 5276


After single request to the above XML view:

VIRT: 135m
RES: 121m
SHR: 3692

I'm reporting the worker uwsgi process because the master parent process is unchanged (23m).


I am obviously doing something wrong or not understanding the processes involved, because heapy shows little change before and after processing (a total heap of 18M), while top says 120M is eaten by that process, and the resulting XML "file" is only 7M large.


.oO V Oo.

Ceri Storey

unread,
May 17, 2012, 8:44:37 AM5/17/12
to pylons-...@googlegroups.com


On Tuesday, May 15, 2012 2:54:09 PM UTC+1, Vlad K. wrote:

Okay, followup on this problem.

I've now replaced one large lxml.etree root with chunked writing into a tempfile. The XML view basically does this (debug heapy output included):

[...]

The idle, just started application is reported by top to use:

VIRT: 36216
RES: 23m
SHR: 5276


After single request to the above XML view:

VIRT: 135m
RES: 121m
SHR: 3692

I'm reporting the worker uwsgi process because the master parent process is unchanged (23m).


I am obviously doing something wrong or not understanding processes involved, because the heapy is showing little change before and after processing, total heap of 18M and top says 120M is eaten by that process, and the resulting XML "file" is only 7M large.

As has been implied elsewhere in this thread, this is likely down to memory fragmentation. By and large, Python requests memory from the operating system via malloc, which in turn requests memory from the kernel in chunks with either the brk or mmap system calls. These will normally be relatively large blocks compared to a single Python object: normally 4k (a page) for mmap, or a single extending block in the case of brk.

So, because the OS layer manages memory in bigger blocks than Python does, if you have one small object (say 100 bytes) left allocated on a 4k allocation from the OS, then naturally you can't release that block back to the OS, and Python can't optimise the memory usage (i.e. compact the heap) to make it releasable.

So, with the results you're seeing, fragmentation is likely the cause.

Also, it's worth bearing in mind that heapy will only record objects allocated and tracked by python, which will of course be a subset of the total memory allocated to python by the os (which is what the virtual size (VIRT) measures). I'd guess that if you were to take heapy snapshots part way through creating the XML file, then you would see a bigger allocation of objects.

I hope this is somehow useful,

Ceri.

Vlad K.

unread,
May 17, 2012, 11:55:01 AM5/17/12
to pylons-...@googlegroups.com

On 05/17/2012 02:44 PM, Ceri Storey wrote:
> Also, it's worth bearing in mind that heapy will only record objects
> allocated and tracked by python, which will of course be a subset of
> the total memory allocated to python by the os (which is what the
> virtual size (VIRT) measures). I'd guess that if you were to take
> heapy snapshots part way through creating the XML file, then you would
> see a bigger allocation of objects.
>
> I hope this is somehow useful,
>

It is. :) I was just about to post an update. I have isolated the
problem to be with SQLA+psycopg2 somewhere, because of the large number
of rows queried from the database at once. Switching to a tempfile did
reduce the memory from the one large ElementTree by some 20%; the rest is
caused by the ORM'ed data itself. There are workarounds I'm pursuing on
the SQLA list, which basically involve reading the least amount of data
possible at once, so that the request handles the tens of thousands of
fully related SQLA objects not all at once but in subsets.

For reference:
https://groups.google.com/d/msg/sqlalchemy/yAlrzMAH-Yk/FiNzFknjezMJ
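The batching idea discussed there, sketched with stdlib sqlite3 instead of SQLAlchemy+psycopg2 (the table and column names are made up): fetch a window of rows, turn them into XML, drop them, repeat, so only one window is materialized at a time.

```python
import sqlite3
from xml.sax.saxutils import escape

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (name TEXT)")
conn.executemany("INSERT INTO data VALUES (?)",
                 [("row%d" % i,) for i in range(10)])

chunks = ["<records>"]
cur = conn.execute("SELECT name FROM data")
while True:
    batch = cur.fetchmany(4)   # window size tuned to the memory budget
    if not batch:
        break
    for (name,) in batch:
        chunks.append("<record>%s</record>" % escape(name))
    # each batch becomes garbage-collectable before the next one is fetched

chunks.append("</records>")
xml_text = "".join(chunks)
```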


.oO V Oo.

