I've been trying to figure out what is causing the memory usage issues I
noticed while moving t.e.o to trunk. Since this seems to be limited to
larger Trac installations, I created a new environment on my laptop and
inserted 5000 new tickets [1] (t.e.o has almost 7000 tickets).
With this number of tickets the performance is pretty terrible for some
pages. Take "/report/1" for example: on my laptop it takes about 33
seconds to generate 4.7MB of HTML. What's even worse is that the trac
process' memory footprint (resident) balloons to around 220MB.
As far as I can tell most of this memory is used by Genshi while
serializing the template during the "stream.render()" method call in
chrome.py.
I've created a Linux-only patch [2] which monitors the resident memory
footprint before and after the "stream.render()" call. Four consecutive
"/report/1" requests give the following output:
before render(): 51888 kB
after render(): 209456 kB
before render(): 208936 kB
after render(): 221792 kB
before render(): 222040 kB
after render(): 222128 kB
before render(): 222128 kB
after render(): 222128 kB
And since Python 2.5.2 never seems to return "free" and garbage-collected
memory back to the operating system, this memory is bound to this
particular process until it is restarted.
This is bad news for people running multiple long-running processes,
since after a while all processes will have allocated enough RAM for the
most expensive page.
So far I've been unable to find a way to minimize the memory usage of
Genshi's serialization step. Unless we find a way to do that, I'm
afraid we might have to consider paging or some other technique to avoid
these expensive pages.
Cheers,
Jonas
[1]: http://pastebin.com/fc653f04
[2]: http://pastebin.com/f50ce659b
> And since python 2.5.2 never seems to return "free" and garbage
> collected memory back to the operating system this memory is bound to
> this particular process until it is restarted.
>
No, Python 2.5.2 is able to do that. If instead of looking only at
before/after memory usage, you monitor the memory used while processing
the request, you'll see that there's a peak usage (which I estimate at
around 350MB, given your numbers above are close to mine). So that's
130MB returned to the system. That's not enough, I agree, but it's
probably not because Python can't give back memory to the system; rather,
there are some leaks which prevent the memory from being freed.
> This is bad news for people running multiple long running processes
> since after a while all processes will have allocated enough ram for the
> most expensive page.
>
> So far I've been unable to find a way to minimize the memory usage of
> genshi's serialization step. But unless we find any way to do that I'm
> afraid we might have to consider paging or some other technique to avoid
> these expensive pages.
>
There is a good patch for report paging (#6127 / #216), so I'll move
that up in the queue.
But on the other hand, there might well be a report specific issue (or a
Genshi issue uncovered by the report_view.html template), as I also have
very big numbers (memory and time) for the reports in my tests.
-- Christian
Are you sure? If I monitor the process after it has been idle for
several minutes, the command "top" shows the resident memory usage as
220+MB. So as far as I can tell nothing is returned to the OS.
Which OS and Python version did you test this on?
>> This is bad news for people running multiple long running processes
>> since after a while all processes will have allocated enough ram for the
>> most expensive page.
>>
>> So far I've been unable to find a way to minimize the memory usage of
>> genshi's serialization step. But unless we find any way to do that I'm
>> afraid we might have to consider paging or some other technique to avoid
>> these expensive pages.
>>
>
> There is a good patch for report paging (#6127 / #216), so I'll move
> that up in the queue.
> But on the other hand, there might well be a report specific issue (or a
> Genshi issue uncovered by the report_view.html template), as I also have
> very big numbers (memory and time) for the reports in my tests.
For what it's worth /query and /timeline also have large memory
footprints for larger result sets.
/ Jonas
Yeah, I didn't say the memory was released after some time, I said that
the memory usage you witnessed was probably more like:
...
before render(): 209MB
*while rendering:* 350MB
after render(): 209MB
(for that, I use some very advanced memory monitoring tool :-)
monitor() { while true; do grep Vm /proc/$1/status; sleep 1; clear; done; }
I test on 64-bit Linux (SLES9), Python 2.5.1.
>>> This is bad news for people running multiple long running processes
>>> since after a while all processes will have allocated enough ram for the
>>> most expensive page.
>>>
>>> So far I've been unable to find a way to minimize the memory usage of
>>> genshi's serialization step. But unless we find any way to do that I'm
>>> afraid we might have to consider paging or some other technique to avoid
>>> these expensive pages.
>>>
>>>
>> There is a good patch for report paging (#6127 / #216), so I'll move
>> that up in the queue.
>> But on the other hand, there might well be a report specific issue (or a
>> Genshi issue uncovered by the report_view.html template), as I also have
>> very big numbers (memory and time) for the reports in my tests.
>>
>
> For what it's worth /query and /timeline also have large memory
> footprints for larger result sets.
>
>
Point taken. There are also patches for timeline paging, but I'm not
sure about them yet.
-- Christian
We are using mod_python with python 2.4 on apache 2.2.x on Windows.
Would we be hit by the leak when we upgrade to 0.11 beta? Or does this
only affect standalone trac servers?
Regards
Norbert Unterberg
Yes.
> We are using mod_python with python 2.4 on apache 2.2.x on Windows.
> Would we be hit by the leak when we upgrade to 0.11 beta? Or does this
> only affect standalone trac servers?
Again, yes. However, the extent to which you'll be affected seems to be
dependent on how many tickets you have and how heavily accessed your
site is.
Internally, I run multiple Trac instances using tracd. One site that
only uses the wiki is used by about ~50 people every day and doesn't show
the memory growth. The site that uses tickets only has ~300 of them, and
memory is still fairly low (~28 MB).
So, if you have a fairly small and infrequently hit site, you'll be fine.
-John
Jonas, et al:
I hate to break it to you, but this problem is not just related to
larger Trac installs. I just installed a brand new Trac for someone,
built at r6669. This Trac has 3 tickets, but it is failing
when running in a VM with 512 MB of memory under Virtuozzo. It is
using up so much memory that I cannot even run exit (no memory to spawn
a process). This is on Apache 2 with mod_python.
All I did was check in a couple cake-php instances, and boom when I
tried updating the tickets.
I don't know where to start in helping you find the problem but thought
I'd report it.
At work we are running a fairly large installation with a couple
thousand tickets on an older 0.11 (I'd have to check the version) and no
problems whatsoever.
-Scott
Forgot to mention: Genshi 0.5-r803.
-Scott
Any big svn repository involved here?
> All I did was check in a couple cake-php instances, and boom when I
> tried updating the tickets.
>
> I don't know where to start in helping you find the problem but thought
> I'd report it.
>
For which request does that happen? A report? The timeline? Please try
to find the "trigger" which makes it explode.
Could you eventually try to run it with tracd?
> At work we are running a fairly large installation with a couple
> thousand tickets on an older 0.11 (I'd have to check the version) and no
> problems whatsoever.
>
That would be very interesting. Please also report which Genshi version
you're using (if Genshi was already used with the 0.11dev you're using
at work).
-- Christian
Brand new svn repository - the only things in it are two copies of
cakephp and a couple of Unix shell scripts. Python 2.4, by the way.
>> All I did was check in a couple cake-php instances, and boom when I
>> tried updating the tickets.
>>
>> I don't know where to start in helping you find the problem but thought
>> I'd report it.
>>
>
> For which request does that happen? A report? The timeline? Please try
> to find the "trigger" which makes it explode.
> Could you eventually try to run it with tracd?
I would be more than willing to experiment. In this case, I went
into trac to review the ticket that was updated (to make
sure the svn hook was working), and it blew up. Nowhere near
timeline or report. Just view tickets.
I will try to "make" it happen tonite and see what I can find out.
I was gonna try running under cgi and also under tracd to see if
it makes a difference.
>> At work we are running a fairly large installation with a couple
>> thousand tickets on an older 0.11 (I'd have to check the version) and no
>> problems whatsoever.
>>
>
> That would be very interesting. Please also report which Genshi version
> you're using (if Genshi was already used with the 0.11dev you're using
> at work).
I'll have to look on the work server... looks like that trac merge was
from 10/4/2007... I'll have to go look up the revision number. Genshi
is 0.4.4, python 2.4.
Let me know what else you need.
-Scott
mod_python might be the key here. This memory problem is per Trac
process, and when using mod_python you might end up with tens or even
hundreds of Trac/apache processes unless you've configured apache
carefully. Each apache/trac process will after a while allocate about
20-80MB for smaller Trac installs and 100-500MB for very large
installs.
So when running on servers with a limited amount of RAM you're probably
better off using the fast_cgi frontend or tracd behind mod_proxy.
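(A rough sketch of the tracd-behind-mod_proxy setup, for anyone unfamiliar with it -- the port, path, and URL prefix below are placeholders, not values from this thread:)

```apache
# Run tracd as a single separate process, e.g.:
#   tracd --port 8000 /path/to/projects
# Then have Apache proxy requests to it:
ProxyPass        /trac http://localhost:8000/
ProxyPassReverse /trac http://localhost:8000/
```

This way only the one tracd process pays the peak-memory cost, instead of every Apache child.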
> All I did was check in a couple cake-php instances, and boom when I
> tried updating the tickets.
>
> I don't know where to start in helping you find the problem but thought
> I'd report it.
>
> At work we are running a fairly large installation with a couple
> thousand tickets on an older 0.11 (I'd have to check the version) and no
> problems whatsoever.
As Christian said in another thread, the exact revision number is very
interesting. Is that server also a 512MB vm running mod_python?
/ Jonas
Unfortunately, when it bombs, it bombs so hard the only thing I can do
is reboot the server. Not enough memory to do anything else.
> So when running on servers with a limited amount of ram you're probably
> better of using the fast_cgi frontend or tracd behind mod_proxy.
I'm gonna give fast_cgi a try later tonite or tomorrow. I'm also
trying to find out if I can get more ram for the vm, and if that makes
any diff at all.
>> All I did was check in a couple cake-php instances, and boom when I
>> tried updating the tickets.
>>
>> I don't know where to start in helping you find the problem but thought
>> I'd report it.
>>
>> At work we are running a fairly large installation with a couple
>> thousand tickets on an older 0.11 (I'd have to check the version) and no
>> problems whatsoever.
>
> As Christian said in another thread, the exact revision number is very
> interesting. Is that server also a 512MB vm running mod_python?
It is whatever version was current on 10-4-07. Now I just have to get
into trac, and see what that was. It is a VE, running under OpenVZ. I
haven't checked how much ram was allocated... let me go look... yeah,
512MB of RAM allocated.
I am gonna try a couple different things:
a) fastcgi
b) back to older 4.4 genshi
c) more ram in the VE
I'll let you know what combinations seem to fix it.
-Scott
> >
This is definitely very interesting and helpful. I did some further
testing and noticed that the behavior differs between normal `str`
strings and `unicode` strings:
>>> resident_size()
3780
>>> a = ["%i" % i for i in xrange(2**22)]
>>> resident_size()
239580
>>> del a
>>> resident_size()
4128 <-- Most memory returned to the OS
>>> a = [u"%i" % i for i in xrange(2**22)]
>>> resident_size()
434532
>>> del a
>>> resident_size()
401760 <-- Almost nothing returned to the OS
This shows that with Python 2.5.2, `unicode` objects, unlike `str`, do
not return memory when freed. This is bad news for Genshi, which uses
unicode and Markup (a unicode subclass) objects heavily.
Right now I'm trying to figure out if there is any way to make our
_speedups.c version of Markup not use this "free list" feature.
/ Jonas
> So when running on servers with a limited amount of ram you're probably
> better of using the fast_cgi frontend or tracd behind mod_proxy.
That's worth a try, but I think it's set at the default, which is only 5
processes... I'll go look.
-Scott
In current Python releases the int and float free lists are not garbage
collected and returned to the OS the way other allocations are in 2.5.
--
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/
Happiness is the absence of the striving for happiness...
I knew that, of course (slapping forehead)... thanks for reminding me...
I'm so focused on the trees that I'm not seeing the forest <grin>
-Scott
OK, I did find the version number of the one at work - it's r6049,
running with Python 2.4.3-19, mod_python 3.2.8-3.1 and Apache 2.2.3-7,
and using Genshi 0.4.4 under CentOS 5. It's been up ~200 days with over
2000 tickets in the db, and lots of code in svn. No problems at all,
and the Apache processes are pretty steady at around 45-50 MB each.
> I am gonna try a couple different things:
> a) fastcgi
Didn't try this yet
> b) back to older 4.4 genshi
Will try this next.
> c) more ram in the VE
Bumped RAM up to 1GB, seems to be stable (so far), with the memory usage
levelling out around 600MB. Of course, I don't have any users on this
system yet...
> I'll let you know what combinations seem to fix it.
No blowups yet today since I bumped the ram. I'm gonna try some things
and keep an eye on the process sizes to see what happens.
What do you want me to measure? I may try fastcgi, mod_wsgi, and
dropping back to genshi 0.4.4 as experiments. Is the apache process size
what you want monitored?
Thanks.
-scott
> it's probably not because Python can't give back memory to the system,
> rather because there are some leaks which prevent the memory to be
> freed.
This may be well known to everyone here, but -- just in case it's not --
I'm going to point out that this sort of runaway memory problem can be
caused by at least two kinds of "leaks":
* "True leaks" that involve C code failing to free allocated memory or
  decrement incremented reference counts. True leaks can only be fixed
  by the Python (or extension module) developers and worked around by
  Trac devs.
* "False leaks" in Python code, where one unintentionally keeps objects
  alive that should have been freed. One way to produce such a "false
  leak" is to have a system where all objects of a certain class are
  tracked in a global dict somewhere. For example:

      foo_by_name = {}  # look up Foo objects by name

      class Foo:
          def __init__(self, name):
              self.name = name
              foo_by_name[name] = self

False leaks are typical problems in software that relies on GC, and in
this case can only be fixed by Trac devs.
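(If such a registry really is needed, one common remedy for this kind of false leak is to hold the entries weakly -- a sketch, reusing the hypothetical `Foo` from the example above:)

```python
import gc
import weakref

# A registry that does not keep its entries alive: once the last strong
# reference to a Foo goes away, its dict entry disappears as well.
foo_by_name = weakref.WeakValueDictionary()

class Foo(object):
    def __init__(self, name):
        self.name = name
        foo_by_name[name] = self
```

Instances can still be looked up by name while they are alive, but they no longer survive merely because the registry references them.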
Sorry if this is old news to anyone.
--
Dave Abrahams
Boost Consulting
http://boost-consulting.com
So I've examined the /report/1 case again.
My tests are done on an old t.e.o clone, PostgreSQL backend, Python 2.5.1
on Linux x86_64 (SLES9), using tracd in single-threaded mode, with the
permission cache disabled (i.e. the setup described in #6614). Trac r6677
and Genshi r801.
Note that in the following, I give the numbers for VmRSS but the
increase seen is due to the corresponding increase in VmData, i.e. the heap.
I started with a simple /wiki/WikiStart request (full refresh), and this
gave me the following for VmRSS, noted after the request and once the gc
has finished running:
28144 kB
28252 kB 28252 kB 28252 kB
28256 kB 28256 kB 28256 kB
28260 kB 28260 kB
28268 kB 28268 kB 28268 kB
28284 kB 28284 kB
28292 kB 28292 kB
28304 kB
28316 kB 28316 kB 28316 kB
28328 kB ...
So it seems that there's a leak going on, but quite a small one (the
HTML page size is only 7.7 kB).
While rendering the page, the VmRSS peaked at approx. 31740 kB, for each
request.
The interesting thing is that the live object count stayed constant all
the time at 47201.
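(The live object count is the cyclic GC's view of the heap. A rough way to take such a snapshot -- my own present-day sketch, not the actual instrumentation from #6614, and collections.Counter postdates the Python versions in this thread:)

```python
import gc
from collections import Counter

def live_object_census():
    """Return (total count, per-type tally) of objects tracked by the GC."""
    objs = gc.get_objects()
    return len(objs), Counter(type(o).__name__ for o in objs)

total, by_type = live_object_census()
# by_type.most_common(10) shows which types dominate the tracked heap;
# a count that stays constant while RSS grows points away from a pure
# Python-level object leak.
```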
Now repeatedly querying /report/1 (full refresh) gives me the following
numbers:
41208 kB
43632 kB
45240 kB
46040 kB
47096 kB
48700 kB
49504 kB
50824 kB
...
59408 kB
Here the leak is quite a bit more visible (report 1 has 782 rows of
"real" data, translating into 850.3 kB of HTML).
The VmRSS peaked at approx 63500 kB, for each request.
And here as well, the object count remained constant at 50107.
> ...
> before render(): 51888 kB
> after render(): 209456 kB
> before render(): 208936 kB
> after render(): 221792 kB
> before render(): 222040 kB
> after render(): 222128 kB
> before render(): 222128 kB
> after render(): 222128 kB
>
It seems that you're reaching the "peak usage" far more quickly than me,
but the end result seems to be the same: at some point, data accumulates
in the heap until the RSS somehow grows to its peak usage. I was not
able to make it go beyond that, but maybe I was not patient enough...
What I can say is that when the memory usage is above 32MB (the peak
memory usage for the /wiki/WikiStart request), repeatedly querying
/wiki/WikiStart doesn't show any memory increase anymore.
> And since python 2.5.2 never seems to return "free" and garbage
> collected memory back to the operating system this memory is bound to
> this particular process until it is restarted.
> This is bad news for people running multiple long running processes
> since after a while all processes will have allocated enough ram for the
> most expensive page.
>
> So far I've been unable to find a way to minimize the memory usage of
> genshi's serialization step.
It's not only the serialization step, apparently.
When retrieving the above /report/1 in .csv format, I get:
31656 kB
31748 kB
32860 kB
32924 kB
33000 kB
33052 kB
etc.
No peak memory usage seen, but a regular increase of VmRSS and, again, a
constant object count (44385 here).
<warning: the following may or may not be of interest>
I'm attaching the dump of the live objects for the regular /report/1
(report-1.log) and the /report/1?format=csv (report-1-csv.log),
classified by type (as output by the debug-mem-growth.diff on #6614).
What is strange is that in the .csv case, trac.resource.Resource objects
seem to be leaked. I first suspected the PermissionCache, but after
doing some normal /report/1 queries after the .csv one
(report-1-csv-and-html-again.log), the PermissionCache disappears but
not the 215 Resource objects (and why 215, when I have 782 tickets?). Weird.
</you've been warned>
-- Christian
PS: trying the patch I wrote for Genshi #190 or trying not to use
itertools.chain didn't provide any notable difference, nor did disabling
_speedups.so.
1) Not related to Trac at all
2) Not related to any of the things discussed in this thread
3) You are using stackless, so you should probably ask on a stackless
list
--Noah