I've just upgraded t.e.o to Trac 0.11 beta1. So far everything appears
to be working, but please let me know if you notice anything broken by
this change.
Cheers,
Jonas
I found a small issue with Opera; can anyone else with a 0.11 beta
verify this:
Go to newticket and click the component drop-down menu:
For Opera 9.24, I see the selected option jump to the top choice. Even
if I select another option, click somewhere else in the chrome to
unfocus the drop-down menu, and then click the drop-down menu again, it
jumps back to the top option, 'attachment'.
For Opera 9.50, watch the 'general' selection get cleared. Even if I
select another option, click somewhere else in the chrome to unfocus
the drop-down menu, and then click the drop-down menu again, it clears
the selection again.
Under Firefox 2 this does not occur and 'general' stays selected.
--
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/
In every stone sleeps a crystal...
I can confirm this for 9.24, but I only see this behaviour on t.e.o,
not on my local test site. Even if I add enough components that a
scrollbar is displayed, it works correctly. The only difference is that
on my test site there is enough room underneath the select box for the
options to be displayed below it, whereas on t.e.o the options are
displayed above the select box. To me this indicates an Opera bug, not
a Trac bug?
I noticed another detail, however. When I add components on the admin
page and then click New Ticket, the new components are not shown in the
drop-down until I refresh the whole page. A cache problem or something?
- Endre
I've not yet identified which pages require this much memory, but I
would not be surprised if it's some of the more expensive version
control pages.
As far as I know, memory allocated by the Python interpreter is almost
never returned to the operating system, so a single memory-hungry page
can cause this type of problem with long-running processes such as
FastCGI and tracd.
For comparison, with Trac 0.10 each Trac process on t.e.o never seems
to allocate more than 100MB. And this server is 64-bit, so memory usage
should be even lower on 32-bit systems.
Some relevant reading:
http://wingolog.org/archives/2007/11/27/reducing-the-footprint-of-python-applications
Cheers,
Jonas
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22732 www-data 15 0 2625m 2.5g 3548 S 0 32.2 317:02.53 trac.fcgi
15142 www-data 15 0 511m 407m 3536 S 0 5.1 46:46.00 trac.fcgi
23190 www-data 15 0 437m 393m 3524 S 0 4.9 42:09.89 trac.fcgi
19822 www-data 15 0 216m 142m 3516 S 0 1.8 13:06.86 trac.fcgi
--Noah Kantrowitz
It's in trunk now, r6385.
Perhaps Noah and Jonas can also give this latest trunk a spin and see how it
handles itself.
In all my testing I could never get it above 30 MB, since I guess my
test set is too small to trigger anything.
Christopher pointed me at Funkload and I started writing tests to
exercise Trac installs with it, so hopefully we can soon use some
automated testing to determine weak points as well.
I just wish there was a clearer way to deal with this programmatically
instead of jumping through many, many arcane hoops. I can debug C and
related memory issues easily, so I'll gladly receive any
hints/clues/whatever.
--
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/
Wisdom is the difference between knowledge and experience...
> -On [20080109 16:56], David Abrahams (da...@boost-consulting.com) wrote:
>>I think we already found and submitted a patch for this problem:
>>http://trac.edgewall.org/ticket/6581
>
> It's in trunk now, r6385.
>
> Perhaps Noah and Jonas can also give this latest trunk a spin and see how it
> handles itself.
>
> All my testing could never get it above 30 MB since I guess my
> testset is too small to trigger anything.
>
> Christopher pointed me at Funkload and I started writing tests for
> this to test Trac installs with it, so hopefully we can use some
> automated testing soon as well to determine weak points.
>
> I just wish there was a clearer way to deal with this
> programmatically instead of jumping through many, many arcane
> hoops. I can debug C and related memory issues easily, so I'll
> gladly receive hints/clues/whatsoever.
I think this is just one of the upshots of using a GC language where
"you don't have to worry about memory management." You just end up
with different kinds of worries.
--
Dave Abrahams
Boost Consulting
Just for the record: I tested the latest trunk again but didn't notice
any improvement. This time there was more traffic on the server than
last week, so it took no more than about 30 minutes for the server to
start running out of memory. When I aborted the experiment, each fcgi
process was using about 3 times more resident memory (325MB) than with
0.10.x (around 100MB after running for a week).
Cheers,
Jonas
You're right on both points. Maybe you could reopen #G166, for
completeness, so that this format leak can be fixed as well. But in my
tests, I've been careful to disable the _speedups extension, so it's not
related to THE leak ;-)
By the way, I'm not sure I found THE leak, but at least an important
one. I'm still trying to find my way through the generator chain in
Genshi to understand more precisely what the problem is, but things go
really bad memory-wise once an exception is raised from an expression
evaluation (and it doesn't seem to be related to what we do with that
exception later at the Trac level). I'll update #6614 when I have new
findings.
-- Christian
I think I read somewhere that combining yield with exceptions can be
problematic when it comes to leaving stray references around. Perhaps
this might help you?
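Something along these lines, perhaps (a toy illustration, not Trac or
Genshi code; on Python 2.4/2.5, next(g) is spelled g.next()): a
traceback kept from an exception raised inside a generator keeps the
generator's frame, and everything in its locals, alive:

    import gc
    import sys
    import weakref

    class Payload(object):
        pass

    def gen():
        payload = Payload()        # stands in for per-request template data
        yield payload
        raise ValueError("boom")   # exception raised mid-iteration

    g = gen()
    ref = weakref.ref(next(g))     # only a weak reference on our side
    try:
        next(g)
    except ValueError:
        saved = sys.exc_info()     # hold on to the traceback...

    gc.collect()
    # ...and the frames in that traceback keep the payload alive:
    print("still alive: %s" % (ref() is not None))
    del saved
    gc.collect()
    print("still alive: %s" % (ref() is not None))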
--
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/
We have met the enemy and they are ours...
Great work closing in on the problem, Christian! I tried to reproduce
this with a somewhat smaller example, but have failed so far (I also,
unfortunately, have very little time right now).
So what can this be, aside from stack frame data hanging around in a
bad way? The various generators don't have much local data, and global
data shouldn't be used at all. I'm lost.
If you can reliably reproduce this problem with a local data set (do
you have some instructions maybe?), could you maybe try whether it
also happens with Genshi 0.4.4? I'd like to know whether this problem
is new in 0.5dev.
Thanks,
Chris
--
Christopher Lenz
cmlenz at gmx.de
http://www.cmlenz.net/
Unfortunately the same goes for me (plus the fact that at times I
needed to take a break from those daunting debugging sessions and work
on features I needed, like the wiki rename and multirepos support).
Also, I still haven't tried to reproduce the issue with a smaller
example, but of course the problem might relate to the Trac/Genshi
interaction rather than to Genshi alone.
> So what can this be aside from stack frame data hanging around in a
> bad way? The various generators don't have much local data, and global
> data shouldn't be used at all. I'm lost.
>
From my very latest findings, it appears that there are two issues,
probably rooted in the same cause.
1. At the very least, the `data` dictionary of every request
accumulates over time.
This can be hard to spot when the data corresponds to "cheap" pages,
but it can amount to hundreds of MB per request if that data keeps
costly stuff alive. Timeline requests are an example of such expensive
queries: look 3 years back in the timeline of an environment set up to
browse a copy of the t.e.o repository, using the direct-svnfs access
type (i.e. no cache involved). There's no need to list _lots_ of
changesets; merely having to go through all the changesets to get there
accumulates data in pools that are kept alive because the changesets
themselves are kept alive (which can in itself be considered another
bug at the svn_fs level).
But even if we fix the SVN issue or never get there in the first place,
there's a kind of leak going on for every request; that use case just
makes it *very* apparent.
2. If an exception is raised from a code block (Expression or Suite),
it's even worse. The *whole traceback* gets accumulated, i.e. all
frames, so all objects referenced by local variables in any of the
frames up to the `evaluate` one will be kept alive for the rest of the
lifespan of the process. In this case, of course, memory usage
increases even more dramatically than in case 1.
To be clear, when I say "accumulated" in the above, it's really that
those objects are considered live by the GC, not unreachable. I do a
gc.collect() with the SAVE_ALL flag set, after every request, and
collect() says we have no unreachable objects. This means that
apparently we're doing the right thing for our __del__ stuff.
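One generic way to see what "accumulates" is to compare per-type counts
of live objects around a request; a rough sketch (handle_request() is
only a placeholder for whatever serves one request, nothing here is
Trac-specific):

    import gc
    from collections import defaultdict

    def live_counts():
        counts = defaultdict(int)
        for obj in gc.get_objects():
            counts[type(obj).__name__] += 1
        return counts

    def report_growth(before, after, top=10):
        growth = sorted(((after[name] - before[name], name) for name in after),
                        reverse=True)
        for delta, name in growth[:top]:
            if delta > 0:
                print("%+6d  %s" % (delta, name))

    before = live_counts()
    handle_request()          # placeholder: serve one request here
    gc.collect()
    report_growth(before, live_counts())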
<some-more-gory-details>
Now the weird thing is that when I debug case (1) using a call to
gc.get_referrers() on the 'data' object, there are 3 referrers to it:
the local frame (fine), the list of live objects as obtained with
gc.get_objects() (also fine), plus a very curious one.
That last referrer is most of the time a dict object which looks like
the session, but is not the session - I've modified the Session so that
it's no longer a dict itself but an object delegating its dict-like
methods to a self._dict dictionary. Even that new _dict field shows up
as a key in this dict-like session, so I'm sure the new code is being
picked up (btw, I made this change because of
http://bugs.python.org/issue1469629). But it happened a few times that
this third object was an unrelated <Request "..../trac.css"> object,
and sometimes another list with a huge pile of crap inside (not the
live objects list). Weird. Not to mention that this dict doesn't even
appear to contain an entry for 'data'...
Ok, those were the very latest findings of yesterday evening (quite
late!) and I don't know how much I can trust get_referrers(), so it
might well be (just another) wrong track...
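For reference, the kind of referrer inspection described above boils
down to something like this (a generic sketch; 'suspect' is just a
stand-in for the request's 'data' dict):

    import gc

    def describe_referrers(obj):
        gc.collect()
        for ref in gc.get_referrers(obj):
            if isinstance(ref, dict):
                # Print a few keys so the dict can be recognized later
                # (a frame's locals, a session-like mapping, ...).
                print("dict with keys: %r" % (list(ref)[:8],))
            else:
                print("%s at 0x%x" % (type(ref).__name__, id(ref)))

    suspect = {"template": "ticket.html"}   # stand-in for a request's 'data'
    holder = [suspect]                      # something unexpectedly keeping it alive
    describe_referrers(suspect)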
</some-more-gory-details>
> If you can reliably reproduce this problem with a local data set (do
> you have some instructions maybe?), could you maybe try whether it
> also happens with Genshi 0.4.4? I'd like to know whether this problem
> is new in 0.5dev.
>
Great advice, I'll try with 0.4.4.
Also, I'll post some of the patches I used for debugging case (1) and
trigger case (2) on #6614 later today.
-- Christian
Crap, I meant DEBUG_UNCOLLECTABLE and s/unreachable/uncollectable/ in
the above, as that's how I tested it on Linux (as reported in #6614);
yesterday on Windows I used the wrong flag above, which means... well,
the whole case 1. was complete crap (except for the remark about
get_changes(), which hints at a possible optimization in svn_fs.py).
This is the correct explicit_gc.diff patch which I should have used:
Index: trac/web/main.py
===================================================================
--- trac/web/main.py (revision 6489)
+++ trac/web/main.py (working copy)
@@ -390,6 +390,16 @@
     finally:
         if env and not run_once:
             env.shutdown(threading._get_ident())
+            # Now it's a good time to do some clean-ups
+            import gc
+            gc.disable()
+            gc.set_debug(gc.DEBUG_UNCOLLECTABLE)
+            unreachable = gc.collect()
+            env.log.info("%d unreachable objects found.", unreachable)
+            uncollectable = len(gc.garbage)
+            if uncollectable:
+                del gc.garbage[:]
+                env.log.warn("%d uncollectable objects found.", uncollectable)
 
 def _dispatch_request(req, env, env_error):
     resp = []
With that patch, there's no leak and memory usage stays constant for
scenario 1.
Sorry for the confusion.
Case 2. still stands.
-- Christian
How bad is the leakage at this point?
I mean, might it make sense to release 0.11.0 now to get the release out the
door and continue to look for these more elusive cases for a 0.11.1 release?
I am just trying to get a sense of the choice between finally releasing 0.11
and stability for the end-users.
--
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/
All for one, one for all...
I'm a bit short on time these days, so I can't really say /when/ I'll be
able to sit down again and try to fix the issue with Genshi. On the Trac
side, it's quite clear what has to be done (looking at
http://trac.edgewall.org/ticket/6614#comment:18):
1. do an explicit garbage collection after each request
(http://trac.edgewall.org/attachment/ticket/6614/explicit_gc.diff)
2. tune the pool usage in the svn backend:
a. for the _history call
(http://trac.edgewall.org/attachment/ticket/6614/fix-_history.diff)
b. for listing long directories (#5782)
I can commit 1. and 2.a. shortly, and with that (plus using Python 2.5
on the server), things will start to look quite acceptable.
2.b. needs more work, but it's not a blocker, I think, so it could well
wait for 0.11.1 (or 0.12 if I end up reworking the Pool code entirely).
Now, if for any reason the server hits the problem with Genshi
(http://trac.edgewall.org/ticket/6614#comment:40), the memory usage
will increase dramatically again. To fix that one, I think the approach
outlined in the last paragraph of
http://trac.edgewall.org/ticket/6614#comment:38 will work; I only need
some more time (or someone else to do it, of course).
> I mean, might it make sense to release 0.11.0 now to get the release out the
> door and continue to look for these more elusive cases for a 0.11.1 release?
>
Well, it's more a matter of fixing the Genshi issue for 0.5, then
making Trac 0.11 depend on that version. As I said in comment:38
(linked above), 0.4.x has the #G158 bug, which makes Trac unreliable in
a multithreaded setup, so I wouldn't recommend using it.
> I am just trying to get a sense of the choice between finally releasing 0.11
> and stability for the end-users.
>
Perhaps we can make another beta after the two patches on #6614 (listed
in 1. and 2.a above) are committed, and see how things work in practice?
-- Christian
By all means. :)
>2.b. needs more work, but is not a blocker, I think, so it could well
>wait 0.11.1 (or 0.12 if I end up reworking the Pool code entirely).
>
>Now, if for any reason the server hits the problem with Genshi
>(http://trac.edgewall.org/ticket/6614#comment:40), the memory usage will
>increase dramatically again. To fix that one, I think the approach
>outlined in the last paragraph of
>http://trac.edgewall.org/ticket/6614#comment:38 will work, I only need
>some more time (or someone else to do it, of course).
Well, I started yesterday looking at the remaining tickets for 0.11 in
order to close them, but it seems you went through half of them after
Odd Simon and I closed one. :)
I'll check the tickets' full history when I get home.
>> I mean, might it make sense to release 0.11.0 now to get the release out the
>> door and continue to look for these more elusive cases for a 0.11.1 release?
>
>Well, it's more a matter of fixing the Genshi issue for 0.5, then
>making Trac 0.11 depend on that version. As I said in comment:38
>(linked above), 0.4.x has the #G158 bug, which makes Trac unreliable
>in a multithreaded setup, so I wouldn't recommend using it.
Ok, but then releasing Genshi 0.5 becomes a blocking priority before
releasing Trac 0.11, since we need to get the eggs, especially for Windows
with speedups, out there.
>> I am just trying to get a sense of the choice between finally releasing 0.11
>> and stability for the end-users.
>
>Perhaps we can make another beta after the two patches on #6614 (listed
>in 1. and 2.a above) are committed, and see how things work in practice?
Sure, I think our encountered issues warrant another beta regardless.
--
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/
There is time in life for everything...
Jeroen Ruigrok van der Werven wrote:
>>> I mean, might it make sense to release 0.11.0 now to get the release out the
>>> door and continue to look for these more elusive cases for a 0.11.1 release?
>>>
>> Well, it's more a matter of fixing the Genshi issue for 0.5, then
>> making Trac 0.11 depend on that version. As I said in comment:38
>> (linked above), 0.4.x has the #G158 bug, which makes Trac unreliable
>> in a multithreaded setup, so I wouldn't recommend using it.
>>
>
> Ok, but then releasing Genshi 0.5 becomes a blocking priority before
> releasing Trac 0.11, since we need to get the eggs, especially for Windows
> with speedups, out there.
>
Thanks to Alec, who managed to write a Genshi-only test program
exhibiting the problem, we could narrow down the specific issue of
memory constantly growing when an exception is raised from a template
to a Genshi bug. Well, I have the feeling it must rather be a Python
bug, but at least it's triggered at (and can be worked around at) the
Genshi level. See #G190.
But that issue is not the only problem we have w.r.t. high memory
usage. I think it's still relatively easy to make the memory usage grow
to a high number where it will stay forever, even with Python 2.5. That
has to be improved further.
>>> I am just trying to get a sense of the choice between finally releasing 0.11
>>> and stability for the end-users.
>>>
>> Perhaps we can make another beta after the two patches on #6614 (listed
>> in 1. and 2.a above) are committed, and see how things work in practice?
>>
>
> Sure, I think our encountered issues warrant another beta regardless.
>
>
The current state of affairs is, however, quite a bit better than it
used to be, so we can probably prepare a beta2 even if #6614 is not
completely fixed at this point. Thoughts?
-- Christian
>
> Some updates on the elusive #6614 bug
>
> Jeroen Ruigrok van der Werven wrote:
>>>> I mean, might it make sense to release 0.11.0 now to get the release
>>>> out the door and continue to look for these more elusive cases for a
>>>> 0.11.1 release?
>>>>
>>> Well, it's more a matter of fixing the Genshi issue for 0.5, then
>>> making Trac 0.11 depend on that version. As I said in comment:38
>>> (linked above), 0.4.x has the #G158 bug, which makes Trac unreliable
>>> in a multithreaded setup, so I wouldn't recommend using it.
>>>
>>
>> Ok, but then releasing Genshi 0.5 becomes a blocking priority before
>> releasing Trac 0.11, since we need to get the eggs, especially for
>> Windows with speedups, out there.
>>
>
> Thanks to Alec, who managed to write a Genshi-only test program
> exhibiting the problem, we could narrow down the specific issue of
> memory constantly growing when an exception is raised from a template
> to a Genshi bug. Well, I have the feeling it must rather be a Python
> bug,
I'm not sure if this is related at all, since you're talking about
0.11 and I'm running 0.10-stable, but I have had a 2-year issue with
memory and CPU usage getting out of control at dsource.org. With some
help, gdb, and a Python compiled with symbols, we tracked all the
activity to the _sre.c file (Python 2.4.4 source, but the issue is in
2.5 as well). This is the regexp portion of Python, and it seems to go
into an endless loop. The CPU spikes and the system starts swapping. I
actually have a cronjob that restarts Apache if utilization is too
high... :(
We didn't figure out what part of Trac triggered this within Python,
but I thought I'd mention it for this thread. I had always assumed it
was my use of a CentralDatabaseManager customization that I've made to
the Trac code base. You're talking about high memory usage, and not
so much CPU, so I don't know if it's related or not...
> We didn't figure out what part of Trac triggered this within Python,
> but I thought I'd mention it for this thread. I had always assumed it
> was my use of a CentralDatabaseManager customization that I've made to
> the Trac code base. You're talking about high memory usage, and not
> so much CPU, so I don't know if it's related or not...
We've also had periodic runaway Trac processes. I just had to kill this
one:
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
84924 www 4 20 0 116M 2848K kserel 0 40.0H 98.73% python
This is with 0.11-dev.
--
Dave Abrahams
Boost Consulting
If this happens again, please attach to the process with gdb before
killing it.
The backtrace could give a hint about what's going on.
-- Christian
Will do.
This is what happened:
$ gdb
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd".
(gdb) attach 3302
Attaching to process 3302
/usr/src/gnu/usr.bin/gdb/libgdb/../../../../contrib/gdb/gdb/solib-svr4.c:1443: internal-error: legacy_fetch_link_map_offsets called without legacy link_map support enabled.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) y
So now I need to try to figure out how to enable legacy link_map
support.
> If you answer "n" to quit and then "n" to creating a core file, it
> will continue
It did for me.
> (and seems to give reasonable information).
It didn't for me: a neverending backtrace with no symbols.
> On my FreeBSD box I get the same behavior as you by attaching without
> specifying the executable itself. Try:
>
> gdb -p $pid `which python`
Hmm, next time, then. Thanks.
> If you answer "n" to quit and then "n" to creating a core file, it
> will continue (and seems to give reasonable information).
>
> On my FreeBSD box I get the same behavior as you by attaching without
> specifying the executable itself. Try:
>
> gdb -p $pid `which python`
This Trac process was occupying 400MB but 0% CPU when I stopped it:
$ gdb -p 94603 `which python`
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...(no debugging symbols found)...
Attaching to program: /usr/local/bin/python, process 94603
ptrace: Operation not permitted.
/usr/home/dave/94603: No such file or directory.
(gdb) bt
No stack.
(gdb)
One more note: there were two others weighing in at 225MB or so. These
processes linger for a good long while (maybe 30 sec for the longest)
after I stop lighttpd, but they will exit on their own eventually
without being explicitly killed.
It's also possible that you're hitting some Python "bugs" [1] triggered
by Genshi [2]. The Genshi ticket actually discusses two issues: one a
genuine memory leak in Python's builtin groupby(), and another covered
by the linked Python issue, where memory allocated to unicode objects
is never returned to the system. The latter is "fixed" in Genshi r816
[3] and Trac r6757 [4], and the former is fixed in Trac r6637 [5].
If you're not already running at least Trac r6637 and Genshi r816, you
should upgrade and see if your problem persists.
[1] http://bugs.python.org/issue2321
[2] http://genshi.edgewall.org/ticket/190
[3] http://genshi.edgewall.org/changeset/816
[4] http://trac.edgewall.org/changeset/6756
[5] http://trac.edgewall.org/changeset/6637
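If it helps, here's a rough way to check which Trac and Genshi a given
Python actually picks up (a generic sketch using setuptools'
pkg_resources; dev builds of that era usually embed the svn revision in
the version string):

    import pkg_resources

    for name in ("Trac", "Genshi"):
        dist = pkg_resources.get_distribution(name)
        # Dev builds typically report something like "0.11dev-r6637".
        print("%s %s installed at %s"
              % (dist.project_name, dist.version, dist.location))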
--
Evolution: Taking care of those too stupid to take care of themselves.
Last time I mentioned a possible threading/python issue on FreeBSD I got
told on the Python list that FreeBSD's threading has caused issues for
Python before. Now, whether or not this is a problem within FreeBSD or
simply something else (read: Python itself), I cannot say.
--
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Nothing yet from nothing ever came...
> -On [20080617 22:20], David Abrahams (da...@boostpro.com) wrote:
>>#0 0x000000080098b85c in pthread_testcancel () from /lib/libpthread.so.2
>>#1 0x0000000800982de7 in pthread_mutexattr_init () from /lib/libpthread.so.2
>>#2 0x00000008009849cb in pthread_mutexattr_init () from /lib/libpthread.so.2
>>#3 0x000000080098799f in pthread_setconcurrency () from /lib/libpthread.so.2
>
> Last time I mentioned a possible threading/python issue on FreeBSD I got
> told on the Python list that FreeBSD's threading has caused issues for
> Python before. Now, whether or not this is a problem within FreeBSD or
> simply something else (read: Python itself), I cannot say.
Here's another stack backtrace of a Trac process taking 100% CPU. It's
a good thing the server has two cores ;-)
I realize this seems to implicate FreeBSD threading, but it doesn't
happen to my Django processes, for example, so I'm still a little
inclined to suspect Trac. I have a hunch that these runaway processes
start when I ask Trac for something that takes a long time, like an old
changeset, and then get impatient and cancel the request. Is that
possible?
Thanks in advance for any help you can offer.