here is the deal:
my apache server hit its maximum worker limit several times in the last
week.
every time I had to restart it, which bought me 10-20 hours until the
next incident.
I set up mod_status with extended information, and ran a script that
fetched it every minute to see what the server status looks like before
the problem appears.
my server didn't disappoint, and I got the problem 3-4 hours after I
set this up.
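the polling script itself is trivial; a sketch in current Python (the thread was on Python 2.4/2.5, and the URL is a placeholder for wherever mod_status is mounted - the `?auto` form gives machine-readable output):

```python
import time
import urllib.request

STATUS_URL = "http://localhost/server-status?auto"  # placeholder: wherever mod_status is mounted

def busy_workers(status_text):
    """Pull the BusyWorkers count out of mod_status '?auto' output."""
    for line in status_text.splitlines():
        if line.startswith("BusyWorkers:"):
            return int(line.split(":", 1)[1])
    return None  # line missing: ExtendedStatus is probably off

def poll(url=STATUS_URL, interval=60):
    """Print a timestamped busy-worker count once a minute."""
    while True:
        snapshot = urllib.request.urlopen(url).read().decode("ascii", "replace")
        print(time.strftime("%H:%M:%S"), "busy workers:", busy_workers(snapshot))
        time.sleep(interval)
```

watching BusyWorkers climb toward MaxClients is how you catch the problem before apache stops answering.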
it appears that trac is leaking apache workers.
my trac is configured with wsgi, and it looks like when there is a
certain trac error (any trac error?) the request does not terminate
properly.
with wild and stupid bots out there, this is a recipe for disaster.
attached is the output of the extended log. you can see that 143
requests are being served at once, all from the firestats.cc vhost (let
me assure you that my site is nowhere near that popular).
looking at the trac log, I can see many errors like:
2008-10-02 06:04:38,134 Trac[__init__] ERROR: database is locked
Traceback (most recent call last):
  File "/usr/lib/python2.4/site-packages/Trac-0.11.1-py2.4.egg/trac/web/main.py", line 423, in _dispatch_request
    dispatcher.dispatch(req)
  File "/usr/lib/python2.4/site-packages/Trac-0.11.1-py2.4.egg/trac/web/main.py", line 222, in dispatch
    req.session.save()
  File "/usr/lib/python2.4/site-packages/Trac-0.11.1-py2.4.egg/trac/web/session.py", line 97, in save
    (self.sid,))
  File "/usr/lib/python2.4/site-packages/Trac-0.11.1-py2.4.egg/trac/db/util.py", line 50, in execute
    return self.cursor.execute(sql_escape_percent(sql), args)
  File "/usr/lib/python2.4/site-packages/Trac-0.11.1-py2.4.egg/trac/db/sqlite_backend.py", line 58, in execute
    args or [])
  File "/usr/lib/python2.4/site-packages/Trac-0.11.1-py2.4.egg/trac/db/sqlite_backend.py", line 50, in _rollback_on_error
    return function(self, *args, **kwargs)
OperationalError: database is locked
just before the shit hits the fan.
before I upgraded trac, I used the RobotsTxtPlugin, but it does not seem
to be available for 0.11.
so, to summarize:
1. Bots can kill my trac site.
2. it appears that trac does not properly terminate responses on error.
3. robots.txt would help this situation (but not solve it, as some bots
don't give a flying fuck), but the plugin is not available for 0.11.
help would be appreciated.
Omry.
using mod_wsgi in apache2.
this is my vhost file:

WSGIDaemonProcess firestats.cc user=omry group=omry threads=25 maximum-requests=100

<VirtualHost *:80>
    WSGIProcessGroup firestats.cc
    WSGIApplicationGroup %{GLOBAL}
    ServerName firestats.cc
    ServerAlias www.firestats.cc
    ServerAdmin omry
    ErrorLog /home/omry/logs/firestats.cc/error.log
    WSGIScriptAlias / /home/omry/www/trac/trac.wsgi
    SetEnv trac.env_path /home/omry/www/trac/firestats.cc
</VirtualHost>
Attached a zip with the html file, hopefully this will work better.
Thanks.
>> using mod_wsgi in apache2.
>>
>
> But what version of mod_wsgi? Make sure you are using version 2.3 if
> you are currently on an older version.
>
2.3
exact debian version is:
2.3-1~bpo40+1
>> this is my vhost file :
>>
>> WSGIDaemonProcess firestats.cc user=omry group=omry threads=25
>> maximum-requests=100
>>
>
> I would use more than 100 requests for maximum-requests unless the
> memory leaks in your application are quite bad. Value of 1000 would be
> more reasonable.
>
I have a problem with apache workers running out due to trac misbehavior.
if anything, I will reduce this number to something that will not take
down the entire apache server whenever trac is acting up.
> If you are using mod_wsgi 2.X, there is a deadlock timeout of 300
> seconds, so if the whole process was being deadlocked by problem
> code, it should automatically kill off the process and restart it.
> Although this will not help if the Python code had released the GIL
> and was hung on access to the back end database.
>
> For this latter situation, you can add to WSGIDaemonProcess the
> option:
>
> inactivity-timeout=120
>
> If all threads get hung and nothing happens for 120 seconds, the
> process should automatically be killed off and restarted. This at
> least may get you out of situations where manual intervention is required.
>
Added this, thanks.
will this incident be logged anywhere?
it can be useful to get the stuck urls when this happens.
> Other than that, it looks like database requests may be hanging.
> You possibly need to attach gdb to the stuck process and look at stack
> traces to see if it is stuck in a third party C extension module such
> as a database module. See:
>
> http://code.google.com/p/modwsgi/wiki/DebuggingTechniques#Debugging_Crashes_With_GDB
>
> for pointers on how to do that.
>
This link is about debugging apache crashes.
my apache does not crash; instead it stops responding to any requests
because all the workers are busy with trac.
so I don't think this will be helpful.
> For reference, also see:
> https://www.coderesort.com/about/blog/sporadic_hang_solved
>
> This was mentioned on #trac. Not sure if that was by you or you were
> part of the conversation.
>
osimons mentioned this to me on the irc channel, but I am not using the
[[TOC]] macro so it's not the cause for my problem.
Omry.
>
> On Oct 5, 2:40 am, Omry Yadan <o...@yadan.net> wrote:
>
>>>> WSGIDaemonProcess firestats.cc user=omry group=omry threads=25
>>>> maximum-requests=100
>>>>
>>> I would use more than 100 requests for maximum-requests unless the
>>> memory leaks in your application are quite bad. Value of 1000 would be
>>> more reasonable.
>>>
>> I have a problem with apache workers running out due to trac misbehavior.
>> if anything, I will reduce this number to something that will not take
>> down the entire apache server whenever trac is acting up.
>>
>
> What database adapter are you using? If you are using an older version
> than the newest available, make sure you update it. Old versions of
> psycopg2 have problems.
>
I am using sqlite 3.3.8 with python-pysqlite2 2.3.2
>>> Other than that, it looks like database requests may be hanging.
>>> You possibly need to attach gdb to the stuck process and look at stack
>>> traces to see if it is stuck in a third party C extension module such as a
>>> database module. See:
>>>
>>> http://code.google.com/p/modwsgi/wiki/DebuggingTechniques#Debugging_C...
>>>
>>> for pointers on how to do that.
>>>
>> This link is about debugging apache crashes.
>> my apache does not crash, instead it stops responding to any requests
>> because all the workers are busy with trac.
>> so I don't think this will be helpful.
>>
>
> Yes it would be helpful. When you attach the debugger to a running
> process instance it will suspend all threads and you then get a
> stack trace for each one, to see where it is stuck. If stuck in a C
> extension module such as a database adapter, it will show you where in
> the C functions it is stuck. This can help in debugging the problem.
>
Can you please provide specific instructions on how to extract the stack
trace?
I know it's in the link, but next time it happens, I want to extract the
info as soon as possible and restart apache (trac is not the only site
on that apache).
I don't want to start toying around with gdb while the site is down if I
don't know exactly what to do.
>>>> I have a problem with apache workers running out due to trac misbehavior.
>>>> if anything, I will reduce this number to something that will not take
>>>> down the entire apache server whenever trac is acting up.
>>>>
>>> What database adapter are you using? If you are using an older version
>>> than the newest available, make sure you update it. Old versions of
>>> psycopg2 have problems.
>>>
>> I am using sqlite 3.3.8 with python-pysqlite2 2.3.2
>>
>
> Which is not the latest version of either package, which means any
> problem could well have been fixed long ago.
>
changed the WSGIDaemonProcess line to:

WSGIDaemonProcess firestats.cc user=omry group=omry threads=25 maximum-requests=100 inactivity-timeout=120

but it froze again (this time it took longer to happen, not sure if
that's related).
I didn't have the chance to upgrade sqlite yet.
my server froze again, with 155 apache processes.
this is the stack trace of one:
(gdb) thread apply all bt
Thread 1 (Thread 140422165649104 (LWP 28894)):
#0 0x00007fb6942a7182 in __read_nocancel () from /lib/libpthread.so.0
#1 0x00007fb69470258e in apr_file_read () from /usr/lib/libapr-1.so.0
#2 0x00007fb695017664 in apr_bucket_pipe_create () from
/usr/lib/libaprutil-1.so.0
#3 0x0000000000437e52 in ap_open_logs ()
#4 0x00000000004381ce in ap_scan_script_header_err_core ()
#5 0x00007fb690211071 in ?? () from /usr/lib/apache2/modules/mod_wsgi.so
#6 0x00007fb69020cd9e in ?? () from /usr/lib/apache2/modules/mod_wsgi.so
#7 0x0000000000432d09 in ap_run_handler ()
#8 0x0000000000435e82 in ap_invoke_handler ()
#9 0x0000000000442018 in ap_process_request ()
#10 0x000000000043f4dc in ap_register_input_filter ()
#11 0x0000000000439861 in ap_run_process_connection ()
#12 0x00000000004459b1 in ap_graceful_stop_signalled ()
#13 0x0000000000445c24 in ap_graceful_stop_signalled ()
#14 0x00000000004464c6 in ap_mpm_run ()
#15 0x0000000000420e70 in main ()
#0 0x00007fb6942a7182 in __read_nocancel () from /lib/libpthread.so.0
Here is the stacktrace of another:
#0 0x00007fb6942a7472 in __connect_nocancel () from /lib/libpthread.so.0
#1 0x00007fb690210591 in ?? () from /usr/lib/apache2/modules/mod_wsgi.so
#2 0x00007fb690210e2d in ?? () from /usr/lib/apache2/modules/mod_wsgi.so
#3 0x00007fb69020cd9e in ?? () from /usr/lib/apache2/modules/mod_wsgi.so
#4 0x0000000000432d09 in ap_run_handler ()
#5 0x0000000000435e82 in ap_invoke_handler ()
#6 0x0000000000442018 in ap_process_request ()
#7 0x000000000043f4dc in ap_register_input_filter ()
#8 0x0000000000439861 in ap_run_process_connection ()
#9 0x00000000004459b1 in ap_graceful_stop_signalled ()
#10 0x0000000000445c24 in ap_graceful_stop_signalled ()
#11 0x00000000004464c6 in ap_mpm_run ()
#12 0x0000000000420e70 in main ()
anything useful here?
Omry.
You suggested that the problem may have been solved in a later version of
sqlite. I am now running 3.5.9 and upgraded to python 2.5,
but the problem is still there.
in fact, I can easily reproduce it by running an http stress testing tool
(siege) against the trac /timeline url.
it brings my entire apache server down in a matter of seconds.
I rebuilt mod_wsgi with debug information, and this is the traceback:
2008-12-13 08:38:02,070 Trac[main] ERROR: database is locked
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/Trac-0.11.1-py2.5.egg/trac/web/main.py", line 423, in _dispatch_request
    dispatcher.dispatch(req)
  File "/usr/lib/python2.5/site-packages/Trac-0.11.1-py2.5.egg/trac/web/main.py", line 173, in dispatch
    chosen_handler)
  File "/usr/lib/python2.5/site-packages/Trac-0.11.1-py2.5.egg/trac/web/main.py", line 286, in _pre_process_request
    chosen_handler = filter_.pre_process_request(req, chosen_handler)
  File "/usr/lib/python2.5/site-packages/Trac-0.11.1-py2.5.egg/trac/versioncontrol/api.py", line 86, in pre_process_request
    self.get_repository(req.authname).sync()
  File "/usr/lib/python2.5/site-packages/Trac-0.11.1-py2.5.egg/trac/versioncontrol/cache.py", line 89, in sync
    ','.join(["'%s'" % key for key in CACHE_METADATA_KEYS]))
  File "/usr/lib/python2.5/site-packages/Trac-0.11.1-py2.5.egg/trac/db/util.py", line 51, in execute
    return self.cursor.execute(sql)
  File "/usr/lib/python2.5/site-packages/Trac-0.11.1-py2.5.egg/trac/db/sqlite_backend.py", line 58, in execute
    args or [])
  File "/usr/lib/python2.5/site-packages/Trac-0.11.1-py2.5.egg/trac/db/sqlite_backend.py", line 50, in _rollback_on_error
    return function(self, *args, **kwargs)
OperationalError: database is locked
>> >> ...
>> >> #15 0x0000000000420e70 in main ()
>> >> #0 0x00007fb6942a7182 in __read_nocancel () from /lib/libpthread.so.0
>> >>
>> >> Here is the stacktrace of another:
>> >>
>> >> #0 0x00007fb6942a7472 in __connect_nocancel () from
/lib/libpthread.so.0
>> >> #1 0x00007fb690210591 in ?? () from
/usr/lib/apache2/modules/mod_wsgi.so
>> >> #2 ...
>> >> #12 0x0000000000420e70 in main ()
>> >>
>> >> anything useful here?
>> >>
> >
> > No, for a couple of reasons.
> >
> > The first is that debug information is not available in mod_wsgi, as it
> > was not compiled with debugging. Thus ensure that the -g option is
> > included when compiling mod_wsgi.
> >
> > Second is that you actually want the tracebacks for the mod_wsgi daemon
> > process and not the Apache worker processes. Use the option:
> >
> > display-name=%{GROUP}
> >
> > with WSGIDaemonProcess. Use the 'ps' command to identify the mod_wsgi
> > daemon process and attach gdb to it, then get tracebacks.
> >
> > The reason for getting tracebacks from the daemon processes is to
> > ascertain whether the threads there are stuck or not. If not, then it
> > is an Apache configuration issue which most likely has nothing to do
> > with mod_wsgi.
the problem is that there is NO POOLING for sqlite connections.
from sqlite_backend.py:

class SQLiteConnection(ConnectionWrapper):
    """Connection wrapper for SQLite."""

    __slots__ = ['_active_cursors']
    poolable = have_pysqlite and os.name == 'nt' and sqlite_version >= 30301

problem is that my linux is not an 'nt' :), so poolable is false.
this in turn causes return_cnx to not pool the connection when it's returned.
so, effectively, there is currently no connection pooling for sqlite
on non-NT machines.
so, if you put your server under stress, it spawns many, many sqlite
connections, which eventually causes 'database is locked' and
'connection timeout' errors.
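for contrast, this is roughly what a bounded pool buys you - a toy sketch in current Python (class and method names are mine, not Trac's): at most `size` connections ever exist, and a request blocks for a free handle instead of opening a fresh one:

```python
import queue
import sqlite3

class TinyPool:
    """Toy bounded connection pool (illustrative, not Trac's implementation):
    at most `size` sqlite connections ever exist."""

    def __init__(self, path, size=5):
        self._q = queue.Queue()
        for _ in range(size):
            self._q.put(sqlite3.connect(path, check_same_thread=False))

    def acquire(self):
        # blocks until a connection is free, instead of opening a new one
        return self._q.get()

    def release(self, cnx):
        # hand the same handle back for reuse; never close it
        self._q.put(cnx)
```

with poolable = False there is nothing to hand back, so every request pays for a fresh trac.db handle.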
This restriction was introduced in [4493], as it seemed to be
causing crashes. This seems to be due to a bug in pysqlite:
http://oss.itsystementwicklung.de/trac/pysqlite/ticket/187
Unfortunately, there has been no activity on that ticket for 23 months...
I'd be interested to know if you still get the crash if you enable
pooling on linux.
-- Remy
Enabling pooling seems to have resolved my situation, and I have not
seen any crashes yet (in the last 12 hours).
Quoting the ticket:
"But on Linux, you very quickly run into the following crash (all it
takes are 2 concurrent requests to the timeline):"
since my stress test was 15 concurrent connections to the timeline and I
had no problems, I assume this is no longer an issue.
Hi Simon,
You can try the sqlite3_threadsafe call:
http://www.sqlite.org/c3ref/threadsafe.html

int sqlite3_threadsafe(void){ return SQLITE_THREADSAFE; }

Remy mentioned that it's probably a bug in pysqlite:
http://oss.itsystementwicklung.de/trac/pysqlite/ticket/187
Maybe it was resolved since then?
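the C call can also be made from Python via ctypes, without writing any C - a sketch (assumes libsqlite3 is installed somewhere ctypes can find it):

```python
import ctypes
import ctypes.util

# Locate and load the system's SQLite C library (assumption: it is on
# the default library search path, as on a stock Debian install).
libname = ctypes.util.find_library("sqlite3")
sqlite = ctypes.CDLL(libname)
sqlite.sqlite3_threadsafe.restype = ctypes.c_int

# 0 means SQLite was compiled single-threaded; non-zero means some
# threadsafe mode was compiled in.
print("sqlite3_threadsafe() ->", sqlite.sqlite3_threadsafe())
```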
>> I'd be interested to know if you still get the crash if you enable
>> pooling on linux.
>>
> Enabling pooling seems to have resolved my situation, and I have not
> seen any crashes yet (in the last 12 hours).
> Quoting the ticket:
> "But on Linux, you very quickly run into the following crash (all it
> takes are 2 concurrent requests to the timeline):"
>
> since my stress test was 15 concurrent connections to the timeline
> and I
> had no problems, I assume this is no longer an issue.
I think it's important to remember that a connection pool's job is
to reduce the connection overhead by reusing existing connections.
This is important for most server based databases where this overhead
can be very noticeable. But with sqlite's in-process design this
benefit should be much less noticeable if it's even measurable.
SQLite's Achilles' heel is concurrency. Only one active write transaction
is allowed at a time, regardless of whether the connection is "pooled" or not.
So that's why I can't really understand how enabling connection
pooling could on its own so clearly make all your concurrency related
problems go away.
/ Jonas
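that single-writer behaviour is easy to reproduce directly; a small sketch in current Python (not the 2.4/2.5 from the thread) provokes the same 'database is locked' error with just two connections:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

# isolation_level=None -> autocommit, so we control transactions explicitly
writer = sqlite3.connect(path, isolation_level=None)
writer.execute("CREATE TABLE t (x INTEGER)")
writer.execute("BEGIN IMMEDIATE")  # grab the write lock and hold it
writer.execute("INSERT INTO t VALUES (1)")

# a second connection now fails fast instead of waiting forever
other = sqlite3.connect(path, timeout=0.1)
try:
    other.execute("INSERT INTO t VALUES (2)")
    outcome = "ok"
except sqlite3.OperationalError as exc:
    outcome = str(exc)

print(outcome)  # -> database is locked
```

under real load the effect is the same, just multiplied across every worker that piles up behind the lock.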
I tested using siege, a simple http stress tester.
my test was simply to stress test the timeline url with 15 concurrent
clients.
the result of this - before enabling pooling - is that soon enough trac
begins to time out and/or return http 500.
Osimons suggested checking the number of open files held by the wsgi
process, which showed that while under stress the number sometimes jumped
to a very high number of open trac.db file handles (almost 1000 in one case).
once I figured out that pooling was disabled, I hacked the code to
re-enable it, and noticed that the number of open trac.db file handles
no longer exceeds the number of allowed connections in the pool.
also, my stress test now passes fine.
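counting those handles is a short walk over /proc; a sketch of the check (Linux only, helper name is mine):

```python
import os

def open_handles(pid, needle):
    """Count file descriptors of `pid` whose target path contains
    `needle`, by walking /proc/<pid>/fd (Linux only)."""
    fd_dir = "/proc/%d/fd" % pid
    hits = 0
    for fd in os.listdir(fd_dir):
        try:
            if needle in os.readlink(os.path.join(fd_dir, fd)):
                hits += 1
        except OSError:
            continue  # fd closed between listdir and readlink
    return hits
```

run it against the mod_wsgi daemon pid while siege is hammering /timeline; without pooling the trac.db count climbs with the load.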
while looking at the code, I could not figure out when the connections
to sqlite are actually closed, and I suspect that the problem is that
they are never closed explicitly - only when they are garbage collected.
this is with the almost trivial load caused by 15 concurrent connections
to a strong server with plenty of RAM.
as for system limits, this is the output of ulimit -a:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 16308
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 16308
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited