Hello.
Linux 2.6.36-hardened-r6 #2 SMP Mon Jan 31 00:18:27 EET 2011 i686 AMD Phenom(tm) 9650 Quad-Core Processor
Python 2.7.1
cherrypy-3.2.0
Today my cherrypy app have failed 3 times, with error "Too many open files". It is limited to 1024 file descriptor. In usual situation it use ~10 file descriptors. When application was in "fail" state I have looked in /proc/$pid/fd to see what descriptors is used. And all of them was client sockets. So I am shure that it is was not fd leak in my app(it didn't use sockets).
According to nginx logs(it placed in front of app) - when app fail, it was unable to connect to and got error "connect() failed (111: Connection refused) while connecting to upstream". But app process does not died and ignore SIGTERM. From app logs I know that some request handles start to process but not ends. I know that they freeze after dispatcher choose handler, but before handler starts. I use _cp_dispatch method for routing and log route process, so I can see in what phase of request processing it freeze.
I can guess only one place where it cant freeze. App use cherrypy.session module, and configured to store them in files. I think that there is some kind of dead lock on session lock phase. But I can't reproduce problem on test server so it is only hypothesis. And not want it repeats on production server.
Now I switch to session in memory. If my hypothesis is correct - app will not fail any more.
I need advice of method how to reproduce and track this problem.
And one more thing - is it should stop working process in this situation? Because process not died, manager process did not restart app, and entire site become unavailable.
PS Sorry my ugly english.
Small part of app log:
2011-07-11 21:03:19,682 13091:-1599640720 [ DEBUG:root] route('res') => <route_node: [<rr: u'logo.png'>]>
2011-07-11 21:03:19,682 13091:-1599640720 [ DEBUG:root] route('logo.png') => <class 'puzzleit.http.migration.proxy_auth'>
2011-07-11 21:03:19,684 13091:-1599640720 [ DEBUG:root] Run <puzzleit.http.migration.proxy_auth object at 0x1ca235ac>, remote: httputil.Host('109.234.11.212', 50304, '')
2011-07-11 21:03:19,685 13091:-1599640720 [ DEBUG:sqlalchemy.pool.QueuePool.0x...98ec] Connection <_mysql.connection open to 'gen0.mysql.stabilis.int' at 1c57014c> checked out from pool
2011-07-11 21:03:19,685 13091:-1599640720 [ INFO:sqlalchemy.engine.base.Engine.0x...9aac] BEGIN (implicit)
2011-07-11 21:03:19,686 13091:-1599640720 [ INFO:sqlalchemy.engine.base.Engine.0x...9aac] SELECT session_share.user_idnr AS session_share_user_idnr
2011-07-11 21:03:19,686 13091:-1599640720 [ INFO:sqlalchemy.engine.base.Engine.0x...9aac] ('f38075e60c7c019228f60ed95bdbc214',)
2011-07-11 21:03:19,687 13091:-1599640720 [ DEBUG:sqlalchemy.engine.base.Engine.0x...9aac] Col ('session_share_user_idnr',)
2011-07-11 21:03:19,688 13091:-1599640720 [ DEBUG:root] There is no session_share record for PHPSESSID="f38075e60c7c019228f60ed95bdbc214"
2011-07-11 21:03:19,688 13091:-1599640720 [ INFO:sqlalchemy.engine.base.Engine.0x...9aac] COMMIT
2011-07-11 21:03:19,691 13091:-1599640720 [ DEBUG:sqlalchemy.pool.QueuePool.0x...98ec] Connection <_mysql.connection open to 'gen0.mysql.stabilis.int' at 1c57014c> being returned to pool
2011-07-11 21:03:19,691 13091:-1599640720 [ DEBUG:root] Perform redirect: HTTPRedirect([], 304)
2011-07-11 21:03:19,692 13091:-1599640720 [ INFO:cherrypy.access.465627660] 109.234.11.212 - - [11/Jul/2011:21:03:19] "GET /res/logo.png HTTP/1.0" 304 - "http://puzzleit.org/" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.112 Safari/534.30"
2011-07-11 21:03:19,694 13091:-1599640720 [ DEBUG:root] route('res') => <route_node: [<rr: u'logo.png'>]>
2011-07-11 21:03:19,694 13091:-1599640720 [ DEBUG:root] route('logo.png') => <class 'puzzleit.http.migration.proxy_auth'>
2011-07-11 21:03:19,695 13091:-1599640720 [ DEBUG:root] Run <puzzleit.http.migration.proxy_auth object at 0x1c8c5d8c>, remote: httputil.Host('95.58.122.0', 50306, '')
2011-07-11 21:03:19,696 13091:-1599640720 [ DEBUG:sqlalchemy.pool.QueuePool.0x...98ec] Connection <_mysql.connection open to 'gen0.mysql.stabilis.int' at 1c5be8fc> checked out from pool
2011-07-11 21:03:19,696 13091:-1599640720 [ INFO:sqlalchemy.engine.base.Engine.0x...9aac] BEGIN (implicit)
2011-07-11 21:03:19,697 13091:-1599640720 [ INFO:sqlalchemy.engine.base.Engine.0x...9aac] SELECT session_share.user_idnr AS session_share_user_idnr
2011-07-11 21:03:19,697 13091:-1599640720 [ INFO:sqlalchemy.engine.base.Engine.0x...9aac] ('f19d58d61f16bd33b1eb13020db156af',)
2011-07-11 21:03:19,698 13091:-1599640720 [ DEBUG:sqlalchemy.engine.base.Engine.0x...9aac] Col ('session_share_user_idnr',)
2011-07-11 21:03:19,698 13091:-1599640720 [ DEBUG:root] There is no session_share record for PHPSESSID="f19d58d61f16bd33b1eb13020db156af"
2011-07-11 21:03:19,699 13091:-1599640720 [ INFO:sqlalchemy.engine.base.Engine.0x...9aac] COMMIT
2011-07-11 21:03:19,701 13091:-1599640720 [ DEBUG:sqlalchemy.pool.QueuePool.0x...98ec] Connection <_mysql.connection open to 'gen0.mysql.stabilis.int' at 1c5be8fc> being returned to pool
2011-07-11 21:03:19,702 13091:-1599640720 [ DEBUG:root] Perform redirect: HTTPRedirect([], 304)
2011-07-11 21:03:19,702 13091:-1599640720 [ INFO:cherrypy.access.465627660] 95.58.122.0 - - [11/Jul/2011:21:03:19] "GET /res/logo.png HTTP/1.0" 304 - "http://puzzleit.org/54a6db690762fbc23fe3a648b7452d5e/rating/" "Opera/9.80 (Windows NT 6.1; U; MRA 5.8 (build 4157); ru) Presto/2.8.131 Version/11.11"
2011-07-11 21:03:20,486 13091:-1599640720 [ DEBUG:root] route('res') => <route_node: [<rr: u'logo.png'>]>
2011-07-11 21:03:20,486 13091:-1599640720 [ DEBUG:root] route('logo.png') => <class 'puzzleit.http.migration.proxy_auth'>
2011-07-11 21:26:23,238 13091:-1591248016 [ ERROR:cherrypy.error] [11/Jul/2011:21:26:23] ENGINE Error in HTTP server: shutting down
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/cherrypy/process/servers.py", line 187, in _start_http_thread
File "/usr/lib/python2.7/site-packages/cherrypy/wsgiserver/__init__.py", line 1765, in start
File "/usr/lib/python2.7/site-packages/cherrypy/wsgiserver/__init__.py", line 1800, in tick
File "/usr/lib/python2.7/socket.py", line 202, in accept
error: [Errno 24] Too many open files
2011-07-11 21:26:23,238 13091:-1591248016 [ INFO:cherrypy.error] [11/Jul/2011:21:26:23] ENGINE Bus STOPPING
... but even after this - process 13091 continue to live, and even ignore SIGTERM. There was not log cut between 21:03:20 and 21:26:23, it just now serve requests.