silent crash ... process still running


Stéphane MERLE

Feb 3, 2017, 4:19:45 AM
to Lucee
Hi,

From time to time I get a front-end crash with no explanation...

My setup is two load-balancing VMs running nginx with keepalive, and two other VMs running Lucee (Apache 2 / Tomcat configuration): Lucee 4.5.4.017.

Right now one of my nodes is dead and doesn't respond to any page request.

On the VM, the process is still running:
lucee      989     1  1 03:30 ?        00:05:39 /opt/lucee/jdk/jre/bin/java -Djava.util.logging.config.file=/opt/lucee/tomcat/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Xms256m -Xmx512m -XX:MaxPermSize=128m -javaagent:lib/lucee-inst.jar -javaagent:/opt/newrelic/newrelic.jar -Djava.endorsed.dirs=/opt/lucee/tomcat/endorsed -classpath /opt/lucee/tomcat/bin/bootstrap.jar:/opt/lucee/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/opt/lucee/tomcat -Dcatalina.home=/opt/lucee/tomcat -Djava.io.tmpdir=/opt/lucee/tomcat/temp org.apache.catalina.startup.Bootstrap start




Apache is OK; I can see that it tries to send requests to Lucee:

192.168.2.9 - - [03/Feb/2017:10:08:07 +0100] "GET /monitoring/test.cfm?&nocache=575120 HTTP/1.1" 502 676 "-" "libwww-perl/5.834"
192.168.2.9 - - [03/Feb/2017:10:08:35 +0100] "GET /api/V1.0/ping/binary HTTP/1.0" 502 665 "-" "Mozilla/5.0 (compatible; Chirp/1.0; +http://www.binarycanary.com/chirp.cfm)"
192.168.2.9 - - [03/Feb/2017:10:08:43 +0100] "GET /api/V1.0/ping/binary HTTP/1.0" 502 665 "-" "Mozilla/5.0 (compatible; Chirp/1.0; +http://www.binarycanary.com/chirp.cfm)"
192.168.2.10 - - [03/Feb/2017:10:09:02 +0100] "GET /monitoring/test.cfm?&nocache=316214 HTTP/1.1" 502 676 "-" "libwww-perl/5.834"
192.168.2.9 - - [03/Feb/2017:10:09:02 +0100] "GET /monitoring/test.cfm?&nocache=286391 HTTP/1.1" 502 676 "-" "libwww-perl/5.834"
192.168.2.10 - - [03/Feb/2017:10:09:07 +0100] "GET /monitoring/test.cfm?&nocache=139938 HTTP/1.1" 502 676 "-" "libwww-perl/5.834"
192.168.2.9 - - [03/Feb/2017:10:09:07 +0100] "GET /monitoring/test.cfm?&nocache=146543 HTTP/1.1" 502 676 "-" "libwww-perl/5.834"
192.168.2.9 - - [03/Feb/2017:10:09:35 +0100] "GET /api/V1.0/ping/binary HTTP/1.0" 502 665 "-" "Mozilla/5.0 (compatible; Chirp/1.0; +http://www.binarycanary.com/chirp.cfm)"
192.168.2.9 - - [03/Feb/2017:10:09:43 +0100] "GET /api/V1.0/ping/binary HTTP/1.0" 502 665 "-" "Mozilla/5.0 (compatible; Chirp/1.0; +http://www.binarycanary.com/chirp.cfm)"


And in the Apache error log:
[Fri Feb 03 10:14:08.070572 2017] [proxy:error] [pid 9148:tid 140559440111360] [client 192.168.2.9:57315] AH00898: Error reading from remote server returned by /monitoring/test.cfm
[Fri Feb 03 10:14:35.519512 2017] [proxy_http:error] [pid 9205:tid 140559322613504] (70007)The timeout specified has expired: [client 192.168.2.9:58900] AH01102: error reading status line from remote server 127.0.0.1:8888
[Fri Feb 03 10:14:35.519595 2017] [proxy:error] [pid 9205:tid 140559322613504] [client 192.168.2.9:58900] AH00898: Error reading from remote server returned by /api/V1.0
[Fri Feb 03 10:14:43.591082 2017] [proxy_http:error] [pid 9205:tid 140559490467584] (70007)The timeout specified has expired: [client 192.168.2.9:59415] AH01102: error reading status line from remote server 127.0.0.1:8888
[Fri Feb 03 10:14:43.591202 2017] [proxy:error] [pid 9205:tid 140559490467584] [client 192.168.2.9:59415] AH00898: Error reading from remote server returned by /api/V1.0



But the Lucee log files haven't moved since 5:04 this morning, and it's 10:15 here:
-rw-rw-r-- 1 lucee lucee 97194864 Feb  3 05:04 catalina.out
-rw-rw-r-- 1 lucee lucee    26937 Feb  3 05:04 localhost_access_log.2017-02-03.txt
-rw-rw-r-- 1 lucee lucee     5545 Feb  3 05:04 catalina.2017-02-03.log


03-Feb-2017 03:30:04.328 INFO [Thread-25] org.apache.coyote.AbstractProtocol.pause Pausing ProtocolHandler ["http-nio-8888"]
03-Feb-2017 03:30:04.386 INFO [Thread-25] org.apache.coyote.AbstractProtocol.pause Pausing ProtocolHandler ["ajp-nio-8009"]
03-Feb-2017 03:30:04.441 INFO [Thread-25] org.apache.catalina.core.StandardService.stopInternal Stopping service Catalina
03-Feb-2017 03:30:04.534 WARNING [api.bpreducer.com-startStop-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [ROOT] appears to have started a thread named [Thread-22] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
 java.lang.Thread.sleep(Native Method)
 lucee.commons.io.SystemUtil.sleep(SystemUtil.java:623)
 lucee.runtime.engine.Controler.run(Controler.java:132)
03-Feb-2017 03:30:04.543 WARNING [api.bpreducer.com-startStop-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [ROOT] appears to have started a thread named [Thread-1231551] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
 java.lang.Object.wait(Native Method)
 lucee.commons.io.SystemUtil.wait(SystemUtil.java:646)
 lucee.commons.io.retirement.RetireOutputStreamFactory$RetireThread.run(RetireOutputStreamFactory.java:57)
03-Feb-2017 03:30:04.677 INFO [Thread-25] org.apache.coyote.AbstractProtocol.stop Stopping ProtocolHandler ["http-nio-8888"]
03-Feb-2017 03:30:04.703 INFO [Thread-25] org.apache.coyote.AbstractProtocol.stop Stopping ProtocolHandler ["ajp-nio-8009"]
03-Feb-2017 03:30:04.737 INFO [Thread-25] org.apache.coyote.AbstractProtocol.destroy Destroying ProtocolHandler ["http-nio-8888"]
03-Feb-2017 03:30:04.771 INFO [Thread-25] org.apache.coyote.AbstractProtocol.destroy Destroying ProtocolHandler ["ajp-nio-8009"]
03-Feb-2017 03:30:51.085 INFO [main] org.apache.catalina.core.AprLifecycleListener.lifecycleEvent The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: /usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
03-Feb-2017 03:30:51.449 INFO [main] org.apache.coyote.AbstractProtocol.init Initializing ProtocolHandler ["http-nio-8888"]
03-Feb-2017 03:30:51.506 INFO [main] org.apache.tomcat.util.net.NioSelectorPool.getSharedSelector Using a shared selector for servlet write/read
03-Feb-2017 03:30:51.517 INFO [main] org.apache.coyote.AbstractProtocol.init Initializing ProtocolHandler ["ajp-nio-8009"]
03-Feb-2017 03:30:51.554 INFO [main] org.apache.tomcat.util.net.NioSelectorPool.getSharedSelector Using a shared selector for servlet write/read
03-Feb-2017 03:30:51.563 INFO [main] org.apache.catalina.startup.Catalina.load Initialization processed in 3425 ms
03-Feb-2017 03:30:51.634 INFO [main] org.apache.catalina.core.StandardService.startInternal Starting service Catalina
03-Feb-2017 03:30:51.635 INFO [main] org.apache.catalina.core.StandardEngine.startInternal Starting Servlet Engine: Apache Tomcat/8.0.24
03-Feb-2017 03:31:04.670 INFO [api.bpreducer.com-startStop-1] org.apache.jasper.servlet.TldScanner.scanJars At least one JAR was scanned for TLDs yet contained no TLDs. Enable debug logging for this logger for a complete list of JARs that were scanned but no TLDs were found in them. Skipping unneeded JARs during scanning can improve startup time and JSP compilation time.
03-Feb-2017 03:31:08.543 INFO [127.0.0.1-startStop-1] org.apache.catalina.startup.HostConfig.deployDirectory Deploying web application directory /opt/lucee/tomcat/webapps/ROOT
03-Feb-2017 03:31:17.501 INFO [127.0.0.1-startStop-1] org.apache.jasper.servlet.TldScanner.scanJars At least one JAR was scanned for TLDs yet contained no TLDs. Enable debug logging for this logger for a complete list of JARs that were scanned but no TLDs were found in them. Skipping unneeded JARs during scanning can improve startup time and JSP compilation time.
03-Feb-2017 03:31:17.659 INFO [127.0.0.1-startStop-1] org.apache.catalina.startup.HostConfig.deployDirectory Deployment of web application directory /opt/lucee/tomcat/webapps/ROOT has finished in 9,115 ms
03-Feb-2017 03:31:17.666 INFO [main] org.apache.coyote.AbstractProtocol.start Starting ProtocolHandler ["http-nio-8888"]
03-Feb-2017 03:31:17.687 INFO [main] org.apache.coyote.AbstractProtocol.start Starting ProtocolHandler ["ajp-nio-8009"]
03-Feb-2017 03:31:17.689 INFO [main] org.apache.catalina.startup.Catalina.start Server startup in 26126 ms
03-Feb-2017 03:31:18.366 INFO [api-bpreducer-com-api-smtp-confl2-startStop-1] org.apache.catalina.startup.HostConfig.deployDescriptor Deploying configuration descriptor /opt/lucee/tomcat/conf/Catalina/api-bpreducer-com-api-smtp-confl2/ROOT.xml
03-Feb-2017 03:31:18.508 INFO [api-bpreducer-com-api-smtp-confl2-startStop-1] org.apache.catalina.startup.HostConfig.deployDescriptor Deployment of configuration descriptor /opt/lucee/tomcat/conf/Catalina/api-bpreducer-com-api-smtp-confl2/ROOT.xml has finished in 142 ms
03-Feb-2017 05:04:01.527 SEVERE [http-nio-8888-exec-30] com.microsoft.sqlserver.jdbc.TDSParser.throwUnexpectedTokenException ConnectionID:229 ClientConnectionId: ea3556d2-24d6-473b-9f80-9adbce53dcc6: getNextResult: Encountered unexpected unknown token (0x0)
03-Feb-2017 05:04:01.527 SEVERE [http-nio-8888-exec-30] com.microsoft.sqlserver.jdbc.TDSReader.throwInvalidTDSToken ConnectionID:229 ClientConnectionId: ea3556d2-24d6-473b-9f80-9adbce53dcc6 got unexpected value in TDS response at offset:0



This VM got rebooted at 3:30!

I can see a problem with a query against an MSSQL server, but should that really crash the server? At the very least it should kill the process (which monit would then restart)...

Can you find out why it crashes, and how I could fix it, or how to set up monit to detect the crash and restart the Lucee service?

Thanks for your help!!!




Don Quist

Feb 3, 2017, 12:24:49 PM
to Lucee
Hopefully someone can give you insight into the real issue behind the silent crash, but as for monit, you can use something like this to check Tomcat's port directly: https://gist.github.com/joeyimbasciano/3996208
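
Roughly along these lines (just a sketch: the pidfile path, request URL and timings are placeholders to adapt to your setup). Using "protocol http" makes monit require an actual HTTP response instead of merely opening a TCP connection:

check process tomcat with pidfile /path/to/tomcat.pid
        start program = "/etc/init.d/tomcat start"
        stop program  = "/etc/init.d/tomcat stop"
        # fail unless Tomcat returns a valid HTTP response within the timeout
        if failed port 8888 protocol http
                request "/index.cfm"
                with timeout 15 seconds
        then restart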

Stéphane MERLE

Feb 4, 2017, 3:47:25 AM
to Lucee
Hi,
thanks for your help...

I already have this for monit:
check process tomcat with pidfile /opt/lucee/tomcat/work/tomcat.pid
        start   "/etc/init.d/lucee_ctl start"
        stop    "/etc/init.d/lucee_ctl stop"
        restart "/etc/init.d/lucee_ctl restart"
        if failed port 8888 then alert
        if failed port 8888 for 5 cycles then restart
        if failed port 8888 for 10 cycles then exec /sbin/reboot


and it's not triggering...
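
I suppose the plain port test passes because the TCP connect itself still succeeds; it's only an actual request that never gets an answer. A quick manual test (the timeout value is arbitrary):

# the connection opens, but presumably no HTTP response arrives before the timeout
curl -v --max-time 10 http://127.0.0.1:8888/monitoring/test.cfm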

I can see sockets on port 8888:

tcp      415      0 127.0.0.1:8888          127.0.0.1:46667         CLOSE_WAIT  999        229261      1050/java
tcp        1      0 127.0.0.1:8888          127.0.0.1:45879         CLOSE_WAIT  999        224405      1050/java
tcp      415      0 127.0.0.1:8888          127.0.0.1:46675         CLOSE_WAIT  999        229316      1050/java
tcp      414      0 127.0.0.1:8888          127.0.0.1:46056         CLOSE_WAIT  999        225484      1050/java
tcp        0      0 127.0.0.1:47282         127.0.0.1:8888          FIN_WAIT2   0          0           -
tcp      496      0 127.0.0.1:8888          127.0.0.1:46562         CLOSE_WAIT  999        228624      1050/java
tcp      496      0 127.0.0.1:8888          127.0.0.1:45890         CLOSE_WAIT  999        224466      1050/java
tcp      496      0 127.0.0.1:8888          127.0.0.1:45851         CLOSE_WAIT  999        224225      1050/java
tcp        1      0 127.0.0.1:8888          127.0.0.1:46660         CLOSE_WAIT  999        229225      1050/java
tcp      296      0 127.0.0.1:8888          127.0.0.1:44622         CLOSE_WAIT  999        216563      1050/java
tcp      496      0 127.0.0.1:8888          127.0.0.1:45031         CLOSE_WAIT  999        218802      1050/java
tcp      303      0 127.0.0.1:8888          127.0.0.1:44477         CLOSE_WAIT  999        215959      1050/java
tcp      414      0 127.0.0.1:8888          127.0.0.1:46000         CLOSE_WAIT  999        225158      1050/java
tcp      414      0 127.0.0.1:8888          127.0.0.1:45782         CLOSE_WAIT  999        223801      1050/java
tcp      414      0 127.0.0.1:8888          127.0.0.1:45724         CLOSE_WAIT  999        223448      1050/java
tcp      414      0 127.0.0.1:8888          127.0.0.1:47330         ESTABLISHED 999        236796      1050/java
tcp      496      0 127.0.0.1:8888          127.0.0.1:45195         CLOSE_WAIT  999        219808      1050/java
tcp      415      0 127.0.0.1:8888          127.0.0.1:46114         CLOSE_WAIT  999        225855      1050/java


I have no clue what is stuck...
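
Next time it hangs I could grab a thread dump to see where the request threads are blocked, something like this (assuming the JDK tools sit under /opt/lucee/jdk/bin; 1050 is the java PID from the netstat output above):

# SIGQUIT makes the JVM write a full thread dump to catalina.out
kill -3 1050
# or with the JDK tool, writing the dump to a file
/opt/lucee/jdk/bin/jstack -l 1050 > /tmp/tomcat-threads.txt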