Lucee freezes on 2 Ubuntu 14.04 systems

234 views
Skip to first unread message

Arc Net

unread,
Jun 20, 2016, 6:02:03 PM6/20/16
to Lucee
I am running a CFML web framework which I've migrated to Lucee 4.5 from Railo 4. Just recently 2 VMs running Lucee both began locking up (website unresponsive). One server runs Lucee 4.5.2, the other 4.5.3. The sites on these servers are not under high load. Trying to restart Lucee after it becomes unresponsive results in this error being dumped to console:

host# service lucee_ctl stop
 * Shutting down Lucee: . . . . . . . . . . [FAIL]
 * The Tomcat/Lucee process is not responding. Forcing shutdown...
 * Forcing Lucee Shutdown: [DONE]
host# Jun 21, 2016 7:04:53 AM org.apache.catalina.startup.Catalina stopServer
SEVERE: Could not contact localhost:8005. Tomcat may not be running.
Jun 21, 2016 7:04:53 AM org.apache.catalina.startup.Catalina stopServer
SEVERE: Catalina.stop:
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at java.net.Socket.connect(Socket.java:538)
        at java.net.Socket.<init>(Socket.java:434)
        at java.net.Socket.<init>(Socket.java:211)
        at org.apache.catalina.startup.Catalina.stopServer(Catalina.java:450)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.catalina.startup.Bootstrap.stopServer(Bootstrap.java:400)
        at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:487)

The stop command failed. Attempting to signal the process to stop through OS signal.

In the logs I get a lot of these errors when stopping the server:

21-Jun-2016 07:06:12.859 WARNING [****-dev-04-vhost-****-confl1-startStop-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [ROOT] appears to have started a thread named [Thread-9352] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
 java.lang.Thread.sleep(Native Method)
 lucee.commons.io.SystemUtil.sleep(SystemUtil.java:623)
 lucee.runtime.cache.ram.RamCache$Controler.run(RamCache.java:166)

21-Jun-2016 07:06:12.882 WARNING [127.0.0.1-startStop-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [ROOT] appears to have started a thread named [Thread-6] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
 java.lang.Thread.sleep(Native Method)
 lucee.commons.io.SystemUtil.sleep(SystemUtil.java:623)
 lucee.runtime.engine.Controler.run(Controler.java:132)

21-Jun-2016 07:06:12.905 WARNING [****-04-vhost-****-confl1-startStop-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [ROOT] appears to have started a thread named [Thread-36] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
 java.lang.Thread.sleep(Native Method)
 lucee.commons.io.SystemUtil.sleep(SystemUtil.java:623)
 lucee.runtime.cache.ram.RamCache$Controler.run(RamCache.java:166)

21-Jun-2016 07:06:12.939 WARNING ****-04-vhost-****-confl1-startStop-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [ROOT] appears to have started a thread named [Thread-28] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
 java.lang.Thread.sleep(Native Method)
 lucee.commons.io.SystemUtil.sleep(SystemUtil.java:623)
 lucee.runtime.cache.ram.RamCache$Controler.run(RamCache.java:166)

and these:

"ERROR","Thread-1382897","06/05/2016","15:22:49","",";org.apache.tomcat.util.threads.TaskThread.stop0(java.lang.Object);java.lang.NoSuchMethodException: org.apache.tomcat.util.threads.TaskThread.stop0(java.lang.Object)
        at java.lang.Class.getMethod(Class.java:1786)
        at lucee.commons.io.StopThread.run(SystemUtil.java:1085)

and these:

"ERROR","Thread-1406273","06/06/2016","11:19:49","","Thread.stop(Throwable) is not supported by this JVM and failed with UnsupportedOperationException;java.lang.Throwable;java.lang.Throwable
         at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:170)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:1782)
        ....

The module where the last error is thrown can be any long-running task but not sure if this error is the cause of lockups or a side-effect of timeouts or the service restart.

Lucee was installed via installer on website so is a stock configuration (OpenJDK 1.8 / Tomcat 8.0 / Apache 2.4 using AJP). I've tried different settings for min and max heap sizes in setenv.sh but I've seen no noticeable increase or reduction in these lockups (which happen at least once a day now but otherwise seem random).

Jordan Michaels

unread,
Jun 20, 2016, 9:30:47 PM6/20/16
to lu...@googlegroups.com
Hi,

When the tomcat server can't be reached to shut down, the most likely cause is that it has crashed. Judging by the errors you're seeing, you most likely do have a problem in your code somewhere. I suggest using FusionReactor in a separate instance of Tomcat in order to monitor your Lucee instance and see if that can help you find more clues about what's causing your tomcat to crash.

--
Kind regards,
Jordan Michaels
Vivio Technologies
--
Win a ticket to dev.objective from Lucee via Twitter, see http://bit.ly/1UbTMWj for details, good luck and see you there...
---
You received this message because you are subscribed to the Google Groups "Lucee" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lucee+un...@googlegroups.com.
To post to this group, send email to lu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/lucee/8eea9ed1-f06a-4c7c-ab83-b02e2c63297b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jace Coleman

unread,
Jun 21, 2016, 11:41:24 AM6/21/16
to Lucee
We had a similar problem recently when migrating from Railo 4 to Lucee 4.5.  The production server would slow to a crawl, and consume 100% cpu until it finally became unresponsive and needed a restart.  We were restarting several times a day.  Turning debugging off via the server administrator on our production instances resolved the problem.  Hope this helps.

Arc Net

unread,
Jun 23, 2016, 3:11:23 AM6/23/16
to Lucee
Thank you for the responses, I know how difficult these things can be to debug. Here's what I know 36 hours later:

1.) Rolling back all patches to Lucee did not resolve the issue. I am now running 4.5 unpatched and it crashed again twice this morning.
2.) I have recently introduced a new website context and enabled debugging (all options on) for the first time ever. I only did this out of curiosity to see what sort of output it generates and because the context is not a production site (practically no traffic). The timing corresponds very closely with when the crashes began indicating that Jace may be on to something! I have just turned debugging off and will have to wait and see.

I am in the process of installing Fusion Reactor per Jordan's suggestion however I'm 99% sure this isn't a code bug. The reason for my certainty is that I'm reusing the same code that has run 50+ websites without crashing on Railo 3 and 4 (same OS setup). The new sites running on Lucee don't have any significant changes over the existing codebase. Moreover the sites were running fine on Lucee 4.5.1 for about 5 months until last week. There are some minor recent code changes so it isn't impossible, just unlikely.

What has changed since the crashes began was the introduction of a new web context (again reusing 99.9% of existing code) but the key difference is this context had debugging turned on in the web administrator.

This mornings' crashes generated the following error:

java.lang.ThreadDeath
        at java.lang.Thread.stop(Thread.java:850)
        at lucee.commons.io.StopThread.run(SystemUtil.java:1068)
23-Jun-2016 08:11:19.248 SEVERE [ajp-nio-8009-exec-4] org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun
 java.lang.ThreadDeath
        at java.lang.Thread.stop(Thread.java:850)
        at lucee.commons.io.StopThread.run(SystemUtil.java:1068)

23-Jun-2016 08:11:19.968 SEVERE [ajp-nio-8009-exec-9] org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun
 java.lang.ThreadDeath
        at java.lang.Thread.stop(Thread.java:850)
        at lucee.commons.io.StopThread.run(SystemUtil.java:1068)

Exception in thread "ajp-nio-8009-exec-9" Exception in thread "ajp-nio-8009-exec-10" java.lang.IllegalMonitorStateException
        at java.util.concurrent.locks.ReentrantLock$Sync.tryRelease(ReentrantLock.java:151)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1261)
        at java.util.concurrent.locks.ReentrantLock.unlock(ReentrantLock.java:457)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:474)
        at org.apache.tomcat.util.threads.TaskQueue.poll(TaskQueue.java:85)
        at org.apache.tomcat.util.threads.TaskQueue.poll(TaskQueue.java:31)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(Thread.java:745)
Exception in thread "ajp-nio-8009-exec-6" java.lang.IllegalMonitorStateException
        at java.util.concurrent.locks.ReentrantLock$Sync.tryRelease(ReentrantLock.java:151)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1261)
        at java.util.concurrent.locks.ReentrantLock.unlock(ReentrantLock.java:457)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:474)
        at org.apache.tomcat.util.threads.TaskQueue.poll(TaskQueue.java:85)
        at org.apache.tomcat.util.threads.TaskQueue.poll(TaskQueue.java:31)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(Thread.java:745)

The java.lang.IllegalMonitorStateException repeats 7 times. Annoyingly the log doesn't show the time of the errors so I don't know if they happened during the crash or the service restart. Just guessing here but I'd suggest this represents some kind of race condition or endless recursion on the "ReentrantLock" leading to the main thread becoming unresponsive.

Arc Net

unread,
Jul 26, 2016, 10:26:46 PM7/26/16
to Lucee
Hi All,

I found the solution to the problem outlined below. I wanted to share with the group because I believe this issue will catch others unaware.

The problem appears to be a problem with how thread.sleep() works in Lucee and/or Java 1.8. Prior to migrating from Raillo 4 / Java 1.7 I never had this issue.

The code that triggers the issue is a hack for the lack of a CFML Sleep() implementation. Newer versions of Railoo/Lucee provide a sleep function but this was not always true. In the past if you wanted a request to wait you did it using Java native methods like so:

var thread = CreateObject('java', 'java.lang.Thread');
thread.sleep(<ms>);

Since migrating to a new system based on Ubuntu 14.04 and Lucee 4 the code above appears to be causing server lockups, especially where concurrency is involved. I was able to solve the lockups by changing the above code to:

Sleep(<ms>); // Native CFML function introduced in Railo/Lucee 4

I am aware that calling Java native methods is always risky but I wanted people to know that thread.sleep() appears to cause server lockups in Lucee. I should also point out that my code doesn't explicitly kill the Java thread so it might not necessarily be thread.sleep() that is the issue but failure to explicitly kill threads (maybe Java/Lucee no longer cleans up Java native threads)? My code also runs several requests in parallel (all using thread.sleep()) so this may be a concurrency issue.

I'm curious if anybody was aware of this issue or knows the precise reason it causes the server to become unresponsive.
Reply all
Reply to author
Forward
0 new messages