Ok, after a significant amount of testing I believe I have determined the source of my server instabilities after our upgrade to 3.5 as well as the change in 3.5 that triggered it.
To start with, I devised a jMeter test that only hit the ColdBox sample applications randomly at about 5 req/sec to test outside of my application. I am attaching the .jmx file for anyone who wants to try it. (You'll want to edit the property defined in the "test Plan" node that sets the server used for all the HTTP samplers.) I then monitored running requests, memory usage, and stack traces with SeeFusion.
The symptoms were completely reproducible hitting the ColdBox sample apps with ColdBox 3.5 and when I rolled the ColdBox version back to 3.1, I could run the test overnight without a blip in performance.
To review, the symptoms where hanging requests (always trying to access the CGI scope, or path info which is pulled from IIS via the JRUN connector), high memory usage, and the server eventually refusing requests and having to be restarted.
If you open the attached image called SeeFusion graphs.png, you will see a test using CB 3.1 and 3.5 side-by-side.
- In the 3.1 test, it was run overnight with a steady 5-6 requests processed a minute and memory showed a healthy saw-tooth pattern that did not climb.
- In the 3.5 test, the number of running requests tapered off over the course of an hour until the server was unresponsive. Memory usage was erratic and climbed very high during the test.
Also, not shown in this screenshot, are 3 requests that hung for 5 minutes apiece during the 3.5 test. Open the attached file "Example long-running request with ColdBox 3.5.txt" to see an example of what they were stuck on while they hung. You can see that CF is trying to connect to IIS to get the real path of the template being processed.
The breakthrough came when I opened up the CF Server Monitor to take a look at the CFThreads that were running (SeeFusion doesn't show these unless you look at the full stack trace).
Look at the attached image called "CFThread queues with ColdBox 3.5.png" and you will see that literally thousands of CFThreads were being spawned by ColdBox, all but 10 of which were being queued of course. After about 5 minutes, the queues would clear out, but memory usage would climb substantially during this time and performance would degrade.
To figure out what was going on during those spikes, refer to the attached file "example full thread dump with ColdBox 3.5.txt" and look at thread "cfthread-3". You'll see this was a thread spawned to reap the object store in CacheBox. Before it reaped, it tapped LogBox to log its operation which hit the asychRollingFileAppender. The file appender decided to rotate the log file and when it tried to create the zip.cfc utility, it hung trying to access the CGI scope, which sort of makes sense since we're inside of a CFThread, and those don't have access to the CGI scope after the initial HTTP request has closed when connecting to the web server via a connector. (
http://help.adobe.com/en_US/ColdFusion/9.0/CFMLRef/WSc3ff6d0ea77859461172e0811cbec22c24-75a8.html)
When the logging CacheBox reap hung, it help open its exclusive named lock on my CacheBox instance, which in turn, blocked every other CFThread that was spawned in an attempt to reap as well. See the following threads in the full thread dump: cfthread-11, cfthread-16, cfthread-15, cfthread-14, cfthread-8, cfthread-12, cfthread-10, and cfthread-7.
The long-running requests I was seeing SeeFusion which weren't threads would happen at the same time. I'm not sure why since a regular HTTP thread should have access to CGI vars, but I assume it was due to some sort of contention in the JRUN connector. I'm open to the idea that we have something configured wonky there. (CF9.01, IIS7.5)
So, what changed in ColdBox 3.5? Good question. The answer is the addition of lines 587 and 651 in system\cache\providers\CacheBoxProvider.cfc that log an info message. Funny thing is, there's nothing really wrong with those two lines of code. If you happen to have the asynch rolling file appender configured, then you'll eventually hit line 73 of system\core\util\Util.cfc which is where the reaping thread would hang every time.
instance.ioFile = CreateObject("java","java.io.File");
What the hey, there's nothing related to the CGI scope in that line. We're just setting something into the variables scope, and variables is searched prior to CGI. Go back to "cfthread-3" in "example full thread dump with ColdBox 3.5.txt" and you'll see that since variables.instance doesn't exist yet, CF is searching through all the scopes it can before it gives up and creates variables.instance.
This may be the most subtle and odd fix ever, but I was able to make all of the ColdBox 3.5 issues completely disappear by adding this line into zip.cfc
variables.instance = {};
This defined instance in the variables scope, which stopped the built-in scope hunting from ever reaching CGI, which allowed the log file to rotate, which allowed CacheBox to log its reap, which didn't backlog thousands of reap threads, that didn't eat up all my memory and cause other random requests to to also hang when accessing the IIS connectors.
Takeaways:
- For starters, I'd say that CF shouldn't be trying to scope hunt in the CGI scope when it's in a CFThread. I'm not sure why CF is hanging when it does, but I'd love to get feedback on anything that might be screwy with our JRUN/IIS connectors
- My change to zip.cfc doesn't really fix the issue, it just avoids it really. It makes me nervous that anything that kicks off scope-hunting inside a CFThread can have this kind of results.
- In theory, CacheBox also shouldn't spawn thousands of reaping threads. Perhaps the object store should save some sort of state to say, "Hey, I'm reaping now, no more threads please." Then, when reaping completes, reset that flag back. In my instance, the object store would still never get reaped since the thread eventually errors out, but at least it wouldn't create thousands of waiting threads.
- Anything else I'm missing, or any input people want to share? I'd love to see some people grab the jMeter test and give it a try. I want to know if the CGI access inside a thread hanging like that is normal or not.
Thanks!
~Brad
-------- Original Message --------
Subject: [ColdBox 3.5 Final] Strange behaviors after upgrade
From: <
br...@bradwood.com>
Date: Tue, May 22, 2012 5:58 pm
To: "ColdBox List" <
col...@googlegroups.com>
Hello all, after releasing ColdBox 3.5 to our production servers as part of a release last week, we began seeing very weird behaviors across all 4 of our production servers which adversely affected performance to the point where we eventually had to revert back to ColdBox 3.1 until I can figure out the source of the issues.
I haven't ruled out the possibility of something stupid going on in my own code base, but I ran many hours of jmeter tests and can very reliably produce the behaviors with 3.5 and they completely go away as soon as I swap out for the previous version of ColdBox. I wanted to see if anyone else has seen these issue at all. My environment is 64-bit Windows, CF9 Ent, IIS7.
The three main behaviors I saw were:
- Randomly hanging requests. These would pop up within 20-30 minutes of any load on the server, and affected any page on the site. They would time out after about 5 minutes. A number of different pages would hang, but in EVERY case, the line of code was accessing a CGI-scoped variable and the top part of the stack trace was identical. My understanding of CGI vars is that CF talks to IIS over the JRUN connector and asks it for the headers on the request. The threads were all stuck at a socketRead() and could not be killed (though they would time out eventually)
Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.3-b01):"jrpp-6" runnable
at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.read1(BufferedInputStream.java:256) at java.io.BufferedInputStream.read(BufferedInputStream.java:317) at jrun.servlet.jrpp.ProxyEndpoint.readFully(ProxyEndpoint.java:581)
at jrun.servlet.jrpp.ProxyEndpoint.readFully(ProxyEndpoint.java:573)
at jrun.servlet.jrpp.ProxyEndpoint.readString(ProxyEndpoint.java:622)
at jrun.servlet.jrpp.ProxyEndpoint.getHeader(ProxyEndpoint.java:762)
at javax.servlet.http.HttpServletRequestWrapper.getHeader(HttpServletRequestWrapper.java:124)
at javax.servlet.http.HttpServletRequestWrapper.getHeader(HttpServletRequestWrapper.java:124)
at coldfusion.runtime.CgiScope.resolve(CgiScope.java:221)
at coldfusion.runtime.CgiScope.resolveName(CgiScope.java:274)
at coldfusion.runtime.Scope.get(Scope.java:60)
at coldfusion.runtime.ScopeSearchResult.searchForSubkeys(ScopeSearchResult.java:104)
at coldfusion.runtime.LocalScope.search(LocalScope.java:417)
at coldfusion.runtime.NeoPageContext.searchScopes(NeoPageContext.java:738)
at coldfusion.runtime.NeoPageContext.searchScopes(NeoPageContext.java:662)
at coldfusion.runtime.CfJspPage.resolveCanonicalName(CfJspPage.java:1718)
at coldfusion.runtime.CfJspPage._resolve(CfJspPage.java:1677)
at coldfusion.runtime.CfJspPage._resolveAndAutoscalarize(CfJspPage.java:1812)
at coldfusion.runtime.CfJspPage._resolveAndAutoscalarize(CfJspPage.java:1805)
- Higher-than-normal memory usage My servers are normally happy around 1.5 to 2.5 GBs of heap used. Memory usage jumped to 4 - 6GB of heap in use. (Total heap is 8GB) I think there's a chance the high memory usage was caused by the hanging requests, but that's just a hunch.
- Null Pointer Exceptions on every request The worst behavior of all is after a few hours of uptime, servers would just completely fall over and begin returning the following error on every single request. What was interesting was our non-ColdBox legacy site on the same server would respond fine, but the ColdBox site would just begin erroring out on every page hit. This would continue until the ColdFusion service was restarted. The error was deep in the bowels of Java and usually seems to be related to resolving interface implemntations such as the trace below which was thrown when Controller.cfc tried to create an instance of coldbox.system.ioc.Injector. It's interesting to note that Injector implements an interface in 3.5, but not in the previous version we rolled back to.
java.lang.NullPointerException
at coldfusion.runtime.InterfaceTemplateProxyCache.getTemplateProxyFromPath(InterfaceTemplateProxyCache.java:125)
at coldfusion.runtime.TemplateProxy.verifyInterfaceImpl(TemplateProxy.java:1072)
at coldfusion.runtime.TemplateProxy.verifyInterfacesImpl(TemplateProxy.java:880)
at coldfusion.cfc.ComponentProxyFactory.getProxy(ComponentProxyFactory.java:67)
at coldfusion.runtime.CFPage.CreateObject(CFPage.java:4827)
at coldfusion.runtime.CFPage.CreateObject(CFPage.java:4841)
at cfController2ecfc1193051760$funcINIT.runFunction(D:\websites\wwwroot\common\coldbox\system\web\Controller.cfc:62)
Let me know if any of you recognize any of these errors or if you have any ideas. I'm going to keep running tests to see if I can narrow down what causes these behaviors.
Thanks!
~Brad