RE: [ColdBox 3.5 Final] Strange behaviors after upgrade

59 views
Skip to first unread message

br...@bradwood.com

unread,
Jun 1, 2012, 7:42:04 PM6/1/12
to ColdBox List
Ok, after a significant amount of testing I believe I have determined the source of my server instabilities after our upgrade to 3.5 as well as the change in 3.5 that triggered it.

To start with, I devised a jMeter test that only hit the ColdBox sample applications randomly at about 5 req/sec to test outside of my application.  I am attaching the .jmx file for anyone who wants to try it.  (You'll want to edit the property defined in the "test Plan" node that sets the server used for all the HTTP samplers.)  I then monitored running requests, memory usage, and stack traces with SeeFusion.

The symptoms were completely reproducible hitting the ColdBox sample apps with ColdBox 3.5 and when I rolled the ColdBox version back to 3.1, I could run the test overnight without a blip in performance.

To review, the symptoms where hanging requests (always trying to access the CGI scope, or path info which is pulled from IIS via the JRUN connector), high memory usage, and the server eventually refusing requests and having to be restarted.

If you open the attached image called SeeFusion graphs.png, you will see a test using CB 3.1 and 3.5 side-by-side. 
  • In the 3.1 test, it was run overnight with a steady 5-6 requests processed a minute and memory showed a healthy saw-tooth pattern that did not climb. 
  • In the 3.5 test, the number of running requests tapered off over the course of an hour until the server was unresponsive.  Memory usage was erratic and climbed very high during the test. 
    Also, not shown in this screenshot, are 3 requests that hung for 5 minutes apiece during the 3.5 test.  Open the attached file "Example long-running request with ColdBox 3.5.txt" to see an example of what they were stuck on while they hung.  You can see that CF is trying to connect to IIS to get the real path of the template being processed.
The breakthrough came when I opened up the CF Server Monitor to take a look at the CFThreads that were running (SeeFusion doesn't show these unless you look at the full stack trace).
Look at the attached image called "CFThread queues with ColdBox 3.5.png" and you will see that literally thousands of CFThreads were being spawned by ColdBox, all but 10 of which were being queued of course.  After about 5 minutes, the queues would clear out, but memory usage would climb substantially during this time and performance would degrade.

To figure out what was going on during those spikes, refer to the attached file "example full thread dump with ColdBox 3.5.txt" and look at thread "cfthread-3".  You'll see this was a thread spawned to reap the object store in CacheBox.  Before it reaped, it tapped LogBox to log its operation which hit the asychRollingFileAppender.  The file appender decided to rotate the log file and when it tried to create the zip.cfc utility, it hung trying to access the CGI scope, which sort of makes sense since we're inside of a CFThread, and those don't have access to the CGI scope after the initial HTTP request has closed when connecting to the web server via a connector.  (http://help.adobe.com/en_US/ColdFusion/9.0/CFMLRef/WSc3ff6d0ea77859461172e0811cbec22c24-75a8.html)

When the logging CacheBox reap hung, it help open its exclusive named lock on my CacheBox instance, which in turn, blocked every other CFThread that was spawned in an attempt to reap as well. See the following threads in the full thread dump: cfthread-11, cfthread-16, cfthread-15, cfthread-14, cfthread-8, cfthread-12, cfthread-10, and cfthread-7.

The long-running requests I was seeing SeeFusion which weren't threads would happen at the same time.  I'm not sure why since a regular HTTP thread should have access to CGI vars, but I assume it was due to some sort of contention in the JRUN connector.  I'm open to the idea that we have something configured wonky there.  (CF9.01, IIS7.5)

So, what changed in ColdBox 3.5?  Good question.  The answer is the addition of lines 587 and 651 in system\cache\providers\CacheBoxProvider.cfc that log an info message.  Funny thing is, there's nothing really wrong with those two lines of code.  If you happen to have the asynch rolling file appender configured, then you'll eventually hit line 73 of system\core\util\Util.cfc which is where the reaping thread would hang every time.

instance.ioFile      = CreateObject("java","java.io.File");


What the hey, there's nothing related to the CGI scope in that line.  We're just setting something into the variables scope, and variables is searched prior to CGI.  Go back to "cfthread-3" in "example full thread dump with ColdBox 3.5.txt" and you'll see that since variables.instance doesn't exist yet, CF is searching through all the scopes it can before it gives up and creates variables.instance. 

This may be the most subtle and odd fix ever, but I was able to make all of the ColdBox 3.5 issues completely disappear by adding this line into zip.cfc
variables.instance   = {};

This defined instance in the variables scope, which stopped the built-in scope hunting from ever reaching CGI, which allowed the log file to rotate, which allowed CacheBox to log its reap, which didn't backlog thousands of reap threads, that didn't eat up all my memory and cause other random requests to to also hang when accessing the IIS connectors.

Takeaways:
  • For starters, I'd say that CF shouldn't be trying to scope hunt in the CGI scope when it's in a CFThread.  I'm not sure why CF is hanging when it does, but I'd love to get feedback on anything that might be screwy with our JRUN/IIS connectors
  • My change to zip.cfc doesn't really fix the issue, it just avoids it really.  It makes me nervous that anything that kicks off scope-hunting inside a CFThread can have this kind of results.
  • In theory, CacheBox also shouldn't spawn thousands of reaping threads.  Perhaps the object store should save some sort of state to say, "Hey, I'm reaping now, no more threads please."  Then, when reaping completes, reset that flag back.  In my instance, the object store would still never get reaped since the thread eventually errors out, but at least it wouldn't create thousands of waiting threads.
  • Anything else I'm missing, or any input people want to share?  I'd love to see some people grab the jMeter test and give it a try.  I want to know if the CGI access inside a thread hanging like that is normal or not.
Thanks!

~Brad
-------- Original Message --------
Subject: [ColdBox 3.5 Final] Strange behaviors after upgrade
From: <br...@bradwood.com>
Date: Tue, May 22, 2012 5:58 pm
To: "ColdBox List" <col...@googlegroups.com>

Hello all, after releasing ColdBox 3.5 to our production servers as part of a release last week, we began seeing very weird behaviors across all 4 of our production servers which adversely affected performance to the point where we eventually had to revert back to ColdBox 3.1 until I can figure out the source of the issues.

I haven't ruled out the possibility of something stupid going on in my own code base, but I ran many hours of jmeter tests and can very reliably produce the behaviors with 3.5 and they completely go away as soon as I swap out for the previous version of ColdBox.  I wanted to see if anyone else has seen these issue at all.  My environment is 64-bit Windows, CF9 Ent, IIS7.

The three main behaviors I saw were:

  • Randomly hanging requests.  These would pop up within 20-30 minutes of any load on the server, and affected any page on the site.  They would time out after about 5 minutes.  A number of different pages would hang, but in EVERY case, the line of code was accessing a CGI-scoped variable and the top part of the stack trace was identical.  My understanding of CGI vars is that CF talks to IIS over the JRUN connector and asks it for the headers on the request.  The threads were all stuck at a socketRead() and could not be killed (though they would time out eventually)

Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.3-b01):"jrpp-6" runnable
            at java.net.SocketInputStream.socketRead0(Native Method)
            at java.net.SocketInputStream.read(SocketInputStream.java:129)
            at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
            at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
            at jrun.servlet.jrpp.ProxyEndpoint.readFully(ProxyEndpoint.java:581)
            at jrun.servlet.jrpp.ProxyEndpoint.readFully(ProxyEndpoint.java:573)
            at jrun.servlet.jrpp.ProxyEndpoint.readString(ProxyEndpoint.java:622)
            at jrun.servlet.jrpp.ProxyEndpoint.getHeader(ProxyEndpoint.java:762)
            at javax.servlet.http.HttpServletRequestWrapper.getHeader(HttpServletRequestWrapper.java:124)
            at javax.servlet.http.HttpServletRequestWrapper.getHeader(HttpServletRequestWrapper.java:124)
            at coldfusion.runtime.CgiScope.resolve(CgiScope.java:221)
            at coldfusion.runtime.CgiScope.resolveName(CgiScope.java:274)
            at coldfusion.runtime.Scope.get(Scope.java:60)
            at coldfusion.runtime.ScopeSearchResult.searchForSubkeys(ScopeSearchResult.java:104)
            at coldfusion.runtime.LocalScope.search(LocalScope.java:417)
            at coldfusion.runtime.NeoPageContext.searchScopes(NeoPageContext.java:738)
            at coldfusion.runtime.NeoPageContext.searchScopes(NeoPageContext.java:662)
            at coldfusion.runtime.CfJspPage.resolveCanonicalName(CfJspPage.java:1718)
            at coldfusion.runtime.CfJspPage._resolve(CfJspPage.java:1677)
            at coldfusion.runtime.CfJspPage._resolveAndAutoscalarize(CfJspPage.java:1812)
            at coldfusion.runtime.CfJspPage._resolveAndAutoscalarize(CfJspPage.java:1805)

  • Higher-than-normal memory usage  My servers are normally happy around 1.5 to 2.5 GBs of heap used.  Memory usage jumped to 4 - 6GB of heap in use.  (Total heap is 8GB)  I think there's a chance the high memory usage was caused by the hanging requests, but that's just a hunch.
  • Null Pointer Exceptions on every request  The worst behavior of all is after a few hours of uptime, servers would just completely fall over and begin returning the following error on every single request.  What was interesting was our non-ColdBox legacy site on the same server would respond fine, but the ColdBox site would just begin erroring out on every page hit.  This would continue until the ColdFusion service was restarted.  The error was deep in the bowels of Java and usually seems to be related to resolving interface implemntations such as the trace below which was thrown when Controller.cfc tried to create an instance of coldbox.system.ioc.Injector.  It's interesting to note that Injector implements an interface in 3.5, but not in the previous version we rolled back to.

java.lang.NullPointerException
    at coldfusion.runtime.InterfaceTemplateProxyCache.getTemplateProxyFromPath(InterfaceTemplateProxyCache.java:125)
    at coldfusion.runtime.TemplateProxy.verifyInterfaceImpl(TemplateProxy.java:1072)
    at coldfusion.runtime.TemplateProxy.verifyInterfacesImpl(TemplateProxy.java:880)
    at coldfusion.cfc.ComponentProxyFactory.getProxy(ComponentProxyFactory.java:67)
    at coldfusion.runtime.CFPage.CreateObject(CFPage.java:4827)
    at coldfusion.runtime.CFPage.CreateObject(CFPage.java:4841)
    at cfController2ecfc1193051760$funcINIT.runFunction(D:\websites\wwwroot\common\coldbox\system\web\Controller.cfc:62)

Let me know if any of you recognize any of these errors or if you have any ideas.  I'm going to keep running tests to see if I can narrow down what causes these behaviors.

Thanks!

~Brad
Example long-running request with ColdBox 3.5.txt
CFTHread queues with ColdBox 3.5.png
example full thread dump with ColdBox 3.5.txt
hit coldbox samples.jmx
SeeFusion graphs.png

Luis Majano

unread,
Jun 1, 2012, 8:51:01 PM6/1/12
to col...@googlegroups.com
WOW, very interesting indeed Brad.

I added the fix as well. or the avoidance as well.  The CacheBox reaps only occur on frequencies, not sure if it should keep track of reaping and only reap if the other reap has not finished?

Interesting theory though, can you create a ticket on exploring reaping contention and monitoring.
--
You received this message because you are subscribed to the Google Groups "ColdBox Platform" group.
To post to this group, send email to col...@googlegroups.com
To unsubscribe from this group, send email to coldbox-u...@googlegroups.com
For more options, visit this group at http://groups-beta.google.com/group/coldbox
For News, visit http://blog.coldbox.org
For Documentation, visit http://wiki.coldbox.org

Attachments:
- Example long-running request with ColdBox 3.5.txt
- CFTHread queues with ColdBox 3.5.png
- example full thread dump with ColdBox 3.5.txt
- hit coldbox samples.jmx
- SeeFusion graphs.png

Luis Majano

unread,
Jun 1, 2012, 8:54:56 PM6/1/12
to col...@googlegroups.com
Brad, I committed the changes to the dev branch. Can you try now?

Andrew Scott

unread,
Jun 2, 2012, 1:28:26 AM6/2/12
to col...@googlegroups.com
Can I also add that I am sure this was also the cause of my test failing sometime ago as well, with ORM stuff. I too had an issue with the caching, where I would get error messages or not, sometimes IIS hung sometimes ColdFusion hung, all with different error messages.

But the most common was with too many concurrent connections to the DB using ORM.

There was a very lengthy thread on this around August or September, and I am wondering if it is related as well. As Earlier versions of ColdBox also did not have this issue, in the end I gave up because the site wasn't going to reach the amount of traffic.

-- 
Regards,
Andrew Scott
WebSite: http://www.andyscott.id.au/

br...@bradwood.com

unread,
Jun 2, 2012, 5:01:58 PM6/2/12
to col...@googlegroups.com
Thanks for incorporating the fix quickly.  I'll give the bleeding edge a whirl Monday morning when I get back to work.

I created ticket #1368 for the multiple reaping threads behavior.  Also, I think ticket #1321 is quite possibly caused by the same issue.

br...@bradwood.com

unread,
Jun 4, 2012, 5:58:48 PM6/4/12
to col...@googlegroups.com
Luis, I've been hitting the sample apps as well as my company's app on my stage box with jMeter all day long against bleeding edge and everything is running very smoothly with no signs of hung requests, queued threads, or high memory usage.

Ironically, at first I didn't think your commit would fix it at first since you didn't preface instance with "variables", so I thought it would still "scope hunt' on the line
instance = structnew();

That doesn't seem to be the case though, and the addition of that line has prevented zip.cfc from trying to look through the CGI scope and hanging inside a cfthread.

Luis Majano

unread,
Jun 4, 2012, 6:09:15 PM6/4/12
to col...@googlegroups.com
Cool.  

Yes, the usage of "variables." is unnecessary as that is the default implicit scope.  It actually is much faster to NOT use it than to use it as if you do then ColdFusion looks for "variables.variables.name" first.

I use it whenever I have the same variable name in either variables, local or arguments scope to distinguish them.
Reply all
Reply to author
Forward
0 new messages