Joe, sorry to hear of your challenge. And while FR can and should be able to help, it can’t in all cases. Still, I do think there’s more you can leverage than you may realize.
First, though, let’s clarify that the FRAM heartbeat log is of no value in this case. That watches the FRAM instance, not your CF instance, and you’ll rarely need to watch the FRAM logs at all. Watch instead the CF instance’s logs ([fr]\instance\[instancename]\log). Of special value can be the resource-0.log (or higher numbers, as it fills and rolls over). See the columns after the REQ column, which give insight into the number of requests running and completing at each 5-second interval. If that running-requests count ever hits your “max simultaneous requests” limit on the CF Admin “Request Tuning” page, no more requests can run, which can make it seem that CF is hung. See the online help (in the FR interface) for more details on that log. There are also additional troubleshooting resources on the FR support site to help with interpreting the info FR can provide.
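As a sketch of how you might scan that column from the command line: the exact column layout varies by FR version (check the in-product help), so the field position and the sample data below are assumptions for illustration only.

```shell
# Fabricated sample of a resource-0.log; the real FR layout differs by version.
cat > /tmp/resource-0.log <<'EOF'
2012-11-19 17:00:00 REQ 12 3
2012-11-19 17:00:05 REQ 19 1
EOF
# Flag any 5-second interval where the running-request count (assumed here to
# be the field right after REQ) approaches a Max Simultaneous Requests of 20:
awk '$4 >= 18 {print $1, $2, "running:", $4}' /tmp/resource-0.log
```

Against a real log you’d point awk at the actual file and adjust the field number once you’ve confirmed which column holds the running count.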
Second, you don’t need to rely on the logs, as the FR CP alerts can notify you as well. You mention that you had set up the long-running and low-memory alerts, and it’s great that you’re leveraging them. Many miss the alerts (and the docs above), and so never fully leverage some of FR’s best troubleshooting capabilities.
Still, I’d argue that the generally far more valuable alert is the one you did not set up (or at least did not mention): running requests. Say your Max Simultaneous Requests is set to 20. You ought to set this running-requests alert to 18, so that if ever 18 requests are running at once, you get an alert.
If you get that alert and the requests are taking a long time, it will be useful for diagnosing why they were running long, because again, if the count hits 20 (if that’s your limit in the CF Admin), no other requests can run until slots free up under that limit. The alert includes a thread dump, with details for each running request, which will often help identify and resolve what’s making requests hang up. Again, there’s info in the docs and support resources for interpreting stack traces.
And if you get lots of running-requests alerts at 18, but they always show the requests being only seconds old or less, then that’s a sign that the limit of 20 isn’t big enough for your server, so you ought to increase it. Despite the old rule of thumb about it being 2-3 times the number of CPUs, I find that has no connection to reality, and some people can run 50 or 100 requests at once. The key is: what are they doing? If most requests are running CFML that’s sitting around waiting for some resource (like a database call, a CFHTTP request, or a web service call), then having more running at once consumes few resources. Again, the alert can clue you into what the requests are doing when many are running at once. (And again, if they are NOT running long when you get the alert, just ignore it until you get one that DOES show requests running a long time.)
Finally, you mention FR stopping writing its logs, and that you got a memory alert. First, what is your memory alert value set at? Hopefully it’s something like 20 (meaning alert when 20% is free), not 80 (which some set by mistake, thinking it means alert when 80% is used). Second, you refer to the CF out log. Good for you for knowing about that. You say you see it showing an FR thread dump. More important, do you see any outofmemory error in that log? Search the file for that string (outofmemory). (If you find it hard to sift through the logs because they fill with the dumps that FR puts there on each CP alert or stack trace done in the interface, note that FR no longer does that as of FR 4, so you may want to upgrade.)
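A quick way to do that search from the command line (the file name and path below are stand-ins; CF names and locates its out log differently across versions and instances):

```shell
# Fabricated sample line so the search is demonstrable; point grep at your
# real CF out log instead (its name and location vary by CF version/instance).
printf 'java.lang.OutOfMemoryError: Java heap space\n' > /tmp/cf-out.log
# Case-insensitive search, with line numbers, for any OOM message:
grep -in 'outofmemory' /tmp/cf-out.log
```

The text after the colon on the matching line (“Java heap space” here) is the part that matters most, as discussed below.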
It’s vital to know if an OOM condition is happening, and if so, what kind of outofmemory error it is.
If the OOM error says “heap space” or “gc overhead limit reached”, that would mean your heap is indeed filling. As tempting as it may be to increase the heap, there may be some other explanation for why it’s filling: perhaps a high rate of session creation, or heavy use of the query cache, etc. If you have the FR Extension for CF (FREC, free for use with FR 4 and above, another reason to upgrade for those still on FR 3), it creates yet another FR log, realtimestats.log, which tracks such things as the count of sessions and the size of the query cache, which can be valuable. Even if the logs stop being written, you may see an upward trend before that. Not even the CF Enterprise Server Monitor (in CF 8 and above) logs this sort of info, so it’s a powerful weapon in the FR arsenal.
(Also, if the heap is filling, you may not even be able to increase its size, if you’re still on a 32-bit OS, where you can’t raise it much above 1.3GB.)
If the OOM error says instead “unable to create new native thread”, or “out of swap space”, that may well mean that you should lower the heap, as each of those can mean that there’s pressure either within the address space (in the first case) or within the OS (in the second case, which may happen if other apps are running on the box and competing with CF for overall system memory.)
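For reference, the heap sizes are set via the -Xms/-Xmx arguments in CF’s jvm.config. The values below are purely illustrative, not recommendations; the right numbers depend on your OS, total RAM, and which of the OOM causes above you’re facing:

```
# jvm.config fragment -- illustrative values only, not a recommendation
java.args=-Xms512m -Xmx512m
```

On a 32-bit OS you’d be constrained to the roughly 1.3GB ceiling mentioned above, and for the “native thread” / “swap space” errors you might move these numbers down rather than up.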
So no, in some cases, problems cannot be “solved” by FR, nor can some even be “detected” or “protected against” by FR (such as some of these OOM errors.) But I hope that the info above may help you as you try to explore the problem.
And if you might be challenged to further interpret the info identified or might be pressed for time and need more immediate and direct assistance, note that there is available consulting support from Intergral, the makers of FR, at cfconsultant.com. (I’m one of the consultants who may help you if you acquire that service. I’m not an FR employee but an independent consultant who also helps people solve CF server problems, and help people work with FR to help with that nearly every day.)
That said, we here on the list (the Intergral folks and others) enjoy helping each other out, so if you have more questions, fire away.
/charlie
--
You received this message because you are subscribed to the Google Groups "FusionReactor" group.
To view this discussion on the web visit https://groups.google.com/d/msg/fusionreactor/-/DFPnBBdIZKUJ.
To post to this group, send email to fusion...@googlegroups.com.
To unsubscribe from this group, send email to fusionreacto...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/fusionreactor?hl=en.
Joe
Great to hear that it was helpful.
That said, it’s generally critical that you clarify exactly WHAT follows the outofmemory errors (on the same line, whether “heap space”, “gc overhead limit reached”, “out of swap space”, etc.) If you see nothing on the same line, keep searching the log around the time of the error for another line which should identify that. (Then again, you are on CF 7, which is Java 1.4, and I seem to recall that that version of the JVM sometimes did not offer details.)
It may be that on such an old VM, you may have to guess at what the problem could be. Again, the realtimestats.log (from FREC) could help spot some common problems (high sessions, high query caching), or the absence of them could point to a different problem. You could also try raising the heap, but since you don’t know for sure that you’re hitting that limit, raising it could cause other problems. How much total memory do you have on the box? And in a steady state, how much memory is CF using, and more important, how much does the OS report as unused?
As for being unable to locate realtimestats.log, it’s important to confirm whether you actually added FREC to the FR instance within CF. Do you recall doing that? And did you restart CF after doing it? (FREC doesn’t install itself automatically, the way the main FR installer does.) You can look at the Active Plugins page of the FR interface (in 4+) and see if it lists the FR CF plugin (among the few that will be shown.)
/charlie
Joe, a few things.
First, as for your FREC logs, maybe there is some limit of how well they work with JVM 1.4. I’ll have to leave the FR folks to reply to that.
Second, rather than focus on “all the threads” (as it seems you may be doing as you look for problems in your alerts), I’ll suggest something that may help you or other readers following this thread.
You can make the job of analysis a lot easier if you focus only on the thread ids that correspond to those listed at the top of the FR CP Alert, where it shows what requests were running, and for how long. Focus on those running longest, as they are listed in descending duration order.
Get the requestid (the left-most number in the lines showing each running request at the top) for a long-running request (running at least for a few seconds according to the report), then do a find (within the alert) for that number, which will generally take you to the bottom of the alert, where you can see those same requests again, but now as a set of about a dozen lines with details for each request.
One line is the requestid. Note the “execution time”, and again as long as it’s more than a few seconds, note the threadId (another line in that report), then search the alert for that threadId, which will take you back up into the middle of the alert where the stack traces for each thread are listed (there will be hundreds, but we’re only interested in those few that are associated with long-running CF Requests.) Then focus on what the stack trace (like you showed below) says for at least those long-running requests.
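Since the CP alert is plain text, those “find the id, then find its stack” hops work with any text search. A sketch with a fabricated alert excerpt (the layout, field names, and thread id below are made up for illustration; real FR alerts are laid out differently):

```shell
# Fabricated CP-alert excerpt -- the real FR alert layout differs; this only
# illustrates hopping from request detail to stack trace by thread id.
cat > /tmp/cp-alert.txt <<'EOF'
Running requests:
4711 /slow.cfm 95000ms
RequestID: 4711
ExecutionTime: 95000
ThreadID: web-18
"web-18" prio=5 ... java.net.SocketInputStream.socketRead0
EOF
# Every place the suspect thread id appears (request detail + its stack trace):
grep -n 'web-18' /tmp/cp-alert.txt
```

Saving the alert body to a file and grepping it this way beats scrolling through hundreds of threads by eye.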
And yes, you might find one or many in that state, and that may suggest a problem with the DB, but you may sometimes find that the longest-running request is the real culprit and the rest are hurt because of it (or not). I’ll also note that your stack trace actually shows the request waiting for a lock within the connection pool. That could indicate it’s not “waiting for the database” (yet) but instead “waiting for a connection from the connection pool” (though not necessarily), which might suggest all the more that there is one request (longer-running than it) that started things.
I’ll note as well that there can be another thread doing db i/o that would NOT be listed at the top of the FR CP alert: that’s a scheduler-nn thread (where nn is some number). Scheduler threads are NOT related at all to CF scheduled tasks (as some would reasonably think) but instead are background tasks that CF runs to do various things. One of those is to run the client storage purge, which many people set the CF Admin to use a database for, and in that case it could be that there’s a long-running scheduler-nn thread that could be perhaps causing some contention (for the DB or for the connection pool, less likely). Just throwing that out for you and others, as you try to read CP alerts and solve problems.
Indeed, along those lines, note that sometimes you may even want to ignore a CP alert. Let’s say you told FR to track the “running requests” alert, when a certain number of requests are running (as in your case 18). Just because you get the alert doesn’t mean there’s any problem at all: you may get an alert showing several or even dozens of requests “running”, but if you look closely at that list of them at the top, you may see that they are all running less than a second or a few seconds. In that case, I’d argue there may be no real value in looking further into what they are doing. Just let it go and wait until you do get an alert showing some or many long-running requests, and then look into them.
Finally, I’ll note that you have shown in your error log below some cookie-related errors, which on the surface may seem innocuous, but in fact they sometimes reflect the fact that a bunch of automated requests from search engine spiders and bots (or other automated tools, etc.) may have made a flood of requests. It could be that your problem is really caused by that. We see it all the time. In that case, look at the FR resource-nn.log, where you can see a column that tracks every 5 seconds how many requests are running and have run in the past 5 seconds. That can be great for spotting a sudden dramatic rise in requests, which may be the root cause. Then you can look at the FR request-nn.log to see the actual URLs, IP addresses, and more for each request run at that given time.
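To spot a bot flood in that request log, a frequency count of client IPs is often enough. A sketch (the field position and sample data are assumptions; check which column holds the client IP in your FR version first):

```shell
# Fabricated request-0.log sample; in a real FR request log, confirm the
# client-IP column for your version first (field 3 here is an assumption).
cat > /tmp/request-0.log <<'EOF'
r1 17:00:01 66.249.1.1 GET /index.cfm
r2 17:00:01 66.249.1.1 GET /page.cfm
r3 17:00:02 10.0.0.5 GET /index.cfm
EOF
# Top requesting IPs -- a sudden surge from one address often means a bot:
awk '{print $3}' /tmp/request-0.log | sort | uniq -c | sort -rn | head
```

An IP with a wildly disproportionate count is a good candidate for a reverse DNS lookup to see if it’s a spider.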
These are all the sort of things that I and the Intergral folks use daily in helping people solve CF server problems. There’s nearly always an explanation for what’s amiss, and it’s often not at all what someone expected (or what many in the community propose are “common problems”.) With the right tools, you can know what’s amiss, based on diagnostics pointing you to the problems, rather than guessing and trying things to see what may work (as many seem to recommend). All that said, while FR can’t always “protect you” from problems, it can certainly significantly aid you, whether with its interface, its logs, its alerts, etc.
Hope that is helpful.
/charlie
From: fusion...@googlegroups.com [mailto:fusion...@googlegroups.com] On Behalf Of Joe Copley
Sent: Monday, November 19, 2012 5:13 PM
To: fusion...@googlegroups.com
Subject: Re: [fusionreactor] jrun connection closed errors
Hi Charlie:
Were you able to see what the running request was doing in the alert (according to the stack trace for the request, shown within the thread dump of the alert)?
Also, you say you wish FRAM could trigger a thread dump. But FR did, within the JVM itself, in the FR CP alert. I suppose there could be a time when a thread dump could be useful other than when an alert condition arises. But I’m pretty sure that an external process (FRAM) could not trigger it unless you had configured the instance to be monitored (CF in your case), such as with RMI. Though as you indicate, you can also do it from the command line.
That said, I would think that if you can get it to respond to that thread dump request, you may well be able to get into the FR interface (in the instance, not FRAM) and take a thread dump there (under the Resources > All Threads page).
You mention finding OOM errors in the err log, but without a date/time stamp. What do you see in the out log? Again, I realize that in CF7 things may differ from later releases, but usually the out log DOES have a timestamp for the error. Note also that sometimes you see 2 OOM messages for each such error, one at the top of a bunch of messages and one at the bottom, and (in the out log, I mean) there is usually a timestamp on one of them.
Also, I don’t think the “connection closed” error is significant at all, at least in my experience. The simple question (if you’re saying that CF is becoming unresponsive) is what’s causing that. It isn’t unusual for it to be the JVM going nuts as it tries to shut down, and again the real question then is why that is happening. The thread dump may help, if in fact that “one running request” is indeed the culprit here. But in my experience it’s rare that any one request is the real root cause of a problem.
Instead, there is nearly always some other explanation, generally from some (or many, or thousands, or millions) of requests which have contributed to resource consumption over time (very often spiders and bots).
Again, a tool like FR can’t say “here’s the problem and the solution”, but it can provide the diagnostics needed to help. I’ll stress again the value of the FR resource log (in the FR/CF instance logs) to help know more about what sort of request activity was happening around (and before) the time of the problem. It also tracks the JVM heap use percent.
There are also various logs that track different uses of other parts of JVM memory (in the FR logs), but I do think they vary based on the JVM and I seem to recall seeing that they have more on JVM 1.5, and CF7 comes with JVM 1.4 (CF8/9/10 come with 1.6).
Finally, you don’t need to slog through this on your own. There are consulting services available from the folks behind FusionReactor (at cfconsultant.com, and mentioned also on the fusion-reactor.com pages.)
/charlie
From: fusion...@googlegroups.com [mailto:fusion...@googlegroups.com] On Behalf Of Joe Copley
Sent: Friday, November 23, 2012 4:07 PM
To: fusion...@googlegroups.com
Subject: Re: [fusionreactor] jrun connection closed errors
Hi Charlie: