Waiting on lock issue

Alex DeMarco

unread,

Oct 11, 2017, 8:45:17 AM10/11/17

to FusionReactor

We have been tracking an issue where all of the sudden all the web requests start slowing down with this showing in FR7:

100.0% Waiting on lock <0xadaf133> (a java.util.concurrent.Semaphore$FairSync)

And then suddenly it clears and the box takes off again. We are running CF 11 update 11 win 2008r2.

Anyone run into something like this? Any pointers? I have not been able to narrow down the root cause.

- Alex

Alex DeMarco

unread,

Oct 11, 2017, 8:50:08 AM10/11/17

to FusionReactor

One thing is that timing seems to coincide with garbage collection.

michael...@intergral.com

unread,

Oct 11, 2017, 11:42:17 AM10/11/17

to FusionReactor

Hi Alex,

I have seen before that servers can become unresponsive during garbage collection if a particular memory space has no available resource.

I would recommend that the following be checked:

Coldfusion-error log file. This would indicate a memory space is out of memory
Checking the garbage collection over an extended period of time to see if the rate of garbage collection is increasing
Checking if there is a large CPU spike when garbage collection occurs

If you post your findings here hopefully we may be able to track down the issue.

Hope this helps.

Michael Flewitt

FusionReactor Support Team

charlie arehart

unread,

Oct 13, 2017, 5:17:57 PM10/13/17

to fusion...@googlegroups.com

Alex, did you find anything?

If you are still searching, I think you may be surprised to hear what I am about to say: from my experience, requests in that state you mention (waiting for java.util.concurrent.Semaphore$FairSync) are not themselves anything to worry about or focus on. Why? Because those reflect QUEUED requests in CF. You want to instead find out if there are any RUNNING requests and find out why THEY are running., which can of course be for any of many reasons, including perhaps the GC issue Michael referred to. The great news is that FR can help you know down to the line of code what a request is doing, either currently or in the past.

Let me explain, and while you and Michael may already know much of this, I’ll say it for others reading along.

To clarify about queued requests, once the number of running requests reaches the value set in the CF Admin for “max simultaneous requests”, then CF will start queuing requests until the number running goes back to below that limit. Again, you would want to know why THOSE were running (not why others are queuing, as that’s secondary).

So I have several questions and suggestions that may help you diagnose things:

At the time you are or were having this issue, did you see FR showing any running requests (in the Requests>Activity page)? Or if CF has not been restarted since this happened, you can see a count of running and requests per second, for up to a week ago, on the Requests>Activity graph page. Just beware that for the week view, it shows averages over an hour, and you could lose the fidelity of a brief moment of running requests, leading to queuing
Better, if you want to the FR logs, and the resource log, you could see if the 3^rd value after REQ column was non-zero. If it is and reaches the “max simult requests” value in the Admin, then you would have queued requests, and again you’d want to know what the ones running were doing.
As for the running requests, and what they are doing, if you could catch the requests while running, you could see them in the Requests>Activity page and dig into their details (to see if they are waiting for DB calls, or other “relations” shown there, or CPU time), and while running you can use a stack trace to see what it’s doing down to the java method. Finally, for running or completed requests, if you have FR Ultimate you can also use the “profiler” to see where the most time was spent, down to the Java method. (And yep, as Michael said, it could be that requests are slow due to GCs, but there can of course be many, many other reasons why requests are slow, and sometimes intermittently, either unique to that request, or that app, or the DB its talking to, or a cfhttp to some other server, and so on.)
Even if the request was in the past, you can see what’s going on for requests that are running if you have setup the FR Crash Protection email alert, especially the “request quantity” alert, which fires when more than x requests have been running throughout a period of y seconds”. In that alert, you would see the URLs of what was running, and their IP address, and duration, and even any query they may be running, and more. But most important (if it’s not running a long-running query), you can see for each request the thread dump of requests which includes a stack trace of what it’s doing. I have a video on youtube and past FR webinars where I show connecting those dots.
Indeed, you refer to this java method you saw the requests running. So how did you see that? From an alert? From a stack track of a single running request? From a thread dump of all requests (using “stack trace all” on the Requests>Activity page, or in Resources>Threads)?
And note that FR DOES track if requests are queued in CF. Sadly, it doesn’t get tracked in the FR UI currently (not even on the Metrics>CF Metrics page where there is a graph about it), at least not unless you are on CF Enterprise with “Start Monitoring” enabled in the CF Enterprise Server Monitor). I have filed a feature request for that, because I know that CF does track that information, for all versions, and FR could get it.
Indeed, I’m pretty sure I’ve seen that the FR realtimestats.log column for the count of requests running AND queued is tracked, regardless of that. You could look back in history in those logs to see if (at the time you reported the below) it confirmed a) that there were queued requests, and b) if there were running requests. (But like I said above, you can easily see the count of RUNNING requests in the resource log, also. The resource log applies to all java servers that FR can monitor, but the realtimestats.log gets info only from CF.)
One other thing you may find, especially if you have FR CP alerts showing when this happened: the requests that you found in this queued state (running that java.util.concurrent.Semaphore$FairSync method) will probably be the most recently running requests in the list of running requests at the top.
I will add also that currently FR does not distinguish running vs queued requests in the Requests>Activity page or the Metrics>Web Metrics. Same for the list of “running requests” at the top of a CP alert. So you may find those suggest you have 100 or more “running” requests even though your CF admin “max simult requests” value is like 25 or 50. Again, I am hoping that in the future FR would better distinguish such queued requests, as others have been confused by this.

Most important, I’ll be interested to hear if this may explain what you are seeing, or if you feel you are seeing something else. And even if you would say you had already solved your problem (and if so, please tell us what it was), I do hope again that the info above may help some readers.

/charlie

From: fusion...@googlegroups.com [mailto:fusion...@googlegroups.com] On Behalf Of michael...@intergral.com
Sent: Wednesday, October 11, 2017 10:42 AM
To: FusionReactor <fusion...@googlegroups.com>
Subject: [fusionreactor] Re: Waiting on lock issue

Hi Alex,

I have seen before that servers can become unresponsive during garbage collection if a particular memory space has no available resource.

<snip>

DeMarco, Alex

unread,

Oct 16, 2017, 7:34:27 AM10/16/17

to fusion...@googlegroups.com

Yes, this is what we found out.

We are using FR with Weblogic 11g. Our developers had a nightly indexing job that was “running fine for a year”. They had constantly told us that they had made no code changes at all. However, what they failed to mention was that they increased the number of nightly indexes they were performing. In the past it was only 4 however, recently they doubled it to 8. When I say indexing they are pulling various artifacts out of an Oracle database and then populating 8 SOLR indexes.

Our JDBC configuration was set to time out idle connections at 2 minutes.

What was happening is that the additional indexes were pushing the idle time over 2 minutes and oracle was coming along and hanging up the connection and treating it as if it was leaked or that the code was missing the close statement. Of course I was chasing this issue without knowing the # of indexes had increased.

If I remove the jdbc timeout and set it to zero the issue goes away. However, I am not comfortable in doing this (am I wrong is this ??)

I have not had time yet to sit down and review exactly how the code runs this process.

- Alex

--
You received this message because you are subscribed to the Google Groups "FusionReactor" group.
To unsubscribe from this group and stop receiving emails from it, send an email to fusionreacto...@googlegroups.com.
To post to this group, send email to fusion...@googlegroups.com.
Visit this group at https://groups.google.com/group/fusionreactor.
For more options, visit https://groups.google.com/d/optout.

charlie arehart

unread,

Oct 16, 2017, 3:27:38 PM10/16/17

to fusion...@googlegroups.com

Well, this just raises more questions (to me). :-)

1) Let’s clarify first: were these “indexing” jobs running in CF (since they grab Oracle data and populate a Solr index)? If so, are you saying you were finding that the indexing jobs were taking longer simply because there are twice as many operations?

2) Second, you start out saying “yes, this is what we found out”. So was that referring to the request queuing, which I was saying was what you were really seeing in the stack trace you offered? And if you were agreeing that yes, those were queued requests, the second point I was making is that queuing happens because some requests were hanging.

3) So have you confirmed that the indexing jobs were taking longer? And assuming they are CF tasks, is it one request doing more or is each its own request? If the latter, do they run concurrently or consecutively?

4) Either way, it could instead be simply that the increase in the number of indexing jobs (consecutively or concurrently) simply led to a longer total duration of time for the indexing operations, during which the DB was locking records (for concurrency and reliability), and thus causing contention for OTHER requests (from CF) awaiting that contention. So for instance, if the indexing jobs read many records, perhaps they were causing contention with CF requests wanting to update records that were being read by the indexing job.

5) Finally, one reason I’m pressing this is you say, “the additional indexes were pushing the idle time over 2 minutes”. Well, “idle” time for a CF datasource definition refers to the duration of time a connection thread is detected to “not be doing anything anymore” and so is returned to the connection pool.

But since you are implying that “adding index jobs” means doing more work, I’m not seeing readily how that would lead to increased idle time?

Hope something there is helpful.

/charlie

From: fusion...@googlegroups.com [mailto:fusion...@googlegroups.com] On Behalf Of DeMarco, Alex
Sent: Monday, October 16, 2017 06:34 AM
To: fusion...@googlegroups.com
Subject: RE: [fusionreactor] Re: Waiting on lock issue

Yes, this is what we found out.

We are using FR with Weblogic 11g. Our developers had a nightly indexing job that was “running fine for a year”. They had constantly told us that they had made no code changes at all. However, what they failed to mention was that they increased the number of nightly indexes they were performing. In the past it was only 4 however, recently they doubled it to 8. When I say indexing they are pulling various artifacts out of an Oracle database and then populating 8 SOLR indexes.

Our JDBC configuration was set to time out idle connections at 2 minutes.

What was happening is that the additional indexes were pushing the idle time over 2 minutes and oracle was coming along and hanging up the connection and treating it as if it was leaked or that the code was missing the close statement. Of course I was chasing this issue without knowing the # of indexes had increased.

If I remove the jdbc timeout and set it to zero the issue goes away. However, I am not comfortable in doing this (am I wrong is this ??)

I have not had time yet to sit down and review exactly how the code runs this process.

Alex

charlie arehart

unread,

Oct 17, 2017, 2:35:16 PM10/17/17

to fusion...@googlegroups.com

Alex?

/charlie

From: fusion...@googlegroups.com [mailto:fusion...@googlegroups.com] On Behalf Of charlie arehart
Sent: Monday, October 16, 2017 02:28 PM
To: fusion...@googlegroups.com
Subject: RE: [fusionreactor] Re: Waiting on lock issue

Well, this just raises more questions (to me). :-)

<snip>

DeMarco, Alex

unread,

Oct 17, 2017, 2:51:12 PM10/17/17

to fusion...@googlegroups.com

Sorry for the delay but have had other issues going on.

So again this is NOT coldfusion code or solr within coldfusion.

This is a JAVA based application.

The index processes are running in Java to pull records from oracle and push to SOLR. They recently doubled the number of indexes they were creating/rebuilding.

The job in question has the ability to run each index separately or ALL. If we do them separately it completes without issue. If we do them all, it eventually failes with a connection is closed error. If I set the timeout to zero the job always completes..

They are also using hibernate which is what is throwing the connection is closed error.

It seems that when the process starts it binds to a jdbc connection, and then when the last record is pushed to SOLR, they complete the transaction and try to close the connection. When they do them separately it works fine. But if they do them ALL then the transaction goes to finish and can’t because the jdbc connection it had, was closed by Oracle due to the 120second timeout value.

- Alex

From: fusion...@googlegroups.com [mailto:fusion...@googlegroups.com] On Behalf Of charlie arehart
Sent: Monday, October 16, 2017 3:28 PM
To: fusion...@googlegroups.com
Subject: RE: [fusionreactor] Re: Waiting on lock issue

Well, this just raises more questions (to me). :-)

1) Let’s clarify first: were these “indexing” jobs running in CF (since they grab Oracle data and populate a Solr index)? If so, are you saying you were finding that the indexing jobs were taking longer simply because there are twice as many operations?

2) Second, you start out saying “yes, this is what we found out”. So was that referring to the request queuing, which I was saying was what you were really seeing in the stack trace you offered? And if you were agreeing that yes, those were queued requests, the second point I was making is that queuing happens because some requests were hanging.

3) So have you confirmed that the indexing jobs were taking longer? And assuming they are CF tasks, is it one request doing more or is each its own request? If the latter, do they run concurrently or consecutively?

4) Either way, it could instead be simply that the increase in the number of indexing jobs (consecutively or concurrently) simply led to a longer total duration of time for the indexing operations, during which the DB was locking records (for concurrency and reliability), and thus causing contention for OTHER requests (from CF) awaiting that contention. So for instance, if the indexing jobs read many records, perhaps they were causing contention with CF requests wanting to update records that were being read by the indexing job.

5) Finally, one reason I’m pressing this is you say, “the additional indexes were pushing the idle time over 2 minutes”. Well, “idle” time for a CF datasource definition refers to the duration of time a connection thread is detected to “not be doing anything anymore” and so is returned to the connection pool.

But since you are implying that “adding index jobs” means doing more work, I’m not seeing readily how that would lead to increased idle time?

Hope something there is helpful.

/charlie

From: fusion...@googlegroups.com [mailto:fusion...@googlegroups.com] On Behalf Of DeMarco, Alex
Sent: Monday, October 16, 2017 06:34 AM
To: fusion...@googlegroups.com
Subject: RE: [fusionreactor] Re: Waiting on lock issue

Yes, this is what we found out.

We are using FR with Weblogic 11g. Our developers had a nightly indexing job that was “running fine for a year”. They had constantly told us that they had made no code changes at all. However, what they failed to mention was that they increased the number of nightly indexes they were performing. In the past it was only 4 however, recently they doubled it to 8. When I say indexing they are pulling various artifacts out of an Oracle database and then populating 8 SOLR indexes.

Our JDBC configuration was set to time out idle connections at 2 minutes.

What was happening is that the additional indexes were pushing the idle time over 2 minutes and oracle was coming along and hanging up the connection and treating it as if it was leaked or that the code was missing the close statement. Of course I was chasing this issue without knowing the # of indexes had increased.

If I remove the jdbc timeout and set it to zero the issue goes away. However, I am not comfortable in doing this (am I wrong is this ??)

I have not had time yet to sit down and review exactly how the code runs this process.

- Alex

charlie arehart

unread,

Oct 17, 2017, 3:53:50 PM10/17/17

to fusion...@googlegroups.com

OK, thanks for clarifying. And no problem on the delay. Stuff happens. It was just a friendly ping, as I am interested to see this resolved with you.

(Before proceeding to the substantive discussion of the issue, I would note that if you may mean to imply that it should have been obvious this “was not CF” because you said “We are using FR with Weblogic 11g”, I’ll just note that one can of course run CFML and CF on WLS and indeed most any JEE server. And as you acknowledge here, CF does also bundle Solr. And I think in the past you had been working with and talking here about CF, so I just all the more naturally assumed it.)

Anyway, the good news is that the concepts apply to any JEE server (and for any pure JEE folks here, note that CF is in fact by default a deployment of a CFML engine on Tomcat).

And yep, thanks for the specifics. So yep, this seems to be a misalignment of the Oracle sessiontimeout (the 120 seconds: would you agree that is the Oracle session timeout) and this jdbc connection pool processing. What’s odd is that again, you would not be seeming to hit an IDLE timeout because your thread is not idle. Are you 100% positive it’s the idle timeout that you are setting to 0 on the WLS side? This feels more like a timeout of how long a connection can remain open (which is not the idle time), especially because you say it’s more likely to be hit the longer the request runs (talking to the DB). But I’ll admit there could be some subtlety here I’m not seeing or that’s not clear from what you’ve shared.

And whichever timeout you say you’re setting, are you doing it in code or configuration? (JEE servers support defining the equivalent of CF datasources, as Admin-level definitions of such JDBC connection info, or of course you can do it in code on the individual JDBC connection.)

I appreciate that the fact you can “set the timeout to 0” is a workaround, and you may be inclined to leave it at that. But if you are perhaps interested in understanding the problem, I am also, and perhaps others would learn from whatever is discovered here.

/charlie

From: fusion...@googlegroups.com [mailto:fusion...@googlegroups.com] On Behalf Of DeMarco, Alex
Sent: Tuesday, October 17, 2017 01:51 PM
To: fusion...@googlegroups.com
Subject: RE: [fusionreactor] Re: Waiting on lock issue

Sorry for the delay but have had other issues going on.

So again this is NOT coldfusion code or solr within coldfusion.

This is a JAVA based application.

The index processes are running in Java to pull records from oracle and push to SOLR. They recently doubled the number of indexes they were creating/rebuilding.

The job in question has the ability to run each index separately or ALL. If we do them separately it completes without issue. If we do them all, it eventually failes with a connection is closed error. If I set the timeout to zero the job always completes..

They are also using hibernate which is what is throwing the connection is closed error.

It seems that when the process starts it binds to a jdbc connection, and then when the last record is pushed to SOLR, they complete the transaction and try to close the connection. When they do them separately it works fine. But if they do them ALL then the transaction goes to finish and can’t because the jdbc connection it had, was closed by Oracle due to the 120second timeout value.

Alex

charlie arehart

unread,

Oct 19, 2017, 3:07:08 PM10/19/17

to fusion...@googlegroups.com

Alex?

/charlie

From: fusion...@googlegroups.com [mailto:fusion...@googlegroups.com] On Behalf Of charlie arehart
Sent: Tuesday, October 17, 2017 02:54 PM
To: fusion...@googlegroups.com
Subject: RE: [fusionreactor] Re: Waiting on lock issue

OK, thanks for clarifying. And no problem on the delay. Stuff happens. It was just a friendly ping, as I am interested to see this resolved with you.

(Before proceeding to the substantive discussion of the issue, I would note that if you may mean to imply that it should have been obvious this “was not CF” because you said “We are using FR with Weblogic 11g”, I’ll just note that one can of course run CFML and CF on WLS and indeed most any JEE server. And as you acknowledge here, CF does also bundle Solr. And I think in the past you had been working with and talking here about CF, so I just all the more naturally assumed it.)

Anyway, the good news is that the concepts apply to any JEE server (and for any pure JEE folks here, note that CF is in fact by default a deployment of a CFML engine on Tomcat).

And yep, thanks for the specifics. So yep, this seems to be a misalignment of the Oracle sessiontimeout (the 120 seconds: would you agree that is the Oracle session timeout) and this jdbc connection pool processing. What’s odd is that again, you would not be seeming to hit an IDLE timeout because your thread is not idle. Are you 100% positive it’s the idle timeout that you are setting to 0 on the WLS side? This feels more like a timeout of how long a connection can remain open (which is not the idle time), especially because you say it’s more likely to be hit the longer the request runs (talking to the DB). But I’ll admit there could be some subtlety here I’m not seeing or that’s not clear from what you’ve shared.

And whichever timeout you say you’re setting, are you doing it in code or configuration? (JEE servers support defining the equivalent of CF datasources, as Admin-level definitions of such JDBC connection info, or of course you can do it in code on the individual JDBC connection.)

I appreciate that the fact you can “set the timeout to 0” is a workaround, and you may be inclined to leave it at that. But if you are perhaps interested in understanding the problem, I am also, and perhaps others would learn from whatever is discovered here.

/charlie

From: fusion...@googlegroups.com [mailto:fusion...@googlegroups.com] On Behalf Of DeMarco, Alex
Sent: Tuesday, October 17, 2017 01:51 PM
To: fusion...@googlegroups.com
Subject: RE: [fusionreactor] Re: Waiting on lock issue

Sorry for the delay but have had other issues going on.

<snip>

DeMarco, Alex

unread,

Oct 19, 2017, 3:19:15 PM10/19/17

to fusion...@googlegroups.com

Ok to be clear ( I may have misspoke before) it is the “Inactive Connection Timeout” value. if I set it to zero which is never timeout the issue goes away. If I set the timeout for longer that the entire process takes to complete (20 minutes) the issue goes away.

I am setting in the config for the datasource in question via the Weblogic Admin Console.

Unfortunately I have had a number of other production level issues that needed attention this week and have not had any time work this issue or respond to this thread in a timely fashion, since implementing the workaround.

From the Weblogic Admin Help:

Inactive Connection Timeout:

The number of inactive seconds on a reserved connection before WebLogic Server reclaims the connection and releases it back into the connection pool.

You can use the Inactive Connection Timeout feature to reclaim leaked connections - connections that were not explicitly closed by the application. Note that this feature is not intended to be used in place of properly closing connections.

- Alex

From: fusion...@googlegroups.com [mailto:fusion...@googlegroups.com] On Behalf Of charlie arehart
Sent: Tuesday, October 17, 2017 3:54 PM
To: fusion...@googlegroups.com
Subject: RE: [fusionreactor] Re: Waiting on lock issue

OK, thanks for clarifying. And no problem on the delay. Stuff happens. It was just a friendly ping, as I am interested to see this resolved with you.

(Before proceeding to the substantive discussion of the issue, I would note that if you may mean to imply that it should have been obvious this “was not CF” because you said “We are using FR with Weblogic 11g”, I’ll just note that one can of course run CFML and CF on WLS and indeed most any JEE server. And as you acknowledge here, CF does also bundle Solr. And I think in the past you had been working with and talking here about CF, so I just all the more naturally assumed it.)

Anyway, the good news is that the concepts apply to any JEE server (and for any pure JEE folks here, note that CF is in fact by default a deployment of a CFML engine on Tomcat).

And yep, thanks for the specifics. So yep, this seems to be a misalignment of the Oracle sessiontimeout (the 120 seconds: would you agree that is the Oracle session timeout) and this jdbc connection pool processing. What’s odd is that again, you would not be seeming to hit an IDLE timeout because your thread is not idle. Are you 100% positive it’s the idle timeout that you are setting to 0 on the WLS side? This feels more like a timeout of how long a connection can remain open (which is not the idle time), especially because you say it’s more likely to be hit the longer the request runs (talking to the DB). But I’ll admit there could be some subtlety here I’m not seeing or that’s not clear from what you’ve shared.

And whichever timeout you say you’re setting, are you doing it in code or configuration? (JEE servers support defining the equivalent of CF datasources, as Admin-level definitions of such JDBC connection info, or of course you can do it in code on the individual JDBC connection.)

I appreciate that the fact you can “set the timeout to 0” is a workaround, and you may be inclined to leave it at that. But if you are perhaps interested in understanding the problem, I am also, and perhaps others would learn from whatever is discovered here.

/charlie

From: fusion...@googlegroups.com [mailto:fusion...@googlegroups.com] On Behalf Of DeMarco, Alex
Sent: Tuesday, October 17, 2017 01:51 PM
To: fusion...@googlegroups.com
Subject: RE: [fusionreactor] Re: Waiting on lock issue

Sorry for the delay but have had other issues going on.

So again this is NOT coldfusion code or solr within coldfusion.

This is a JAVA based application.

The index processes are running in Java to pull records from oracle and push to SOLR. They recently doubled the number of indexes they were creating/rebuilding.

The job in question has the ability to run each index separately or ALL. If we do them separately it completes without issue. If we do them all, it eventually failes with a connection is closed error. If I set the timeout to zero the job always completes..

They are also using hibernate which is what is throwing the connection is closed error.

It seems that when the process starts it binds to a jdbc connection, and then when the last record is pushed to SOLR, they complete the transaction and try to close the connection. When they do them separately it works fine. But if they do them ALL then the transaction goes to finish and can’t because the jdbc connection it had, was closed by Oracle due to the 120second timeout value.

- Alex

charlie arehart

unread,

Oct 20, 2017, 6:21:26 AM10/20/17

to fusion...@googlegroups.com

Interesting, as that is indeed effectively an idle connection timeout. So it seems odd that setting it to 0 or “longer than the process takes” would make it go away.

But I hear what you’re saying: for now, it works, and that’s what matters. I’ll let it go also.

/charlie

From: fusion...@googlegroups.com [mailto:fusion...@googlegroups.com] On Behalf Of DeMarco, Alex
Sent: Thursday, October 19, 2017 02:19 PM
To: fusion...@googlegroups.com
Subject: RE: [fusionreactor] Re: Waiting on lock issue

Ok to be clear ( I may have misspoke before) it is the “Inactive Connection Timeout” value. if I set it to zero which is never timeout the issue goes away. If I set the timeout for longer that the entire process takes to complete (20 minutes) the issue goes away.

<snip>

Reply all

Reply to author

Forward