[cas-user] how does Inspektr inability to write to RDBMS cause SWF conversation deadlock?

54 views
Skip to first unread message

Andrew Petro

unread,
Jun 16, 2011, 9:53:24 AM6/16/11
to cas-...@lists.jasig.org
Kim,

Inspektr inability to insert records causing actions undertaken as part
of Spring Web Flow conversations to hang (causing subsequent requests to
access those conversations to wait on the lock for the conversation,
thereby tying up Tomcat request processing threads out to exhaustion and
thereby causing the CAS server to become nonresponsive) is an
interesting hypothesis.

However, I'm wondering why Inspektr's use of asynchronous database
access doesn't prevent this from tying up any conversations?

Inspektr's JDBC-using audit trail manager:

https://github.com/dima767/inspektr/blob/inspektr-1.0.0.GA/inspektr-support-spring/src/main/java/com/github/inspektr/audit/support/JdbcAuditTrailManager.java

uses

/** ExecutorService that has one thread to asynchronously save requests. */
private final ExecutorService executorService =
Executors.newSingleThreadExecutor();

to execute the logging to the database in a single asynchronous thread.


Since JdbcAuditTrailManager will in practice be a singleton, singly
instantiated from auditTrailContext.xml , and since it instantiates a
single thread ExecutorService once, there should be a singleton such
ExecutorService such that at any given time only one thread will be
working on writing Inspektr audit log entries to the database.

Is the hypothesis that these threads terminate abnormally without
returning their database connection to the pool? That's at least plausible:

http://download.oracle.com/javase/1.5.0/docs/api/java/util/concurrent/Executors.html#newSingleThreadExecutor()

I get how that could exhaust the database connection pool (sort of;
shouldn't it be testing these connections?), but I guess I still don't
see how this locks any conversations, since the database access is
happening in a separate thread from that operating on the conversation.
Mostly, audit trail logging should be able to fail in arbitrarily
horrible ways and only have the consequence of failing to log the audit
trail to the database. [1]

Is this a matter of clean()

public void clean() {
this.transactionTemplate.execute(new
TransactionCallbackWithoutResult() { ...


not using that executor, such that clean() happens synchronously? (What
calls clean(), anyway? That happening asynchronously in its own thread?
Looks like doubly an exercise left to the reader -- firstly to get
clean() to be called, and secondly default clean select SQL will select
no records to be cleaned.) I don't see offhand why clean() would be
tying up the conversation -- if it's being called, how is that happening
in a thread holding a SWF conversation?


Andrew


[1] (Maybe that's a disturbing thought, having a CAS server happily
churning along servicing requests and *not* logging the audit trail as
intended?)

On 02/14/2011 05:06 PM, Cary, Kim wrote:
> Math, hi.
>
> Our stuff is running on OSX server. I've seen several mysql databases (we put some log entries into mysql) stop accepting new entries when the db gets to the 10Gb mark. No limit imposed by us, just empirically observed.
>
> My wild guess is that maybe the inability to insert new records at that point caused a deadlock.
>
> We're using queries for the audit data from Inspektr. Saved queries are handy. There is no built in managment gui for the data, but I imagine Crystal Reports would be handy.
>
> Best,
> KC
>
> On Feb 14, 2011, at 11:38 AM, Matthew Selwood wrote:
>
>> Hi Kim,
>>
>> I�m in the process of implemented Inspektr on our CAS build here. What 10Gb threshold are you referred to? Is this a DB limit you�ve put in place?
>>
>> On a similar note, is there a �user friendly� way of viewing the database audit and stats, or do you just write queries?
>>
>> Thanks,
>> Matt
>>
>>
>> -------------------------------
>> Matthew Selwood
>> Programmer Analyst � Web Services
>> University of Victoria
>> 250.472.5565 | sel...@uvic.ca | CLE C006
>>
>> On 2/13/11 9:36 PM, "Cary, Kim"<Kim....@pepperdine.edu> wrote:
>>
>> Our CAS service came up unresponsive (connection timed out) after throwing about 6 of these errors tonite.
>>
>> Google doesn't have much on this, but one case shows some involvement with c3po failing to talk to the database. We use inspektr and our audit database hit a 10Gb threshold tonite. Restarting tomcat without making any other changes put the service back into operation.
>>
>> Can anyone share any clues on pursuing the root cause of the outage?
>>
>> Thanks,
>> KC
>>
>> Feb 13, 2011 7:12:03 PM org.apache.catalina.core.StandardWrapperValve invoke
>> SEVERE: Servlet.service() for servlet cas threw exception
>> org.springframework.webflow.conversation.impl.LockTimeoutException: Unable to acquire conversation lock after 30 seconds
>> at org.springframework.webflow.conversation.impl.JdkConcurrentConversationLock.lock(JdkConcurrentConversationLock.java:44)
>> at org.springframework.webflow.conversation.impl.ContainedConversation.lock(ContainedConversation.java:69)
>> at org.springframework.webflow.execution.repository.support.ConversationBackedFlowExecutionLock.lock(ConversationBackedFlowExecutionLock.java:51)
>> at org.springframework.webflow.executor.FlowExecutorImpl.resumeExecution(FlowExecutorImpl.java:160)
>> at org.springframework.webflow.mvc.servlet.FlowHandlerAdapter.handle(FlowHandlerAdapter.java:183)
>> at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:771)
>> at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:716)
>> at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:647)
>> at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:563)
>> at javax.servlet.http.HttpServlet.service(HttpServlet.java:637)
>> at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
>> at org.jasig.cas.web.init.SafeDispatcherServlet.service_aroundBody2(SafeDispatcherServlet.java:115)
>> at org.jasig.cas.web.init.SafeDispatcherServlet.service_aroundBody3$advice(SafeDispatcherServlet.java:44)
>> at org.jasig.cas.web.init.SafeDispatcherServlet.service(SafeDispatcherServlet.java:1)
>> at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
>> at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>> at com.github.inspektr.common.web.ClientInfoThreadLocalFilter.doFilter(ClientInfoThreadLocalFilter.java:46)
>> at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>> at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>> at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>> at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>> at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
>> at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>> at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:568)
>> at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>> at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
>> at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
>> at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
>> at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
>> at java.lang.Thread.run(Thread.java:613)
>> Feb 13, 2011 7:12:14 PM org.apache.catalina.core.StandardWrapperValve invoke
>> SEVERE: Servlet.service() for servlet cas threw exception
>> org.springframework.webflow.conversation.impl.LockTimeoutException: Unable to acquire conversation lock after 30 seconds
>> at org.springframework.webflow.conversation.impl.JdkConcurrentConversationLock.lock(JdkConcurrentConversationLock.java:44)
>> at org.springframework.webflow.conversation.impl.ContainedConversation.lock(ContainedConversation.java:69)
>> at org.springframework.webflow.execution.repository.support.ConversationBackedFlowExecutionLock.lock(ConversationBackedFlowExecutionLock.java:51)
>> at org.springframework.webflow.executor.FlowExecutorImpl.resumeExecution(FlowExecutorImpl.java:160)
>> at org.springframework.webflow.mvc.servlet.FlowHandlerAdapter.handle(FlowHandlerAdapter.java:183)
>> at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:771)
>> at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:716)
>> at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:647)
>> at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:563)
>> at javax.servlet.http.HttpServlet.service(HttpServlet.java:637)
>> at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
>> at org.jasig.cas.web.init.SafeDispatcherServlet.service_aroundBody2(SafeDispatcherServlet.java:115)
>> at org.jasig.cas.web.init.SafeDispatcherServlet.service_aroundBody3$advice(SafeDispatcherServlet.java:44)
>> at org.jasig.cas.web.init.SafeDispatcherServlet.service(SafeDispatcherServlet.java:1)
>> at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
>> at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>> at com.github.inspektr.common.web.ClientInfoThreadLocalFilter.doFilter(ClientInfoThreadLocalFilter.java:46)
>> at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>> at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>> at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>> at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>> at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
>> at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>> at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:568)
>> at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>> at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
>> at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
>> at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
>> at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
>> at java.lang.Thread.run(Thread.java:613)
>>
>>
>> --
>> You are currently subscribed to cas-...@lists.jasig.org as: sel...@uvic.ca
>> To unsubscribe, change settings or access archives, see http://www.ja-sig.org/wiki/display/JSG/cas-user
>>
>>
>>
>>
>


--
You are currently subscribed to cas-...@lists.jasig.org as: jasig-cas-user...@googlegroups.com
To unsubscribe, change settings or access archives, see http://www.ja-sig.org/wiki/display/JSG/cas-user

Marvin Addison

unread,
Jun 16, 2011, 10:48:32 AM6/16/11
to cas-...@lists.jasig.org
> Inspektr inability to insert records causing actions undertaken as part of
> Spring Web Flow conversations to hang (causing subsequent requests to access
> those conversations to wait on the lock for the conversation, thereby tying
> up Tomcat request processing threads out to exhaustion and thereby causing
> the CAS server to become nonresponsive) is an interesting hypothesis.

And I think you've ruled it out by pointing out that the write is
driven by Executors.newSingleThreadExecutor(). Even if the writing
thread were to die abnormally as you noted, the new replacement thread
would be allocated from a thread pool separate from the Tomcat
connector thread pool.

> Is the hypothesis that these threads terminate abnormally without returning
> their database connection to the pool?

I believe there are cases in the JVM where threads can terminate in
such a way that a finally block associated with code executing in a
catch block will not be executed on termination. This could create a
resource leak condition such as connection pool exhaustion in our
case. There would be clear evidence of this condition in the logs;
I'm certain it would be logged for com.mchange.v2 at DEBUG and other
connection pool libraries are likely similar. You should turn up
logging on your pooling library and monitor if you have any suspicion
about that case, but I think the likelihood is approaching zero.

Frankly I think it's a red herring attempting to associate SWF
conversation locks with database and threading matters. I believe
some SWF code review is in order as a next step, but honestly I don't
think there's enough evidence to justify it at present. At the next
report I'll be happy to dive in. If there are any other occurrences
of this problem that haven't been reported, please speak up.

M

Andrew Petro

unread,
Jun 16, 2011, 12:15:52 PM6/16/11
to cas-...@lists.jasig.org
> At the next report I'll be happy to dive in. If there are any other
occurrences
> of this problem that haven't been reported, please speak up.

My current interest stems from this SWF conversation deadlock having
apparently caused a production CAS server to become unresponsive on
Tuesday, requiring Tomcat restart to resolve (which did resolve).

I don't have any interesting insights about the problem to share so far.

Andrew

Scott Battaglia

unread,
Jun 16, 2011, 12:20:47 PM6/16/11
to cas-...@lists.jasig.org
At Rutgers a number of years ago we actually disabled SWF locking because it had generated a deadlock.  That might have been on SWF 1.x though.  I'm far removed from my time at RU now though :-)


On Thu, Jun 16, 2011 at 12:15 PM, Andrew Petro <ape...@unicon.net> wrote:
> At the next report I'll be happy to dive in. If there are any other occurrences
> of this problem that haven't been reported, please speak up.

My current interest stems from this SWF conversation deadlock having apparently caused a production CAS server to become unresponsive on Tuesday, requiring Tomcat restart to resolve (which did resolve).

I don't have any interesting insights about the problem to share so far.

Andrew




--
You are currently subscribed to cas-...@lists.jasig.org as: scott.b...@gmail.com

To unsubscribe, change settings or access archives, see http://www.ja-sig.org/wiki/display/JSG/cas-user

Marvin Addison

unread,
Jun 16, 2011, 12:33:54 PM6/16/11
to cas-...@lists.jasig.org
> At Rutgers a number of years ago we actually disabled SWF locking because it
> had generated a deadlock.

Deadlock feels exactly right here, but the mechanism of getting in a
deadlock state is mysterious. I would encourage everyone that's
experienced this issue to turn up
org.springframework.webflow.conversation to DEBUG in the hopes that it
will produce some meaningful data if/when it happens again.

M

Andrew Petro

unread,
Jun 20, 2011, 4:35:06 PM6/20/11
to cas-...@lists.jasig.org
Ok. I think I can make this exception happen by introducing a custom
AuthenticationHandler that simply hangs forever.

At least, I started getting this in a terribly messy log at the TRACE
level. ootb cas-server 3.4.8 with only the custom loop-forever
authentication handler introduced.

[

2011-06-20 16:13:02,558 TRACE
[org.jasig.cas.web.NoSuchFlowExecutionExceptionResolver] - Entering
method [resolveException with arguments
[[org.apache.catalina.connector.RequestFacade@5889949a,
org.apache.catalina.connector.ResponseFacade@307b37df,
[FlowHandlerMapping.DefaultFlowHandler@da3a52c],
org.springframework.webflow.conversation.impl.LockTimeoutException:
Unable to acquire conversation lock after 30 seconds]]
2011-06-20 16:13:02,558 TRACE
[org.jasig.cas.web.NoSuchFlowExecutionExceptionResolver] - Leaving
method [resolveException] with return value [null].

]

Looks to me like it only hangs one thread and no other threads manage to
even get to the authentication handler, though I submitted a handful of
login form posts from a couple of browsers.

I'll gin up a quick HangsJustOneThreadForeverAuthenticationHandler to
prove that it only takes one.

Andrew

>> --
>> You are currently subscribed to cas-...@lists.jasig.org as: sel...@uvic.ca

Andrew Petro

unread,
Jun 20, 2011, 5:41:37 PM6/20/11
to cas-...@lists.jasig.org
Yup. It only takes one AuthenticationHandler hung thread to block
further requests from even trying to authenticate.

Opened a JIRA and attached a simple Maven Overlay demonstrating the issue.

https://issues.jasig.org/browse/CAS-993

I'll move discussion of what to do about this over to cas-dev.

This tends to support a hypothesis that when this conversation deadlock
happened last week, it was an AuthenticationHandler that had started
responding really slowly, thereby hanging on to this lock for a long time.

Andrew

Andrew Petro

unread,
Jun 20, 2011, 9:22:54 PM6/20/11
to cas-...@lists.jasig.org
Sorry, it's more complicated than I'd thought. An embarrassing
synchronization bug in my example made the consequences of the problem
worse. One hung AuthenticationHandler isn't sufficient to deadlock the
CAS web flow for everyone.

Poking at it further...

Andrew

Reply all
Reply to author
Forward
0 new messages