[ISSUE] (TEPHRA-132) ThriftRPCServer can hang when primary transaction service loses leadership


Poorna Chandra (JIRA)

Sep 28, 2015, 7:34:21 PM
to tephr...@googlegroups.com
Poorna Chandra created an issue
 
Tephra / Bug TEPHRA-132
ThriftRPCServer can hang when primary transaction service loses leadership
Issue Type: Bug
Affects Versions: 0.6.0, 0.5.0, 0.4.0, 0.3.0, 0.2.0, 0.1.0
Assignee: Poorna Chandra
Components: core
Created: 28/Sep/15 4:32 PM
Fix Versions: 0.6.3
Priority: Major
Reporter: Poorna Chandra

When the primary transaction service loses leadership, it calls stop on the Thrift server. Under heavy connection load, the Thrift server can hang during stop, which prevents the leader from passing leadership to another transaction service process. As a result, the transaction service becomes unresponsive to clients.

Here is the sequence of events that can lead to this:

  • Because of the large number of connections, the AcceptThread of TThreadedSelectorServer blocks while trying to add a new connection to the accepted queue of a SelectorThread.
  • The SelectorThreads are waiting for some transaction operation to complete.
  • If the service loses leadership at this point, a call to stop the Thrift server is made.
  • The TThreadedSelectorServer.stop() method sets the stop flag to true and wakes up the selectors of the AcceptThread and the SelectorThreads.
  • On wakeup, each SelectorThread sees that the stop flag is true and exits without removing any more elements from its accepted queue.
  • The AcceptThread continues to block on the accepted queue, so the shutdown sequence of ThriftRPCServer cannot proceed. Leadership therefore remains with the current service, which has partially shut down, and the transaction service becomes unresponsive.
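
The sequence above can be reproduced in miniature with plain java.util.concurrent classes. The following is only an illustrative sketch and not Thrift or Tephra code: the class AcceptQueueHangSketch, the method acceptorExitsAfterStop, and the stop-aware offer loop are all hypothetical names and choices made for this example. It assumes a bounded ArrayBlockingQueue that is already full and a consumer that has exited without draining it, which is modeled here by never taking from the queue. The stop-aware variant is one possible way to avoid the hang and is not necessarily the change made for this issue.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, simplified model of the hang; not the Thrift implementation.
public class AcceptQueueHangSketch {

  /**
   * Simulates an acceptor thread adding a connection to a full accepted queue
   * whose consumer never drains it. Returns true if the acceptor thread has
   * exited within one second of the stop flag being set.
   */
  public static boolean acceptorExitsAfterStop(boolean stopAware) throws InterruptedException {
    final BlockingQueue<Integer> acceptedQueue = new ArrayBlockingQueue<>(1);
    final AtomicBoolean stopped = new AtomicBoolean(false);

    // Fill the queue: stands in for a selector that is busy and not draining.
    acceptedQueue.put(0);

    Thread acceptor = new Thread(() -> {
      try {
        if (stopAware) {
          // Assumed alternative: wait in bounded steps and re-check the stop flag.
          while (!stopped.get() && !acceptedQueue.offer(1, 50, TimeUnit.MILLISECONDS)) {
            // Queue is still full; loop and check the stop flag again.
          }
        } else {
          // Same call that appears in the stack trace: blocks until space frees up.
          acceptedQueue.put(1);
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    acceptor.setDaemon(true);
    acceptor.start();

    Thread.sleep(100);   // give the acceptor time to block on the full queue
    stopped.set(true);   // stop is requested; nobody ever takes from the queue

    acceptor.join(1000); // bounded join, so this sketch itself cannot hang
    boolean exited = !acceptor.isAlive();
    acceptor.interrupt(); // release the blocked thread in the non-stop-aware case
    return exited;
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("blocking put exits after stop: " + acceptorExitsAfterStop(false));
    System.out.println("stop-aware offer exits after stop: " + acceptorExitsAfterStop(true));
  }
}
```

Running main prints false for the blocking put and true for the stop-aware loop. In the real server the join on the AcceptThread is unbounded, which is why the ThriftRPCServer thread in the stack trace waits indefinitely.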

Stack trace when the Thrift server hangs:

"ThriftRPCServer" daemon prio=5 tid=0x00007fbb39157000 nid=0x6503 in Object.wait() [0x0000000115767000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x00000007af5ae380> (a org.apache.thrift.server.TThreadedSelectorServer$AcceptThread)
	at java.lang.Thread.join(Thread.java:1281)
	- locked <0x00000007af5ae380> (a org.apache.thrift.server.TThreadedSelectorServer$AcceptThread)
	at java.lang.Thread.join(Thread.java:1355)
	at org.apache.thrift.server.TThreadedSelectorServer.joinThreads(TThreadedSelectorServer.java:251)
	at org.apache.thrift.server.TThreadedSelectorServer.waitForShutdown(TThreadedSelectorServer.java:241)
	at org.apache.thrift.server.AbstractNonblockingServer.serve(AbstractNonblockingServer.java:94)
	at co.cask.tephra.rpc.ThriftRPCServer.run(ThriftRPCServer.java:210)
	at com.google.common.util.concurrent.AbstractExecutionThreadService$1$1.run(AbstractExecutionThreadService.java:52)
	at java.lang.Thread.run(Thread.java:745)

"Thread-5" daemon prio=5 tid=0x00007fbb3a81e000 nid=0x7303 waiting on condition [0x0000000115e7c000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000007af5ae3f8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
	at java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:324)
	at org.apache.thrift.server.TThreadedSelectorServer$SelectorThread.addAcceptedConnection(TThreadedSelectorServer.java:520)
	at org.apache.thrift.server.TThreadedSelectorServer$AcceptThread.doAddAccept(TThreadedSelectorServer.java:462)
	at org.apache.thrift.server.TThreadedSelectorServer$AcceptThread.handleAccept(TThreadedSelectorServer.java:433)
	at org.apache.thrift.server.TThreadedSelectorServer$AcceptThread.select(TThreadedSelectorServer.java:413)
	at org.apache.thrift.server.TThreadedSelectorServer$AcceptThread.run(TThreadedSelectorServer.java:375)

Priyanka Nambiar (JIRA)

Oct 5, 2015, 7:38:22 AM
to tephr...@googlegroups.com

Poorna Chandra (JIRA)

Oct 12, 2015, 6:25:21 PM
to tephr...@googlegroups.com

Poorna Chandra (JIRA)

Oct 12, 2015, 7:50:21 PM
to tephr...@googlegroups.com
Poorna Chandra resolved an issue as Fixed
Change By: Poorna Chandra
Status: Open → Resolved
Resolution: Fixed

Poorna Chandra (JIRA)

Nov 3, 2015, 5:31:21 PM
to tephr...@googlegroups.com
Poorna Chandra commented on an issue

Poorna Chandra (JIRA)

Nov 3, 2015, 7:22:25 PM
to tephr...@googlegroups.com

Poorna Chandra (JIRA)

Nov 4, 2015, 12:29:21 AM
to tephr...@googlegroups.com