Concurrent RestHelper requests cause freeze / maybe deadlock

29 views
Skip to first unread message

Stefan

unread,
Apr 11, 2016, 5:43:19 PM4/11/16
to Lift
Hello Lift people,

I've been using Lift in a project over the last six months and really enjoyed working with it! Thanks for the great work you've done on it.


Today I've run into a problem that I still couldn't figure out a solution for. A quick glance at the situation:

I'm using Lift as a backend for a Cordova HTML5 app and a JSON based REST API to bring the two together. Among others, the app user can retrieve a list of messages and a list of appointments, which are represented by rather simple MongoRecord classes.

My RestDispatcher directs the requests through these methods:

object RestDispatcher extends RestHelper with Logger {

  serve
{

    case req if req.requestType.options_? => OkResponse()

    // login functionality
    case ("api" :: "login" :: Nil) JsonPost (data -> _) =>
    /* ... */

   
case ("api" :: _) _ _ if !User.logged_in? => UnauthorizedResponse("")
   
case ("api" :: path) JsonGet _ =>
      trace
(s"GET request to $path")
     
get(path)

   
/* ... */
 
}

 
def get(path: List[String]) = path match {

   
/* ... */
 
   
case ("messages" :: Nil) =>
      trace
("messages requested")
     
JArray(Message.forCurrentUser.map(_.export))

   
case ("appointments" :: Nil) =>
      trace
("appointments requested")
     
JArray(Appointment.forCurrentUser.map(_.export))

   
/* ... */
 
}
}

This structure worked very well over the last six months. Now I've recently added the appointments route, and at app startup, requests to both routes are made to fetch some initial data. This is where the problem occurs: On every second or third server restart, the first set of requests from the app causes a complete server freeze. On the others it works just fine. This happens both on my development system and the production deployment.

One really odd thing striked me during the debugging: When I replace the lines in the RestDispatcher such that they simply return an empty JArray, everything works fine. But if I replace the accessor methods on the model objects such that they directly return empty lists (on the Message and Appointment objects: def forCurrentUser: List[Message] = { trace("forCurrentUser called") ; Nil }), the freeze occurs once again. The log messages from the RestDispatcher (messages/appointments requested) both show up in the log, the messages from the respective model objects (forCurrentUser called) do not. It breaks right before "leaving" the RestDispatcher in both cases.

At first, I thought that this issue can be solved by increasing the maxConcurrentRequests rule (although it is already set to 2 by default), but this didn't change anything. Also, there have always been other situations where multiple requests are made simultaneously, and it didn't ever cause any problems so far. I've tried it with both Lift 2.6.2 and 3.0-RC1, still no difference. I'm running the code through the sbt task jetty:start from earldouglas' xsbt-web-plugin.

When I look at the process through jconsole, I can see that after the freeze happened, it keeps on spawning worker threads (although no further requests are made). All these threads are in WAITING state, suspending on some AbstractQueuedSynchronizer$ConditionObject. I've tried to figure out where Lift uses these but couldn't quite find the right pieces of code. The login test you see in the RestDispatcher (User.logged_in?) is based on a SessionVar, and the first thing I thought about was a concurrency problem / deadlock situation. However, as far as I see it, all SessionVar.get calls are synchronized by default, aren't they? 

Can you maybe give me some hints on what else I might try to figure out what's going wrong here? Thanks a lot!

Stefan

Antonio Salazar Cardozo

unread,
Apr 11, 2016, 9:04:49 PM4/11/16
to Lift
Can you maybe post a thread dump taken while the application is deadlocked?

Additionally, can you show us what `forCurrentUser` looks like?

Those two things would probably be a good start to diagnosing the issue.
Thanks,
Antonio

Stefan

unread,
Apr 12, 2016, 2:37:59 AM4/12/16
to Lift
Hello Antonio,

sure, thanks. I'm attaching two subsequent thread dumps, you can see that the number of threads keeps growing although there is no new activity. It appearantly stops spawning threads after pool-4-thread-16.


The forCurrentUser methods look like this (using Rogue; analogous for the other model class):

def forCurrentUser: List[Message] = {

  info
("calling Message.forCurrentUser")
  info
(s"user id is ${User.currentId}")
 
Message.where(_.user eqs User.currentId).fetch()
}


However, as mentioned, the error still occurs when I just change both methods to:

def forCurrentUser: List[Message] = {

  info
("calling Message.forCurrentUser")
 
Nil
}


The User.currentId method looks like this:

object User {
 
 
/* ... */

 
private object currentIdVar extends SessionVar[Box[ObjectId]](Empty)

 
def currentId = currentIdVar.get.openOr(ObjectId.get) // returns a valid but nonexistant ObjectId if none is stored
 
def logged_in? = currentIdVar.get.isDefined

 
/* ... */
}


Is there something else that might be helpful?

Thanks a lot for your help!

Stefan
threaddump2.txt
threaddump.txt

Antonio Salazar Cardozo

unread,
Apr 12, 2016, 8:58:43 AM4/12/16
to Lift
I don't think the new threads have much to do with it—it's normal for jetty to
spawn a bunch of threads (in a thread pool) that it doesn't immediately use
and end up waiting or timed_waiting on various locks until they're needed.

I'm a little more interested in these:

	at de.skosoft.bollogg.model.ClicklePk$stream$.<init>(Clickle.scala:332)
	at de.skosoft.bollogg.model.ClicklePk.stream$lzycompute(Clickle.scala:332)
	- locked <0x000000077069b948> (a de.skosoft.bollogg.model.ClicklePk$)

And:

"qtp885951223-16" #16 prio=5 os_prio=0 tid=0x00007fafa024f800 nid=0x246f in Object.wait() [0x00007faf89d9f000]
   java.lang.Thread.State: RUNNABLE
	at de.skosoft.bollogg.model.Clickle$id$.<init>(Clickle.scala:33)
	at de.skosoft.bollogg.model.Clickle.id$lzycompute(Clickle.scala:33)
	- locked <0x000000077048fad0> (a de.skosoft.bollogg.model.Clickle$)
	at de.skosoft.bollogg.model.Clickle.id(Clickle.scala:33)

They seem like the more likely source of the issue. How are the Clickle id
and the Clickle stream defined? Going through those two thread traces may
yield some interdependencies that are causing unexpected deadlocking.
Thanks,
Antonio

Stefan

unread,
Apr 12, 2016, 10:03:27 AM4/12/16
to Lift
Looked at those parts, and it gave me a hint that didn't yet really point me to the problem, but at least enabled me to create a workaround. Thank you so much, Antonio, this was really driving me nuts.

I didn't yet quite identify the interdependency, but both models contain a reference to a third one, Stream. When the server first starts and the app is the first to make contact, the two requests in question (appointments and messages) both seem to kick of the initialization of the respective Record and MetaRecord classes (which I didn't even know was happening actually). These seem to work recursively on their member types and both reach the Stream model at some point, more or less simultaneously, and this seems to cause the deadlock. 

Kicking off this initialization of the model classes at boot time made the error go away, and gave not only me quite some relief.

Once things have calmed a little here and I've finished the things on my list I would have liked to finish yesterday already, I'll take another deeper look at this, try to get to the actual source of the problem and post a little more about it. Maybe it will help someone else.

Thanks again!

Antonio Salazar Cardozo

unread,
Apr 12, 2016, 10:12:15 AM4/12/16
to Lift
Awesome, glad that helped and hope you get to the bottom of it! It certainly
sounds like the kind of unexpected whoopsie that can kick in at unexpected
times :) Looking forward to the results of your more detailed investigation!
Thanks,
Antonio
Reply all
Reply to author
Forward
0 new messages