More NPE love

196 views
Skip to first unread message

Antonio Salazar Cardozo

unread,
Jun 7, 2012, 10:52:06 AM6/7/12
to lif...@googlegroups.com
Hey folks,
In my eternal quest against comet-related NullPointerExceptions, I did another deep dive of the comet code with some help from a handy jdb stack trace for one of the aforementioned NPEs. My discovery might help some things. First, a quick summary of some new discoveries on our part:

When the server starts hitting NPEs (we've previously found that these happen when the server is overloaded enough with requests that it starts hitting requests that have been expired by the container, and therefore are no longer valid), we noticed that we were a bit hosed if we tried to reload the page, but if we cleared our cookies things worked significantly better. This pointed to an issue with the session.

After investigating the code, I think what's happening is that a given Lift session is actually poisoned by one of these invalid requests: the request breaks the session's makeCometBreakoutDecision (by throwing an NPE when we try to access the request's underlying server name), which in turn breaks LiftRules._getLiftSession (see https://github.com/lift/framework/blob/master/web/webkit/src/main/scala/net/liftweb/http/LiftRules.scala#L293 ), which, finally, breaks any stateful request (see https://github.com/lift/framework/blob/master/web/webkit/src/main/scala/net/liftweb/http/LiftServlet.scala#L273 ). Because this happens in what is essentially the process that decides which comet requests to go ahead and drop, we get into a situation where the broken request is never removed, therefore the LiftSession is “poisoned”: it is unable to process requests, and unable to get out of the state where it cannot process requests.

I'll be the first to admit that my familiarity with the comet code isn't 100%, so I'm going to propose a solution and see if anyone thinks it's a catastrophic idea before trying it; first, some details:

 - In LiftRules.makeCometBreakoutDecision, we call session.cometForHost. This is where things start throwing exceptions (see https://github.com/lift/framework/blob/master/web/webkit/src/main/scala/net/liftweb/http/LiftSession.scala#L615 ).
 - This is currently only used in makeCometBreakoutDecision, so its sole purpose is basically to determine which requests to inspect to see if they need to be dropped (this mechanism exists so that we don't try to keep alive more requests than a given browser can support for a given host, by the way).
 - That list is then inspected and the oldest N comet requests are dropped, where N is the difference between open requests and allowed requests for the given browser.

I propose we change session.cometForHost to return a tuple or case class. This will represent (a) available, valid comets and (b) comets with dead requests that need to be dropped immediately. makeCometBreakoutDecision will then ask the appropriate ContinuationActors to BreakOut() (a) if they've been deemed invalid and then (b) if there are too many requests still open after that. We would deem a comet invalid if r.hostAndPath throws an exception of any sort.

The purpose here isn't to completely cure the NPE issue, but to make sure it does not poison the LiftSession, so that further requests can still work. This should also make NPEs recoverable to some extent, in that I believe as long as your CometActor's lifeSpan isn't too low, a subsequent request should be able to catch and properly lead to the appropriate CometActor.

Let me know if there's anything I can clarify, or if anything here seems like it would break things terribly.
Thanks,
Antonio

David Pollak

unread,
Jun 7, 2012, 1:52:53 PM6/7/12
to lif...@googlegroups.com
+1

Thank you very, very much for chasing this issue down.

We haven't built M1, so go ahead and push it to master as soon as you want.

--
Lift, the simply functional web framework: http://liftweb.net
Code: http://github.com/lift
Discussion: http://groups.google.com/group/liftweb
Stuck? Help us help you: https://www.assembla.com/wiki/show/liftweb/Posting_example_code



--
Telegram, Simply Beautiful CMS https://telegr.am
Lift, the simply functional web framework http://liftweb.net


Peter Robinett

unread,
Jun 7, 2012, 2:40:51 PM6/7/12
to lif...@googlegroups.com
This is epic work and deserving of a Happy Lift'r or three!

Peter

David Pollak

unread,
Jun 7, 2012, 2:42:16 PM6/7/12
to lif...@googlegroups.com
On Thu, Jun 7, 2012 at 11:40 AM, Peter Robinett <pe...@bubblefoundry.com> wrote:
This is epic work and deserving of a Happy Lift'r or three!


+3
 

--

Jack Park

unread,
Jun 7, 2012, 2:54:23 PM6/7/12
to lif...@googlegroups.com
Indeed!

Antonio Salazar Cardozo

unread,
Jun 7, 2012, 4:20:15 PM6/7/12
to lif...@googlegroups.com
Haha I appreciate it, but let's not get ahead of ourselves. First things first: getting it up, running, and pushed ;) Then checking if it actually fixes the issue :p
Thanks,
Antonio

Antonio Salazar Cardozo

unread,
Jun 7, 2012, 6:38:29 PM6/7/12
to lif...@googlegroups.com
I've tested this locally and pushed the result to master. It doesn't seem to have broken existing comet code, but I have no confirmation that it fixes the issues I think it fixes. We can get this on our production servers either tomorrow or Monday (depending on a couple of other things) once Jenkins has spun a new version of 2.5-SNAPSHOT, if you'd prefer to have it battle-tested in at least one place before spinning M1.
Thanks,
Antonio

Diego Medina

unread,
Jun 7, 2012, 8:03:40 PM6/7/12
to lif...@googlegroups.com
Thanks for all the work Antonio!

On Thu, Jun 7, 2012 at 6:38 PM, Antonio Salazar Cardozo
Diego Medina
Lift/Scala Developer
di...@fmpwizard.com
http://www.fmpwizard.com

David Pollak

unread,
Jun 7, 2012, 8:36:31 PM6/7/12
to lif...@googlegroups.com
On Thu, Jun 7, 2012 at 5:03 PM, Diego Medina <di...@fmpwizard.com> wrote:
Thanks for all the work Antonio!

+1

Mads Hartmann Jensen

unread,
Jun 8, 2012, 2:50:50 AM6/8/12
to lif...@googlegroups.com
Guys, I'm not scanning all of the emails, please vote on the 
Happy Lift'r thread ;) 

Cheers,
Mads

Antonio Salazar Cardozo

unread,
Jun 11, 2012, 3:57:09 PM6/11/12
to lif...@googlegroups.com
We've been running this for ~16 hours or so on our production server, everything seems to be running smoothly. Haven't seen any NPEs either, but that often takes a couple of days after a restart to manifest. The important thing I was looking for is that nothing seems to have broken :)
Thanks,
Antonio

Antonio Salazar Cardozo

unread,
Jun 12, 2012, 2:35:46 PM6/12/12
to lif...@googlegroups.com
Last round: this morning during a rather hairy deploy, we saw the kind of load that generally stuffs us full of these NPEs, and not a one was to be seen. Not foolproof evidence, but as far as we can tell, this problem has been put to bed! :) We'll report back if we discover otherwise.
Thanks,
Antonio

Diego Medina

unread,
Jun 12, 2012, 3:15:23 PM6/12/12
to lif...@googlegroups.com
Great news!

On Tue, Jun 12, 2012 at 2:35 PM, Antonio Salazar Cardozo

David Pollak

unread,
Jun 12, 2012, 3:56:33 PM6/12/12
to lif...@googlegroups.com
On Tue, Jun 12, 2012 at 11:35 AM, Antonio Salazar Cardozo <savedf...@gmail.com> wrote:
Last round: this morning during a rather hairy deploy, we saw the kind of load that generally stuffs us full of these NPEs, and not a one was to be seen. Not foolproof evidence, but as far as we can tell, this problem has been put to bed! :) We'll report back if we discover otherwise.

Excellent!!!
Reply all
Reply to author
Forward
0 new messages