Hey folks,
In my eternal quest against comet-related NullPointerExceptions, I took another deep dive into the comet code, helped by a handy jdb stack trace for one of the aforementioned NPEs. What I found may help. First, a quick summary of what we've newly observed:
When the server starts hitting NPEs, reloading the page left us a bit hosed, but clearing our cookies made things work significantly better. (We've previously found that the NPEs happen when the server is overloaded enough that it starts touching requests the container has already expired, which are therefore no longer valid.) This pointed to an issue with the session.
I'll be the first to admit that my familiarity with the comet code isn't 100%, so I'm going to propose a solution and see if anyone thinks it's a catastrophic idea before trying it; first, some details:
- session.cometForHost is currently only used in makeCometBreakoutDecision, so its sole purpose is basically to determine which requests to inspect to see if they need to be dropped (this mechanism exists so that we don't try to keep alive more requests than a given browser can support for a given host, by the way).
- That list is then inspected and the oldest N comet requests are dropped, where N is the difference between open requests and allowed requests for the given browser.
I propose we change session.cometForHost to return a tuple or case class representing (1) available, valid comets and (2) comets with dead requests that need to be dropped immediately. makeCometBreakoutDecision would then ask the appropriate ContinuationActors to BreakOut() in two passes: first the ones deemed invalid, and then, if too many requests are still open after that, the excess ones. We would deem a comet invalid if r.hostAndPath throws an exception of any sort.
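To make the shape of the proposal concrete, here is a rough sketch. It is not the actual Lift code: Req, CometsForHost, BreakoutSketch, requestsToBreakOut and the startedAt field are hypothetical stand-ins, and the real cometForHost and makeCometBreakoutDecision deal with ContinuationActors rather than plain values. It also assumes that a dead request cannot be matched to a host (since hostAndPath throws), so every dead request is returned regardless of host.

```scala
import scala.util.Try

// Hypothetical stand-in for a container request. hostAndPath throws once the
// container has expired the request, mimicking the NPE described above.
final case class Req(id: Int, startedAt: Long, host: Option[String]) {
  def hostAndPath: String =
    host.getOrElse(throw new NullPointerException(s"request $id expired"))
}

// Hypothetical result type for the proposed cometForHost.
final case class CometsForHost(valid: List[Req], dead: List[Req])

object BreakoutSketch {
  // Split open requests into valid ones for the given host and dead ones.
  // Try only catches non-fatal exceptions, which includes NPEs.
  def cometForHost(open: List[Req], hostAndPath: String): CometsForHost = {
    val classified = open.map(r => r -> Try(r.hostAndPath).toOption)
    val dead  = classified.collect { case (r, None) => r }
    val valid = classified.collect { case (r, Some(hp)) if hp == hostAndPath => r }
    CometsForHost(valid, dead)
  }

  // Requests that should receive BreakOut(): all dead ones first, then the
  // oldest N valid ones, where N = open valid requests minus allowed requests.
  def requestsToBreakOut(open: List[Req], hostAndPath: String, allowed: Int): List[Req] = {
    val CometsForHost(valid, dead) = cometForHost(open, hostAndPath)
    val excess = math.max(0, valid.size - allowed)
    dead ++ valid.sortBy(_.startedAt).take(excess)
  }
}
```

The point of classifying each request exactly once is that an exception from hostAndPath is contained at that spot and never propagates out of the session-level bookkeeping.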
The purpose here isn't to completely cure the NPE issue, but to make sure it does not poison the LiftSession, so that further requests can still work. This should also make NPEs recoverable to some extent: I believe that as long as your CometActor's lifeSpan isn't too low, a subsequent request should be able to reach the appropriate CometActor.
Let me know if there's anything I can clarify, or if anything here seems like it would break things terribly.
Thanks,
Antonio