Hey folks,
In my eternal quest against comet-related NullPointerExceptions, I did another deep dive of the comet code with some help from a handy jdb stack trace for one of the aforementioned NPEs. My discovery might help some things. First, a quick summary of some new discoveries on our part:
When the server starts hitting NPEs (we've previously found that these happen when the server is overloaded enough with requests that it starts hitting requests that have been expired by the container, and therefore are no longer valid), we noticed that we were a bit hosed if we tried to reload the page, but if we cleared our cookies things worked significantly better. This pointed to an issue with the session.
After investigating the code, I think what's happening is that a given Lift session is actually poisoned by one of these invalid requests: the request breaks the session's makeCometBreakoutDecision (by throwing an NPE when we try to access the request's underlying server name), which in turn breaks LiftRules._getLiftSession (see https://github.com/lift/framework/blob/master/web/webkit/src/main/sca... ), which, finally, breaks any stateful request (see https://github.com/lift/framework/blob/master/web/webkit/src/main/sca... ). Because this happens in what is essentially the process that decides which comet requests to go ahead and drop, we get into a situation where the broken request is never removed, therefore the LiftSession is “poisoned”: it is unable to process requests, and unable to get out of the state where it cannot process requests.
I'll be the first to admit that my familiarity with the comet code isn't 100%, so I'm going to propose a solution and see if anyone thinks it's a catastrophic idea before trying it; first, some details:
- In LiftRules.makeCometBreakoutDecision, we call session.cometForHost. This is where things start throwing exceptions (see https://github.com/lift/framework/blob/master/web/webkit/src/main/sca... ).
- This is currently only used in makeCometBreakoutDecision, so its sole purpose is basically to determine which requests to inspect to see if they need to be dropped (this mechanism exists so that we don't try to keep alive more requests than a given browser can support for a given host, by the way).
- That list is then inspected and the oldest N comet requests are dropped, where N is the difference between open requests and allowed requests for the given browser.
I propose we change session.cometForHost to return a tuple or case class. This will represent (a) available, valid comets and (b) comets with dead requests that need to be dropped immediately. makeCometBreakoutDecision will then ask the appropriate ContinuationActors to BreakOut() (a) if they've been deemed invalid and then (b) if there are too many requests still open after that. We would deem a comet invalid if r.hostAndPath throws an exception of any sort.
The purpose here isn't to completely cure the NPE issue, but to make sure it does not poison the LiftSession, so that further requests can still work. This should also make NPEs recoverable to some extent, in that I believe as long as your CometActor's lifeSpan isn't too low, a subsequent request should be able to catch and properly lead to the appropriate CometActor.
Let me know if there's anything I can clarify, or if anything here seems like it would break things terribly.
Thanks,
Antonio
savedfastc...@gmail.com> wrote:
> Hey folks,
> In my eternal quest against comet-related NullPointerExceptions, I did
> another deep dive of the comet code with some help from a handy jdb stack
> trace for one of the aforementioned NPEs. My discovery might help some
> things. First, a quick summary of some new discoveries on our part:
> When the server starts hitting NPEs (we've previously found that these
> happen when the server is overloaded enough with requests that it starts
> hitting requests that have been expired by the container, and therefore are
> no longer valid), we noticed that we were a bit hosed if we tried to reload
> the page, but if we cleared our cookies things worked significantly better.
> This pointed to an issue with the session.
> After investigating the code, I think what's happening is that a given
> Lift session is actually poisoned by one of these invalid requests: the
> request breaks the session's makeCometBreakoutDecision (by throwing an NPE
> when we try to access the request's underlying server name), which in turn
> breaks LiftRules._getLiftSession (see
> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...), which, finally, breaks any stateful request (see
> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...). Because this happens in what is essentially the process that decides
> which comet requests to go ahead and drop, we get into a situation where
> the broken request is never removed, therefore the LiftSession is
> “poisoned”: it is unable to process requests, and unable to get out of the
> state where it cannot process requests.
> I'll be the first to admit that my familiarity with the comet code isn't
> 100%, so I'm going to propose a solution and see if anyone thinks it's a
> catastrophic idea before trying it; first, some details:
> - In LiftRules.makeCometBreakoutDecision, we call session.cometForHost.
> This is where things start throwing exceptions (see
> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...).
> - This is currently only used in makeCometBreakoutDecision, so its sole
> purpose is basically to determine which requests to inspect to see if they
> need to be dropped (this mechanism exists so that we don't try to keep
> alive more requests than a given browser can support for a given host, by
> the way).
> - That list is then inspected and the oldest N comet requests are
> dropped, where N is the difference between open requests and allowed
> requests for the given browser.
> I propose we change session.cometForHost to return a tuple or case class.
> This will represent (a) available, valid comets and (b) comets with dead
> requests that need to be dropped immediately. makeCometBreakoutDecision
> will then ask the appropriate ContinuationActors to BreakOut() (a) if
> they've been deemed invalid and then (b) if there are too many requests
> still open after that. We would deem a comet invalid if r.hostAndPath
> throws an exception of any sort.
> The purpose here isn't to completely cure the NPE issue, but to make sure
> it does not poison the LiftSession, so that further requests can still
> work. This should also make NPEs recoverable to some extent, in that I
> believe as long as your CometActor's lifeSpan isn't too low, a subsequent
> request should be able to catch and properly lead to the appropriate
> CometActor.
> Let me know if there's anything I can clarify, or if anything here seems
> like it would break things terribly.
> Thanks,
> Antonio
On Thursday, June 7, 2012 4:52:06 PM UTC+2, Antonio Salazar Cardozo wrote:
> Hey folks,
> In my eternal quest against comet-related NullPointerExceptions, I did > another deep dive of the comet code with some help from a handy jdb stack > trace for one of the aforementioned NPEs. My discovery might help some > things. First, a quick summary of some new discoveries on our part:
> When the server starts hitting NPEs (we've previously found that these > happen when the server is overloaded enough with requests that it starts > hitting requests that have been expired by the container, and therefore are > no longer valid), we noticed that we were a bit hosed if we tried to reload > the page, but if we cleared our cookies things worked significantly better. > This pointed to an issue with the session.
> After investigating the code, I think what's happening is that a given > Lift session is actually poisoned by one of these invalid requests: the > request breaks the session's makeCometBreakoutDecision (by throwing an NPE > when we try to access the request's underlying server name), which in turn > breaks LiftRules._getLiftSession (see > https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...), which, finally, breaks any stateful request (see > https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...). Because this happens in what is essentially the process that decides > which comet requests to go ahead and drop, we get into a situation where > the broken request is never removed, therefore the LiftSession is > “poisoned”: it is unable to process requests, and unable to get out of the > state where it cannot process requests.
> I'll be the first to admit that my familiarity with the comet code isn't > 100%, so I'm going to propose a solution and see if anyone thinks it's a > catastrophic idea before trying it; first, some details:
> - In LiftRules.makeCometBreakoutDecision, we call session.cometForHost. > This is where things start throwing exceptions (see > https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...).
> - This is currently only used in makeCometBreakoutDecision, so its sole > purpose is basically to determine which requests to inspect to see if they > need to be dropped (this mechanism exists so that we don't try to keep > alive more requests than a given browser can support for a given host, by > the way).
> - That list is then inspected and the oldest N comet requests are > dropped, where N is the difference between open requests and allowed > requests for the given browser.
> I propose we change session.cometForHost to return a tuple or case class. > This will represent (a) available, valid comets and (b) comets with dead > requests that need to be dropped immediately. makeCometBreakoutDecision > will then ask the appropriate ContinuationActors to BreakOut() (a) if > they've been deemed invalid and then (b) if there are too many requests > still open after that. We would deem a comet invalid if r.hostAndPath > throws an exception of any sort.
> The purpose here isn't to completely cure the NPE issue, but to make sure > it does not poison the LiftSession, so that further requests can still > work. This should also make NPEs recoverable to some extent, in that I > believe as long as your CometActor's lifeSpan isn't too low, a subsequent > request should be able to catch and properly lead to the appropriate > CometActor.
> Let me know if there's anything I can clarify, or if anything here seems > like it would break things terribly.
> Thanks,
> Antonio
> On Thursday, June 7, 2012 4:52:06 PM UTC+2, Antonio Salazar Cardozo wrote:
>> Hey folks,
>> In my eternal quest against comet-related NullPointerExceptions, I did
>> another deep dive of the comet code with some help from a handy jdb stack
>> trace for one of the aforementioned NPEs. My discovery might help some
>> things. First, a quick summary of some new discoveries on our part:
>> When the server starts hitting NPEs (we've previously found that these
>> happen when the server is overloaded enough with requests that it starts
>> hitting requests that have been expired by the container, and therefore are
>> no longer valid), we noticed that we were a bit hosed if we tried to reload
>> the page, but if we cleared our cookies things worked significantly better.
>> This pointed to an issue with the session.
>> After investigating the code, I think what's happening is that a given
>> Lift session is actually poisoned by one of these invalid requests: the
>> request breaks the session's makeCometBreakoutDecision (by throwing an NPE
>> when we try to access the request's underlying server name), which in turn
>> breaks LiftRules._getLiftSession (see https://github.com/lift/** >> framework/blob/master/web/**webkit/src/main/scala/net/**
>> liftweb/http/LiftRules.scala#**L293<https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...>), which, finally, breaks any stateful request (see
>> https://github.com/lift/**framework/blob/master/web/** >> webkit/src/main/scala/net/**liftweb/http/LiftServlet.**scala#L273<https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...>). Because this happens in what is essentially the process that decides
>> which comet requests to go ahead and drop, we get into a situation where
>> the broken request is never removed, therefore the LiftSession is
>> “poisoned”: it is unable to process requests, and unable to get out of the
>> state where it cannot process requests.
>> I'll be the first to admit that my familiarity with the comet code isn't
>> 100%, so I'm going to propose a solution and see if anyone thinks it's a
>> catastrophic idea before trying it; first, some details:
>> - In LiftRules.**makeCometBreakoutDecision, we call
>> session.cometForHost. This is where things start throwing exceptions (see
>> https://github.com/lift/**framework/blob/master/web/** >> webkit/src/main/scala/net/**liftweb/http/LiftSession.**scala#L615<https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...>).
>> - This is currently only used in makeCometBreakoutDecision, so its sole
>> purpose is basically to determine which requests to inspect to see if they
>> need to be dropped (this mechanism exists so that we don't try to keep
>> alive more requests than a given browser can support for a given host, by
>> the way).
>> - That list is then inspected and the oldest N comet requests are
>> dropped, where N is the difference between open requests and allowed
>> requests for the given browser.
>> I propose we change session.cometForHost to return a tuple or case class.
>> This will represent (a) available, valid comets and (b) comets with dead
>> requests that need to be dropped immediately. makeCometBreakoutDecision
>> will then ask the appropriate ContinuationActors to BreakOut() (a) if
>> they've been deemed invalid and then (b) if there are too many requests
>> still open after that. We would deem a comet invalid if r.hostAndPath
>> throws an exception of any sort.
>> The purpose here isn't to completely cure the NPE issue, but to make sure
>> it does not poison the LiftSession, so that further requests can still
>> work. This should also make NPEs recoverable to some extent, in that I
>> believe as long as your CometActor's lifeSpan isn't too low, a subsequent
>> request should be able to catch and properly lead to the appropriate
>> CometActor.
>> Let me know if there's anything I can clarify, or if anything here seems
>> like it would break things terribly.
>> Thanks,
>> Antonio
> On Thu, Jun 7, 2012 at 11:40 AM, Peter Robinett <pe...@bubblefoundry.com>
> wrote:
>> This is epic work and deserving of a Happy Lift'r or three!
> +3
>> Peter
>> On Thursday, June 7, 2012 4:52:06 PM UTC+2, Antonio Salazar Cardozo wrote:
>>> Hey folks,
>>> In my eternal quest against comet-related NullPointerExceptions, I did
>>> another deep dive of the comet code with some help from a handy jdb stack
>>> trace for one of the aforementioned NPEs. My discovery might help some
>>> things. First, a quick summary of some new discoveries on our part:
>>> When the server starts hitting NPEs (we've previously found that these
>>> happen when the server is overloaded enough with requests that it starts
>>> hitting requests that have been expired by the container, and therefore are
>>> no longer valid), we noticed that we were a bit hosed if we tried to reload
>>> the page, but if we cleared our cookies things worked significantly better.
>>> This pointed to an issue with the session.
>>> After investigating the code, I think what's happening is that a given
>>> Lift session is actually poisoned by one of these invalid requests: the
>>> request breaks the session's makeCometBreakoutDecision (by throwing an NPE
>>> when we try to access the request's underlying server name), which in turn
>>> breaks LiftRules._getLiftSession (see
>>> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca... >>> ), which, finally, breaks any stateful request (see
>>> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca... >>> ). Because this happens in what is essentially the process that decides
>>> which comet requests to go ahead and drop, we get into a situation where the
>>> broken request is never removed, therefore the LiftSession is “poisoned”: it
>>> is unable to process requests, and unable to get out of the state where it
>>> cannot process requests.
>>> I'll be the first to admit that my familiarity with the comet code isn't
>>> 100%, so I'm going to propose a solution and see if anyone thinks it's a
>>> catastrophic idea before trying it; first, some details:
>>> - In LiftRules.makeCometBreakoutDecision, we call session.cometForHost.
>>> This is where things start throwing exceptions (see
>>> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca... >>> ).
>>> - This is currently only used in makeCometBreakoutDecision, so its sole
>>> purpose is basically to determine which requests to inspect to see if they
>>> need to be dropped (this mechanism exists so that we don't try to keep alive
>>> more requests than a given browser can support for a given host, by the
>>> way).
>>> - That list is then inspected and the oldest N comet requests are
>>> dropped, where N is the difference between open requests and allowed
>>> requests for the given browser.
>>> I propose we change session.cometForHost to return a tuple or case class.
>>> This will represent (a) available, valid comets and (b) comets with dead
>>> requests that need to be dropped immediately. makeCometBreakoutDecision will
>>> then ask the appropriate ContinuationActors to BreakOut() (a) if they've
>>> been deemed invalid and then (b) if there are too many requests still open
>>> after that. We would deem a comet invalid if r.hostAndPath throws an
>>> exception of any sort.
>>> The purpose here isn't to completely cure the NPE issue, but to make sure
>>> it does not poison the LiftSession, so that further requests can still work.
>>> This should also make NPEs recoverable to some extent, in that I believe as
>>> long as your CometActor's lifeSpan isn't too low, a subsequent request
>>> should be able to catch and properly lead to the appropriate CometActor.
>>> Let me know if there's anything I can clarify, or if anything here seems
>>> like it would break things terribly.
>>> Thanks,
>>> Antonio
Haha I appreciate it, but let's not get ahead of ourselves. First things first: getting it up, running, and pushed ;) Then checking if it actually fixes the issue :p
Thanks,
Antonio
On Thursday, June 7, 2012 2:40:51 PM UTC-4, Peter Robinett wrote:
> This is epic work and deserving of a Happy Lift'r or three!
> Peter
> On Thursday, June 7, 2012 4:52:06 PM UTC+2, Antonio Salazar Cardozo wrote:
>> Hey folks,
>> In my eternal quest against comet-related NullPointerExceptions, I did >> another deep dive of the comet code with some help from a handy jdb stack >> trace for one of the aforementioned NPEs. My discovery might help some >> things. First, a quick summary of some new discoveries on our part:
>> When the server starts hitting NPEs (we've previously found that these >> happen when the server is overloaded enough with requests that it starts >> hitting requests that have been expired by the container, and therefore are >> no longer valid), we noticed that we were a bit hosed if we tried to reload >> the page, but if we cleared our cookies things worked significantly better. >> This pointed to an issue with the session.
>> After investigating the code, I think what's happening is that a given >> Lift session is actually poisoned by one of these invalid requests: the >> request breaks the session's makeCometBreakoutDecision (by throwing an NPE >> when we try to access the request's underlying server name), which in turn >> breaks LiftRules._getLiftSession (see >> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...), which, finally, breaks any stateful request (see >> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...). Because this happens in what is essentially the process that decides >> which comet requests to go ahead and drop, we get into a situation where >> the broken request is never removed, therefore the LiftSession is >> “poisoned”: it is unable to process requests, and unable to get out of the >> state where it cannot process requests.
>> I'll be the first to admit that my familiarity with the comet code isn't >> 100%, so I'm going to propose a solution and see if anyone thinks it's a >> catastrophic idea before trying it; first, some details:
>> - In LiftRules.makeCometBreakoutDecision, we call session.cometForHost. >> This is where things start throwing exceptions (see >> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...).
>> - This is currently only used in makeCometBreakoutDecision, so its sole >> purpose is basically to determine which requests to inspect to see if they >> need to be dropped (this mechanism exists so that we don't try to keep >> alive more requests than a given browser can support for a given host, by >> the way).
>> - That list is then inspected and the oldest N comet requests are >> dropped, where N is the difference between open requests and allowed >> requests for the given browser.
>> I propose we change session.cometForHost to return a tuple or case class. >> This will represent (a) available, valid comets and (b) comets with dead >> requests that need to be dropped immediately. makeCometBreakoutDecision >> will then ask the appropriate ContinuationActors to BreakOut() (a) if >> they've been deemed invalid and then (b) if there are too many requests >> still open after that. We would deem a comet invalid if r.hostAndPath >> throws an exception of any sort.
>> The purpose here isn't to completely cure the NPE issue, but to make sure >> it does not poison the LiftSession, so that further requests can still >> work. This should also make NPEs recoverable to some extent, in that I >> believe as long as your CometActor's lifeSpan isn't too low, a subsequent >> request should be able to catch and properly lead to the appropriate >> CometActor.
>> Let me know if there's anything I can clarify, or if anything here seems >> like it would break things terribly.
>> Thanks,
>> Antonio
I've tested this locally and pushed the result to master. It doesn't seem to have broken existing comet code, but I have no confirmation that it fixes the issues I think it fixes. We can get this on our production servers either tomorrow or Monday (depending on a couple of other things) once Jenkins has spun a new version of 2.5-SNAPSHOT, if you'd prefer to have it battle-tested in at least one place before spinning M1.
Thanks,
Antonio
On Thursday, June 7, 2012 1:52:53 PM UTC-4, David Pollak wrote:
> +1
> Thank you very, very much for chasing this issue down.
> We haven't built M1, so go ahead and push it to master as soon as you want.
> On Thu, Jun 7, 2012 at 7:52 AM, Antonio Salazar Cardozo <
> savedfastc...@gmail.com> wrote:
>> Hey folks,
>> In my eternal quest against comet-related NullPointerExceptions, I did >> another deep dive of the comet code with some help from a handy jdb stack >> trace for one of the aforementioned NPEs. My discovery might help some >> things. First, a quick summary of some new discoveries on our part:
>> When the server starts hitting NPEs (we've previously found that these >> happen when the server is overloaded enough with requests that it starts >> hitting requests that have been expired by the container, and therefore are >> no longer valid), we noticed that we were a bit hosed if we tried to reload >> the page, but if we cleared our cookies things worked significantly better. >> This pointed to an issue with the session.
>> After investigating the code, I think what's happening is that a given >> Lift session is actually poisoned by one of these invalid requests: the >> request breaks the session's makeCometBreakoutDecision (by throwing an NPE >> when we try to access the request's underlying server name), which in turn >> breaks LiftRules._getLiftSession (see >> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...), which, finally, breaks any stateful request (see >> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...). Because this happens in what is essentially the process that decides >> which comet requests to go ahead and drop, we get into a situation where >> the broken request is never removed, therefore the LiftSession is >> “poisoned”: it is unable to process requests, and unable to get out of the >> state where it cannot process requests.
>> I'll be the first to admit that my familiarity with the comet code isn't >> 100%, so I'm going to propose a solution and see if anyone thinks it's a >> catastrophic idea before trying it; first, some details:
>> - In LiftRules.makeCometBreakoutDecision, we call session.cometForHost. >> This is where things start throwing exceptions (see >> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...).
>> - This is currently only used in makeCometBreakoutDecision, so its sole >> purpose is basically to determine which requests to inspect to see if they >> need to be dropped (this mechanism exists so that we don't try to keep >> alive more requests than a given browser can support for a given host, by >> the way).
>> - That list is then inspected and the oldest N comet requests are >> dropped, where N is the difference between open requests and allowed >> requests for the given browser.
>> I propose we change session.cometForHost to return a tuple or case class. >> This will represent (a) available, valid comets and (b) comets with dead >> requests that need to be dropped immediately. makeCometBreakoutDecision >> will then ask the appropriate ContinuationActors to BreakOut() (a) if >> they've been deemed invalid and then (b) if there are too many requests >> still open after that. We would deem a comet invalid if r.hostAndPath >> throws an exception of any sort.
>> The purpose here isn't to completely cure the NPE issue, but to make sure >> it does not poison the LiftSession, so that further requests can still >> work. This should also make NPEs recoverable to some extent, in that I >> believe as long as your CometActor's lifeSpan isn't too low, a subsequent >> request should be able to catch and properly lead to the appropriate >> CometActor.
>> Let me know if there's anything I can clarify, or if anything here seems >> like it would break things terribly.
>> Thanks,
>> Antonio
<savedfastc...@gmail.com> wrote:
> I've tested this locally and pushed the result to master. It doesn't seem to
> have broken existing comet code, but I have no confirmation that it fixes
> the issues I think it fixes. We can get this on our production servers
> either tomorrow or Monday (depending on a couple of other things) once
> Jenkins has spun a new version of 2.5-SNAPSHOT, if you'd prefer to have it
> battle-tested in at least one place before spinning M1.
> Thanks,
> Antonio
> On Thursday, June 7, 2012 1:52:53 PM UTC-4, David Pollak wrote:
>> +1
>> Thank you very, very much for chasing this issue down.
>> We haven't built M1, so go ahead and push it to master as soon as you
>> want.
>> On Thu, Jun 7, 2012 at 7:52 AM, Antonio Salazar Cardozo
>> <savedfastc...@gmail.com> wrote:
>>> Hey folks,
>>> In my eternal quest against comet-related NullPointerExceptions, I did
>>> another deep dive of the comet code with some help from a handy jdb stack
>>> trace for one of the aforementioned NPEs. My discovery might help some
>>> things. First, a quick summary of some new discoveries on our part:
>>> When the server starts hitting NPEs (we've previously found that these
>>> happen when the server is overloaded enough with requests that it starts
>>> hitting requests that have been expired by the container, and therefore are
>>> no longer valid), we noticed that we were a bit hosed if we tried to reload
>>> the page, but if we cleared our cookies things worked significantly better.
>>> This pointed to an issue with the session.
>>> After investigating the code, I think what's happening is that a given
>>> Lift session is actually poisoned by one of these invalid requests: the
>>> request breaks the session's makeCometBreakoutDecision (by throwing an NPE
>>> when we try to access the request's underlying server name), which in turn
>>> breaks LiftRules._getLiftSession (see
>>> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca... >>> ), which, finally, breaks any stateful request (see
>>> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca... >>> ). Because this happens in what is essentially the process that decides
>>> which comet requests to go ahead and drop, we get into a situation where the
>>> broken request is never removed, therefore the LiftSession is “poisoned”: it
>>> is unable to process requests, and unable to get out of the state where it
>>> cannot process requests.
>>> I'll be the first to admit that my familiarity with the comet code isn't
>>> 100%, so I'm going to propose a solution and see if anyone thinks it's a
>>> catastrophic idea before trying it; first, some details:
>>> - In LiftRules.makeCometBreakoutDecision, we call session.cometForHost.
>>> This is where things start throwing exceptions (see
>>> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca... >>> ).
>>> - This is currently only used in makeCometBreakoutDecision, so its sole
>>> purpose is basically to determine which requests to inspect to see if they
>>> need to be dropped (this mechanism exists so that we don't try to keep alive
>>> more requests than a given browser can support for a given host, by the
>>> way).
>>> - That list is then inspected and the oldest N comet requests are
>>> dropped, where N is the difference between open requests and allowed
>>> requests for the given browser.
>>> I propose we change session.cometForHost to return a tuple or case class.
>>> This will represent (a) available, valid comets and (b) comets with dead
>>> requests that need to be dropped immediately. makeCometBreakoutDecision will
>>> then ask the appropriate ContinuationActors to BreakOut() (a) if they've
>>> been deemed invalid and then (b) if there are too many requests still open
>>> after that. We would deem a comet invalid if r.hostAndPath throws an
>>> exception of any sort.
>>> The purpose here isn't to completely cure the NPE issue, but to make sure
>>> it does not poison the LiftSession, so that further requests can still work.
>>> This should also make NPEs recoverable to some extent, in that I believe as
>>> long as your CometActor's lifeSpan isn't too low, a subsequent request
>>> should be able to catch and properly lead to the appropriate CometActor.
>>> Let me know if there's anything I can clarify, or if anything here seems
>>> like it would break things terribly.
>>> Thanks,
>>> Antonio
> On Thu, Jun 7, 2012 at 6:38 PM, Antonio Salazar Cardozo
> <savedfastc...@gmail.com> wrote:
> > I've tested this locally and pushed the result to master. It doesn't
> seem to
> > have broken existing comet code, but I have no confirmation that it fixes
> > the issues I think it fixes. We can get this on our production servers
> > either tomorrow or Monday (depending on a couple of other things) once
> > Jenkins has spun a new version of 2.5-SNAPSHOT, if you'd prefer to have
> it
> > battle-tested in at least one place before spinning M1.
> > Thanks,
> > Antonio
> > On Thursday, June 7, 2012 1:52:53 PM UTC-4, David Pollak wrote:
> >> +1
> >> Thank you very, very much for chasing this issue down.
> >> We haven't built M1, so go ahead and push it to master as soon as you
> >> want.
> >> On Thu, Jun 7, 2012 at 7:52 AM, Antonio Salazar Cardozo
> >> <savedfastc...@gmail.com> wrote:
> >>> Hey folks,
> >>> In my eternal quest against comet-related NullPointerExceptions, I did
> >>> another deep dive of the comet code with some help from a handy jdb
> stack
> >>> trace for one of the aforementioned NPEs. My discovery might help some
> >>> things. First, a quick summary of some new discoveries on our part:
> >>> When the server starts hitting NPEs (we've previously found that these
> >>> happen when the server is overloaded enough with requests that it
> starts
> >>> hitting requests that have been expired by the container, and
> therefore are
> >>> no longer valid), we noticed that we were a bit hosed if we tried to
> reload
> >>> the page, but if we cleared our cookies things worked significantly
> better.
> >>> This pointed to an issue with the session.
> >>> After investigating the code, I think what's happening is that a given
> >>> Lift session is actually poisoned by one of these invalid requests: the
> >>> request breaks the session's makeCometBreakoutDecision (by throwing an
> NPE
> >>> when we try to access the request's underlying server name), which in
> turn
> >>> breaks LiftRules._getLiftSession (see
> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca... > >>> ). Because this happens in what is essentially the process that decides
> >>> which comet requests to go ahead and drop, we get into a situation
> where the
> >>> broken request is never removed, therefore the LiftSession is
> “poisoned”: it
> >>> is unable to process requests, and unable to get out of the state
> where it
> >>> cannot process requests.
> >>> I'll be the first to admit that my familiarity with the comet code
> isn't
> >>> 100%, so I'm going to propose a solution and see if anyone thinks it's
> a
> >>> catastrophic idea before trying it; first, some details:
> >>> - In LiftRules.makeCometBreakoutDecision, we call
> session.cometForHost.
> >>> This is where things start throwing exceptions (see
> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca... > >>> ).
> >>> - This is currently only used in makeCometBreakoutDecision, so its
> sole
> >>> purpose is basically to determine which requests to inspect to see if
> they
> >>> need to be dropped (this mechanism exists so that we don't try to keep
> alive
> >>> more requests than a given browser can support for a given host, by the
> >>> way).
> >>> - That list is then inspected and the oldest N comet requests are
> >>> dropped, where N is the difference between open requests and allowed
> >>> requests for the given browser.
> >>> I propose we change session.cometForHost to return a tuple or case
> class.
> >>> This will represent (a) available, valid comets and (b) comets with
> dead
> >>> requests that need to be dropped immediately.
> makeCometBreakoutDecision will
> >>> then ask the appropriate ContinuationActors to BreakOut() (a) if
> they've
> >>> been deemed invalid and then (b) if there are too many requests still
> open
> >>> after that. We would deem a comet invalid if r.hostAndPath throws an
> >>> exception of any sort.
> >>> The purpose here isn't to completely cure the NPE issue, but to make
> sure
> >>> it does not poison the LiftSession, so that further requests can still
> work.
> >>> This should also make NPEs recoverable to some extent, in that I
> believe as
> >>> long as your CometActor's lifeSpan isn't too low, a subsequent request
> >>> should be able to catch and properly lead to the appropriate
> CometActor.
> >>> Let me know if there's anything I can clarify, or if anything here
> seems
> >>> like it would break things terribly.
> >>> Thanks,
> >>> Antonio
> This is epic work and deserving of a Happy Lift'r or three!
> Peter
> On Thursday, June 7, 2012 4:52:06 PM UTC+2, Antonio Salazar Cardozo wrote:
> Hey folks,
> In my eternal quest against comet-related NullPointerExceptions, I did another deep dive of the comet code with some help from a handy jdb stack trace for one of the aforementioned NPEs. My discovery might help some things. First, a quick summary of some new discoveries on our part:
> When the server starts hitting NPEs (we've previously found that these happen when the server is overloaded enough with requests that it starts hitting requests that have been expired by the container, and therefore are no longer valid), we noticed that we were a bit hosed if we tried to reload the page, but if we cleared our cookies things worked significantly better. This pointed to an issue with the session.
> After investigating the code, I think what's happening is that a given Lift session is actually poisoned by one of these invalid requests: the request breaks the session's makeCometBreakoutDecision (by throwing an NPE when we try to access the request's underlying server name), which in turn breaks LiftRules._getLiftSession (see https://github.com/lift/framework/blob/master/web/webkit/src/main/sca... ), which, finally, breaks any stateful request (see https://github.com/lift/framework/blob/master/web/webkit/src/main/sca... ). Because this happens in what is essentially the process that decides which comet requests to go ahead and drop, we get into a situation where the broken request is never removed, therefore the LiftSession is “poisoned”: it is unable to process requests, and unable to get out of the state where it cannot process requests.
> I'll be the first to admit that my familiarity with the comet code isn't 100%, so I'm going to propose a solution and see if anyone thinks it's a catastrophic idea before trying it; first, some details:
> - In LiftRules.makeCometBreakoutDecision, we call session.cometForHost. This is where things start throwing exceptions (see https://github.com/lift/framework/blob/master/web/webkit/src/main/sca... ).
> - This is currently only used in makeCometBreakoutDecision, so its sole purpose is basically to determine which requests to inspect to see if they need to be dropped (this mechanism exists so that we don't try to keep alive more requests than a given browser can support for a given host, by the way).
> - That list is then inspected and the oldest N comet requests are dropped, where N is the difference between open requests and allowed requests for the given browser.
> I propose we change session.cometForHost to return a tuple or case class. This will represent (a) available, valid comets and (b) comets with dead requests that need to be dropped immediately. makeCometBreakoutDecision will then ask the appropriate ContinuationActors to BreakOut() (a) if they've been deemed invalid and then (b) if there are too many requests still open after that. We would deem a comet invalid if r.hostAndPath throws an exception of any sort.
> The purpose here isn't to completely cure the NPE issue, but to make sure it does not poison the LiftSession, so that further requests can still work. This should also make NPEs recoverable to some extent, in that I believe as long as your CometActor's lifeSpan isn't too low, a subsequent request should be able to catch and properly lead to the appropriate CometActor.
> Let me know if there's anything I can clarify, or if anything here seems like it would break things terribly.
> Thanks,
> Antonio
We've been running this for ~16 hours or so on our production server, everything seems to be running smoothly. Haven't seen any NPEs either, but that often takes a couple of days after a restart to manifest. The important thing I was looking for is that nothing seems to have broken :)
Thanks,
Antonio
On Thursday, June 7, 2012 6:38:29 PM UTC-4, Antonio Salazar Cardozo wrote:
> I've tested this locally and pushed the result to master. It doesn't seem > to have broken existing comet code, but I have no confirmation that it > fixes the issues I think it fixes. We can get this on our production > servers either tomorrow or Monday (depending on a couple of other things) > once Jenkins has spun a new version of 2.5-SNAPSHOT, if you'd prefer to > have it battle-tested in at least one place before spinning M1.
> Thanks,
> Antonio
> On Thursday, June 7, 2012 1:52:53 PM UTC-4, David Pollak wrote:
>> +1
>> Thank you very, very much for chasing this issue down.
>> We haven't built M1, so go ahead and push it to master as soon as you >> want.
>> On Thu, Jun 7, 2012 at 7:52 AM, Antonio Salazar Cardozo <
>> savedfastc...@gmail.com> wrote:
>>> Hey folks,
>>> In my eternal quest against comet-related NullPointerExceptions, I did >>> another deep dive of the comet code with some help from a handy jdb stack >>> trace for one of the aforementioned NPEs. My discovery might help some >>> things. First, a quick summary of some new discoveries on our part:
>>> When the server starts hitting NPEs (we've previously found that these >>> happen when the server is overloaded enough with requests that it starts >>> hitting requests that have been expired by the container, and therefore are >>> no longer valid), we noticed that we were a bit hosed if we tried to reload >>> the page, but if we cleared our cookies things worked significantly better. >>> This pointed to an issue with the session.
>>> After investigating the code, I think what's happening is that a given >>> Lift session is actually poisoned by one of these invalid requests: the >>> request breaks the session's makeCometBreakoutDecision (by throwing an NPE >>> when we try to access the request's underlying server name), which in turn >>> breaks LiftRules._getLiftSession (see >>> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...), which, finally, breaks any stateful request (see >>> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...). Because this happens in what is essentially the process that decides >>> which comet requests to go ahead and drop, we get into a situation where >>> the broken request is never removed, therefore the LiftSession is >>> “poisoned”: it is unable to process requests, and unable to get out of the >>> state where it cannot process requests.
>>> I'll be the first to admit that my familiarity with the comet code isn't >>> 100%, so I'm going to propose a solution and see if anyone thinks it's a >>> catastrophic idea before trying it; first, some details:
>>> - In LiftRules.makeCometBreakoutDecision, we call session.cometForHost. >>> This is where things start throwing exceptions (see >>> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...).
>>> - This is currently only used in makeCometBreakoutDecision, so its sole >>> purpose is basically to determine which requests to inspect to see if they >>> need to be dropped (this mechanism exists so that we don't try to keep >>> alive more requests than a given browser can support for a given host, by >>> the way).
>>> - That list is then inspected and the oldest N comet requests are >>> dropped, where N is the difference between open requests and allowed >>> requests for the given browser.
>>> I propose we change session.cometForHost to return a tuple or case >>> class. This will represent (a) available, valid comets and (b) comets with >>> dead requests that need to be dropped immediately. >>> makeCometBreakoutDecision will then ask the appropriate ContinuationActors >>> to BreakOut() (a) if they've been deemed invalid and then (b) if there are >>> too many requests still open after that. We would deem a comet invalid if >>> r.hostAndPath throws an exception of any sort.
>>> The purpose here isn't to completely cure the NPE issue, but to make >>> sure it does not poison the LiftSession, so that further requests can still >>> work. This should also make NPEs recoverable to some extent, in that I >>> believe as long as your CometActor's lifeSpan isn't too low, a subsequent >>> request should be able to catch and properly lead to the appropriate >>> CometActor.
>>> Let me know if there's anything I can clarify, or if anything here seems >>> like it would break things terribly.
>>> Thanks,
>>> Antonio
Last round: this morning during a rather hairy deploy, we saw the kind of load that generally stuffs us full of these NPEs, and not a one was to be seen. Not foolproof evidence, but as far as we can tell, this problem has been put to bed! :) We'll report back if we discover otherwise.
Thanks,
Antonio
On Monday, June 11, 2012 3:57:09 PM UTC-4, Antonio Salazar Cardozo wrote:
> We've been running this for ~16 hours or so on our production server, > everything seems to be running smoothly. Haven't seen any NPEs either, but > that often takes a couple of days after a restart to manifest. The > important thing I was looking for is that nothing seems to have broken :)
> Thanks,
> Antonio
> On Thursday, June 7, 2012 6:38:29 PM UTC-4, Antonio Salazar Cardozo wrote:
>> I've tested this locally and pushed the result to master. It doesn't seem >> to have broken existing comet code, but I have no confirmation that it >> fixes the issues I think it fixes. We can get this on our production >> servers either tomorrow or Monday (depending on a couple of other things) >> once Jenkins has spun a new version of 2.5-SNAPSHOT, if you'd prefer to >> have it battle-tested in at least one place before spinning M1.
>> Thanks,
>> Antonio
>> On Thursday, June 7, 2012 1:52:53 PM UTC-4, David Pollak wrote:
>>> +1
>>> Thank you very, very much for chasing this issue down.
>>> We haven't built M1, so go ahead and push it to master as soon as you >>> want.
>>> On Thu, Jun 7, 2012 at 7:52 AM, Antonio Salazar Cardozo <
>>> savedfastc...@gmail.com> wrote:
>>>> Hey folks,
>>>> In my eternal quest against comet-related NullPointerExceptions, I did >>>> another deep dive of the comet code with some help from a handy jdb stack >>>> trace for one of the aforementioned NPEs. My discovery might help some >>>> things. First, a quick summary of some new discoveries on our part:
>>>> When the server starts hitting NPEs (we've previously found that these >>>> happen when the server is overloaded enough with requests that it starts >>>> hitting requests that have been expired by the container, and therefore are >>>> no longer valid), we noticed that we were a bit hosed if we tried to reload >>>> the page, but if we cleared our cookies things worked significantly better. >>>> This pointed to an issue with the session.
>>>> After investigating the code, I think what's happening is that a given >>>> Lift session is actually poisoned by one of these invalid requests: the >>>> request breaks the session's makeCometBreakoutDecision (by throwing an NPE >>>> when we try to access the request's underlying server name), which in turn >>>> breaks LiftRules._getLiftSession (see >>>> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...), which, finally, breaks any stateful request (see >>>> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...). Because this happens in what is essentially the process that decides >>>> which comet requests to go ahead and drop, we get into a situation where >>>> the broken request is never removed, therefore the LiftSession is >>>> “poisoned”: it is unable to process requests, and unable to get out of the >>>> state where it cannot process requests.
>>>> I'll be the first to admit that my familiarity with the comet code >>>> isn't 100%, so I'm going to propose a solution and see if anyone thinks >>>> it's a catastrophic idea before trying it; first, some details:
>>>> - In LiftRules.makeCometBreakoutDecision, we call >>>> session.cometForHost. This is where things start throwing exceptions (see >>>> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...).
>>>> - This is currently only used in makeCometBreakoutDecision, so its >>>> sole purpose is basically to determine which requests to inspect to see if >>>> they need to be dropped (this mechanism exists so that we don't try to keep >>>> alive more requests than a given browser can support for a given host, by >>>> the way).
>>>> - That list is then inspected and the oldest N comet requests are >>>> dropped, where N is the difference between open requests and allowed >>>> requests for the given browser.
>>>> I propose we change session.cometForHost to return a tuple or case >>>> class. This will represent (a) available, valid comets and (b) comets with >>>> dead requests that need to be dropped immediately. >>>> makeCometBreakoutDecision will then ask the appropriate ContinuationActors >>>> to BreakOut() (a) if they've been deemed invalid and then (b) if there are >>>> too many requests still open after that. We would deem a comet invalid if >>>> r.hostAndPath throws an exception of any sort.
>>>> The purpose here isn't to completely cure the NPE issue, but to make >>>> sure it does not poison the LiftSession, so that further requests can still >>>> work. This should also make NPEs recoverable to some extent, in that I >>>> believe as long as your CometActor's lifeSpan isn't too low, a subsequent >>>> request should be able to catch and properly lead to the appropriate >>>> CometActor.
>>>> Let me know if there's anything I can clarify, or if anything here >>>> seems like it would break things terribly.
>>>> Thanks,
>>>> Antonio
<savedfastc...@gmail.com> wrote:
> Last round: this morning during a rather hairy deploy, we saw the kind of
> load that generally stuffs us full of these NPEs, and not a one was to be
> seen. Not foolproof evidence, but as far as we can tell, this problem has
> been put to bed! :) We'll report back if we discover otherwise.
> Thanks,
> Antonio
> On Monday, June 11, 2012 3:57:09 PM UTC-4, Antonio Salazar Cardozo wrote:
>> We've been running this for ~16 hours or so on our production server,
>> everything seems to be running smoothly. Haven't seen any NPEs either, but
>> that often takes a couple of days after a restart to manifest. The important
>> thing I was looking for is that nothing seems to have broken :)
>> Thanks,
>> Antonio
>> On Thursday, June 7, 2012 6:38:29 PM UTC-4, Antonio Salazar Cardozo wrote:
>>> I've tested this locally and pushed the result to master. It doesn't seem
>>> to have broken existing comet code, but I have no confirmation that it fixes
>>> the issues I think it fixes. We can get this on our production servers
>>> either tomorrow or Monday (depending on a couple of other things) once
>>> Jenkins has spun a new version of 2.5-SNAPSHOT, if you'd prefer to have it
>>> battle-tested in at least one place before spinning M1.
>>> Thanks,
>>> Antonio
>>> On Thursday, June 7, 2012 1:52:53 PM UTC-4, David Pollak wrote:
>>>> +1
>>>> Thank you very, very much for chasing this issue down.
>>>> We haven't built M1, so go ahead and push it to master as soon as you
>>>> want.
>>>> On Thu, Jun 7, 2012 at 7:52 AM, Antonio Salazar Cardozo
>>>> <savedfastc...@gmail.com> wrote:
>>>>> Hey folks,
>>>>> In my eternal quest against comet-related NullPointerExceptions, I did
>>>>> another deep dive of the comet code with some help from a handy jdb stack
>>>>> trace for one of the aforementioned NPEs. My discovery might help some
>>>>> things. First, a quick summary of some new discoveries on our part:
>>>>> When the server starts hitting NPEs (we've previously found that these
>>>>> happen when the server is overloaded enough with requests that it starts
>>>>> hitting requests that have been expired by the container, and therefore are
>>>>> no longer valid), we noticed that we were a bit hosed if we tried to reload
>>>>> the page, but if we cleared our cookies things worked significantly better.
>>>>> This pointed to an issue with the session.
>>>>> After investigating the code, I think what's happening is that a given
>>>>> Lift session is actually poisoned by one of these invalid requests: the
>>>>> request breaks the session's makeCometBreakoutDecision (by throwing an NPE
>>>>> when we try to access the request's underlying server name), which in turn
>>>>> breaks LiftRules._getLiftSession (see
>>>>> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca... >>>>> ), which, finally, breaks any stateful request (see
>>>>> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca... >>>>> ). Because this happens in what is essentially the process that decides
>>>>> which comet requests to go ahead and drop, we get into a situation where the
>>>>> broken request is never removed, therefore the LiftSession is “poisoned”: it
>>>>> is unable to process requests, and unable to get out of the state where it
>>>>> cannot process requests.
>>>>> I'll be the first to admit that my familiarity with the comet code
>>>>> isn't 100%, so I'm going to propose a solution and see if anyone thinks it's
>>>>> a catastrophic idea before trying it; first, some details:
>>>>> - In LiftRules.makeCometBreakoutDecision, we call
>>>>> session.cometForHost. This is where things start throwing exceptions (see
>>>>> https://github.com/lift/framework/blob/master/web/webkit/src/main/sca... >>>>> ).
>>>>> - This is currently only used in makeCometBreakoutDecision, so its
>>>>> sole purpose is basically to determine which requests to inspect to see if
>>>>> they need to be dropped (this mechanism exists so that we don't try to keep
>>>>> alive more requests than a given browser can support for a given host, by
>>>>> the way).
>>>>> - That list is then inspected and the oldest N comet requests are
>>>>> dropped, where N is the difference between open requests and allowed
>>>>> requests for the given browser.
>>>>> I propose we change session.cometForHost to return a tuple or case
>>>>> class. This will represent (a) available, valid comets and (b) comets with
>>>>> dead requests that need to be dropped immediately. makeCometBreakoutDecision
>>>>> will then ask the appropriate ContinuationActors to BreakOut() (a) if
>>>>> they've been deemed invalid and then (b) if there are too many requests
>>>>> still open after that. We would deem a comet invalid if r.hostAndPath throws
>>>>> an exception of any sort.
>>>>> The purpose here isn't to completely cure the NPE issue, but to make
>>>>> sure it does not poison the LiftSession, so that further requests can still
>>>>> work. This should also make NPEs recoverable to some extent, in that I
>>>>> believe as long as your CometActor's lifeSpan isn't too low, a subsequent
>>>>> request should be able to catch and properly lead to the appropriate
>>>>> CometActor.
>>>>> Let me know if there's anything I can clarify, or if anything here
>>>>> seems like it would break things terribly.
>>>>> Thanks,
>>>>> Antonio
On Tue, Jun 12, 2012 at 11:35 AM, Antonio Salazar Cardozo <
savedfastc...@gmail.com> wrote:
> Last round: this morning during a rather hairy deploy, we saw the kind of
> load that generally stuffs us full of these NPEs, and not a one was to be
> seen. Not foolproof evidence, but as far as we can tell, this problem has
> been put to bed! :) We'll report back if we discover otherwise.
> On Monday, June 11, 2012 3:57:09 PM UTC-4, Antonio Salazar Cardozo wrote:
>> We've been running this for ~16 hours or so on our production server,
>> everything seems to be running smoothly. Haven't seen any NPEs either, but
>> that often takes a couple of days after a restart to manifest. The
>> important thing I was looking for is that nothing seems to have broken :)
>> Thanks,
>> Antonio
>> On Thursday, June 7, 2012 6:38:29 PM UTC-4, Antonio Salazar Cardozo wrote:
>>> I've tested this locally and pushed the result to master. It doesn't
>>> seem to have broken existing comet code, but I have no confirmation that it
>>> fixes the issues I think it fixes. We can get this on our production
>>> servers either tomorrow or Monday (depending on a couple of other things)
>>> once Jenkins has spun a new version of 2.5-SNAPSHOT, if you'd prefer to
>>> have it battle-tested in at least one place before spinning M1.
>>> Thanks,
>>> Antonio
>>> On Thursday, June 7, 2012 1:52:53 PM UTC-4, David Pollak wrote:
>>>> +1
>>>> Thank you very, very much for chasing this issue down.
>>>> We haven't built M1, so go ahead and push it to master as soon as you
>>>> want.
>>>> On Thu, Jun 7, 2012 at 7:52 AM, Antonio Salazar Cardozo <
>>>> savedfastc...@gmail.com> wrote:
>>>>> Hey folks,
>>>>> In my eternal quest against comet-related NullPointerExceptions, I did
>>>>> another deep dive of the comet code with some help from a handy jdb stack
>>>>> trace for one of the aforementioned NPEs. My discovery might help some
>>>>> things. First, a quick summary of some new discoveries on our part:
>>>>> When the server starts hitting NPEs (we've previously found that these
>>>>> happen when the server is overloaded enough with requests that it starts
>>>>> hitting requests that have been expired by the container, and therefore are
>>>>> no longer valid), we noticed that we were a bit hosed if we tried to reload
>>>>> the page, but if we cleared our cookies things worked significantly better.
>>>>> This pointed to an issue with the session.
>>>>> After investigating the code, I think what's happening is that a given
>>>>> Lift session is actually poisoned by one of these invalid requests: the
>>>>> request breaks the session's makeCometBreakoutDecision (by throwing an NPE
>>>>> when we try to access the request's underlying server name), which in turn
>>>>> breaks LiftRules._getLiftSession (see https://github.com/lift/** >>>>> framework/blob/master/web/**webkit/src/main/scala/net/**
>>>>> liftweb/http/LiftRules.scala#**L293<https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...>), which, finally, breaks any stateful request (see
>>>>> https://github.com/lift/**framework/blob/master/web/** >>>>> webkit/src/main/scala/net/**liftweb/http/LiftServlet.**scala#L273<https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...>). Because this happens in what is essentially the process that decides
>>>>> which comet requests to go ahead and drop, we get into a situation where
>>>>> the broken request is never removed, therefore the LiftSession is
>>>>> “poisoned”: it is unable to process requests, and unable to get out of the
>>>>> state where it cannot process requests.
>>>>> I'll be the first to admit that my familiarity with the comet code
>>>>> isn't 100%, so I'm going to propose a solution and see if anyone thinks
>>>>> it's a catastrophic idea before trying it; first, some details:
>>>>> - In LiftRules.**makeCometBreakoutDecision, we call
>>>>> session.cometForHost. This is where things start throwing exceptions (see
>>>>> https://github.com/lift/**framework/blob/master/web/** >>>>> webkit/src/main/scala/net/**liftweb/http/LiftSession.**scala#L615<https://github.com/lift/framework/blob/master/web/webkit/src/main/sca...>).
>>>>> - This is currently only used in makeCometBreakoutDecision, so its
>>>>> sole purpose is basically to determine which requests to inspect to see if
>>>>> they need to be dropped (this mechanism exists so that we don't try to keep
>>>>> alive more requests than a given browser can support for a given host, by
>>>>> the way).
>>>>> - That list is then inspected and the oldest N comet requests are
>>>>> dropped, where N is the difference between open requests and allowed
>>>>> requests for the given browser.
>>>>> I propose we change session.cometForHost to return a tuple or case
>>>>> class. This will represent (a) available, valid comets and (b) comets with
>>>>> dead requests that need to be dropped immediately.
>>>>> makeCometBreakoutDecision will then ask the appropriate ContinuationActors
>>>>> to BreakOut() (a) if they've been deemed invalid and then (b) if there are
>>>>> too many requests still open after that. We would deem a comet invalid if
>>>>> r.hostAndPath throws an exception of any sort.
>>>>> The purpose here isn't to completely cure the NPE issue, but to make
>>>>> sure it does not poison the LiftSession, so that further requests can still
>>>>> work. This should also make NPEs recoverable to some extent, in that I
>>>>> believe as long as your CometActor's lifeSpan isn't too low, a subsequent
>>>>> request should be able to catch and properly lead to the appropriate
>>>>> CometActor.
>>>>> Let me know if there's anything I can clarify, or if anything here
>>>>> seems like it would break things terribly.
>>>>> Thanks,
>>>>> Antonio