Race condition in asynchronous servlet handling


David Waern

May 8, 2017, 9:58:06 AM
to Lift
Hi,

My client has an application running Lift 3.0.1 and Jetty 9.2.2. They have a problem with requests that "hang" for a very long time before they are completed with empty responses.

I think this could be due to a race condition in the handleGenericContinuation method of the LiftServlet class. This method (directly and indirectly) calls the suspend and resume methods of the ServletAsyncProvider interface, but does not guarantee that suspend is called before resume. As far as I can see, the order of the calls can depend on thread scheduling.
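
To make the pattern concrete, here is a stripped-down sketch of the ordering problem as I understand it (illustrative only -- the names are made up and this is not the actual LiftServlet code): the response function is scheduled a few milliseconds out and calls resume when the answer is ready, while the original request thread still has to reach the suspend call.

  import java.util.concurrent.{Executors, TimeUnit}

  object SuspendResumeRace {
    // Stand-ins for the ServletAsyncProvider calls; in the real code these
    // talk to the container's continuation / AsyncContext.
    @volatile private var suspended = false

    def suspend(): Unit = { suspended = true; println("suspended") }

    def resume(): Unit =
      if (!suspended) println("resume before suspend -> the container call fails")
      else println("resume after suspend -> response delivered")

    def main(args: Array[String]): Unit = {
      val scheduler = Executors.newScheduledThreadPool(1)

      // The response function is scheduled ~5ms out; our answers are ready
      // immediately, so it calls resume as soon as it runs.
      scheduler.schedule(new Runnable { def run(): Unit = resume() },
                         5, TimeUnit.MILLISECONDS)

      // Meanwhile the request thread still has to reach the suspend call.
      // If it loses the CPU for more than ~5ms, resume wins the race.
      Thread.sleep(10) // simulate being context-switched out
      suspend()

      scheduler.shutdown()
    }
  }

With this timing, resume runs before suspend; in the real servlet APIs that is where the exceptions show up instead of just a log line.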

This seems to cause problems both when using the default Servlet30AsyncProvider class and when using the Jetty7AsyncProvider class. With both classes we see requests being suspended but never resumed, causing Jetty to time them out after the configured timeout (2 minutes in our case) and then re-schedule them.
Using Servlet30AsyncProvider additionally causes NullPointerExceptions.

Side note:
When requests are re-scheduled, they are processed again from scratch. However, when using Jetty7AsyncProvider the re-processed result is discarded and an empty result is returned to the client, because the suspend method does not suspend the request when the "expired" flag is raised -- I don't know if this also happens with Servlet30AsyncProvider.
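
Roughly, what I believe happens on the second pass is something like the following paraphrase (my reading of the behaviour against the Jetty 7 continuation API, not the actual Lift source):

  import org.eclipse.jetty.continuation.Continuation

  object SuspendSketch {
    // Paraphrased sketch of the behaviour described above, under the assumption
    // that the provider checks the continuation's expired flag before suspending.
    def suspendIfPossible(continuation: Continuation, timeoutMillis: Long): Boolean =
      if (continuation.isExpired) {
        // Re-scheduled pass after the timeout: the request is not suspended
        // again, so the freshly computed result is never delivered and the
        // client ends up with an empty response.
        false
      } else {
        continuation.setTimeout(timeoutMillis)
        continuation.suspend()
        true
      }
  }

So on the first pass the request gets stuck because resume already failed, and on the re-scheduled pass the result is computed again but thrown away.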

This impacts both our production and test environments, but it is much more visible in the test environment, presumably because its weaker CPU does more context switching and therefore triggers the race condition more often.

As a workaround we have stopped using the asynchronous response feature of Lift and switched to normal synchronous responses.
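
For reference, the change amounts to something like this (a simplified REST example for illustration; the endpoint names are made up and our real handlers are more involved):

  import net.liftweb.http.PlainTextResponse
  import net.liftweb.http.rest.{RestContinuation, RestHelper}

  object ExampleRest extends RestHelper {
    serve {
      // Before: asynchronous response, even though the result is in practice
      // already available by the time we get here.
      case "async-example" :: Nil Get _ =>
        RestContinuation.async { satisfyRequest =>
          satisfyRequest(PlainTextResponse("already computed"))
        }

      // After (workaround): return the response directly on the request thread.
      case "sync-example" :: Nil Get _ =>
        PlainTextResponse("already computed")
    }
  }

This sidesteps the suspend/resume path entirely, at the cost of tying up the request thread for the (short) duration of the work.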

I've seen various traces of other people having similar problems in mailing list threads and issue trackers, but I haven't found an issue for this in the GitHub bug tracker, so I'm reporting it here.

Thanks,
David

Antonio Salazar Cardozo

May 8, 2017, 10:17:27 AM
to Lift
Hmmm, so the way I see this happening is if it takes >5ms for the original request-
handling thread to go from line 117 to line 124 in LiftServlet (because we first try
fetching the response 5ms after 117 runs) and the 5ms-delayed scheduled item
in runFunction is scheduled ahead of the next line in LiftServlet after those 5ms
have concluded. While I can see this being viable, I find it somewhat unlikely…
Did you insert tracing and observe that this is definitely what was happening?

Please keep in mind that it is normal for comet requests to do nothing for 2 minutes
and then re-issue if there is nothing to push to the client. That is the long-poll
nature of Lift's comet interactions. I realize you may already know that, but want
to make sure to cover all possibilities :)

If, however, this is indeed what you're seeing, and this is the root of the NPEs we've
been seeing for some time… Let's just say I'm going to be pretty excited, as fixing those
for good would be ideal.

Andreas may also be able to chime in, since he's been seeing those NPEs recently if
I remember correctly.
Thanks,
Antonio

David Waern

May 8, 2017, 11:27:51 AM
to Lift


On May 8, 2017, at 11:18, David Waern <david...@gmail.com> wrote:

Hi Antonio,


On Monday, 8 May 2017 16:17:27 UTC+2, Antonio Salazar Cardozo wrote:
Hmmm, so the way I see this happening is if it takes >5ms for the original request-
handling thread to go from line 117 to line 124 in LiftServlet (because we first try
fetching the response 5ms after 117 runs) and the 5ms-delayed scheduled item
in runFunction is scheduled ahead of the next line in LiftServlet after those 5ms
have concluded. While I can see this being viable, I find it somewhat unlikely…
Did you insert tracing and observe that this is definitely what was happening?


It may seem unlikely to happen often, but all it takes is a context switch away from the original thread right before suspend is called, with execution not returning to that thread before resume is called. In our case the hanging requests happen only on pages which trigger many requests to the server. Also, in our application, the processing of a request is often already finished by the time the asynchronous response mechanism is invoked, which means the runFunction has very little to do (of course, we shouldn't use asynchronous responses in this case, but there is a historical explanation for why our code behaves like this -- let's just pretend we're stress-testing Lift).

I have inserted debug traces and also attached a debugger to the server, and everything seems to indicate that resume is called before suspend. For example, I can see in the debugger that an exception is caught inside Lift's resume method (in the Jetty7AsyncProvider case; in the Servlet30AsyncProvider case it is thrown but not caught until it reaches the Schedule class). In the Jetty7AsyncProvider case the exception is thrown by the underlying resume method of the Jetty 7 continuation API, and when using Servlet30AsyncProvider I can see a NullPointerException thrown by the complete method of the underlying Servlet 3.0 API. Due to the multiple threads involved it is very hard to be certain of the real order of events (and even harder to use the debugger to determine it), so I cannot say with 100% certainty that this race condition really causes the problem, but I cannot see why it would not be a problem in the long run, since nothing in the code guarantees the correct ordering of events.

Please keep in mind that it is normal for comet requests to do nothing for 2 minutes
and then re-issue if there is nothing to push to the client. That is the long-poll
nature of Lift's comet interactions. I realize you may already know that, but want
to make sure to cover all possibilities :)


Right, and I guess that mechanism works by resume never being called at all, because an asynchronous computation has not finished yet? So I think I understand how that is supposed to work (although I'm very new to the Lift framework). But what I'm seeing is that the response function is called immediately (we're not dealing with long-running computations, only simple database queries that return quickly -- in fact, like I said, they have already finished before the asynchronous response mechanism is triggered). I can see from debug traces that the response function triggers the resume method of the underlying servlet API, and I can confirm with a breakpoint that it is indeed called but throws an exception (presumably because suspend has not yet been called). This happens immediately, long before the timeout. The timeout then follows from all of this, since a successful resume never happens.

If, however, this is indeed what you're seeing, and this is the root of the NPEs we've
been seeing for some time… Let's just say I'm going to be pretty excited, as fixing those
for good would be ideal.

 
I'm happy that you think this may be the root cause of those. Like I said, I've seen the mailing list threads about them. The NPEs were the first thing I investigated, and I managed to get rid of them by switching from Servlet30AsyncProvider to Jetty7AsyncProvider -- but that didn't get rid of the hanging requests, and I think I now understand why.

Thanks,
David

Andreas Joseph Krogh

May 8, 2017, 11:32:38 AM
to lif...@googlegroups.com
On Monday, 8 May 2017 at 16:17:27, Antonio Salazar Cardozo <savedf...@gmail.com> wrote:
Hmmm, so the way I see this happening is if it takes >5ms for the original request-
handling thread to go from line 117 to line 124 in LiftServlet (because we first try
fetching the response 5ms after 117 runs) and the 5ms-delayed scheduled item
in runFunction is scheduled ahead of the next line in LiftServlet after those 5ms
have concluded. While I can see this being viable, I find it somewhat unlikely…
Did you insert tracing and observe that this is definitely what was happening?
 
Please keep in mind that it is normal for comet requests to do nothing for 2 minutes
and then re-issue if there is nothing to push to the client. That is the long-poll
nature of Lift's comet interactions. I realize you may already know that, but want
to make sure to cover all possibilities :)
 
If, however, this is indeed what you're seeing, and this is the root of the NPEs we've
been seeing for some time… Let's just say I'm going to be pretty excited, as fixing those
for good would be ideal.
 
Andreas may also be able to chime in, since he's been seeing those NPEs recently if
I remember correctly.
Thanks,
Antonio
 
Hi.
 
Under heavy load it's more than likely that Schedule.schedule() might block (if I'm reading it correctly), waiting for available resources, which would produce the >5ms window.
I've quite often seen "hangs" I cannot explain, but never got around to debugging where they come from. I hope this might be it, so we can get it fixed.
 
Note: we use Tomcat (8.5.11) with Servlet30AsyncProvider.
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 

Antonio Salazar Cardozo

May 8, 2017, 1:47:42 PM
to Lift
Awesome, thanks for the breakdown and additional details David! If you get a chance,
please file an issue in Github; otherwise, I'll try and do that when I can. I'll also try to see
if I can get this sorted by the time 3.1.0 goes out the door… Maybe I'll even get a chance
to sneak it into the next milestone build.
Thanks,
Antonio

David Waern

May 9, 2017, 3:51:52 AM
to Lift


On Monday, 8 May 2017 19:47:42 UTC+2, Antonio Salazar Cardozo wrote:
Awesome, thanks for the breakdown and additional details David! If you get a chance,
please file an issue in Github; otherwise, I'll try and do that when I can. I'll also try to see
if I can get this sorted by the time 3.1.0 goes out the door… Maybe I'll even get a chance
to sneak it into the next milestone build.
Thanks,
Antonio


No problem, I filed an issue on GitHub: https://github.com/lift/framework/issues/1848

Thanks,
David

Antonio Salazar Cardozo

May 10, 2017, 1:29:10 PM
to Lift
Thanks, David! I've dropped a milestone on it, and I should hopefully get a
chance to look at it in the next couple of weeks.
Thanks,
Antonio