Hi All,
I believe I tracked down the root cause of this issue:
It is the interaction of the code between ProjectQoSFilter and RequestContextFilter. The relevant events and threads involved in handling git over http request are:
* Threads:
(HTTP-T1)- is the Jetty HTTP request thread that is invoked when original HTTP request is made
(SSH-T1) - is any free thread from SSH thread pool
(HTTP-T2) - is the Jetty HTTP request thread where the HTTP request is actually process when continuation is resumed.
* Steps
1. (HTTP-T1) RequestContextFilter.doFilter is invoked, which in turn calls chain.doFilter(request, response);
2. (HTTP-T1) ProjectQoSFilter.doFilter is invoked, which suspends the request and queues task for Ssh thread pool (see step #4)
As there will be 3 threads involed. I'll go with the route where the original http request thread (HTTP-T1) unwinds faster than continuation thread (SSH-T1) becomes available. As this is the case where the client will receive the internal server error and makes the explanation simpler. However, strictly speaking the order of the events could be different, due to multithreading.
3. (HTTP-T1) The control returns to RequestContextFilter.doFilter where RequestCleanup.run is called
4. (SSH-T1) ProjectQoSFilter.TaskThunk.run is executed, it calls Jetty instructing it to resume the continuation and blocks the thread waiting on "lock" object.
5. (HTTP-T2) Jetty executes resumes the original HTTP request.
5a. (HTTP-T2) RequestContextFilter.doFilter is invoked, which in turn calls chain.doFilter(request, response);
5b. (HTTP-T2) ProjectQoSFilter.doFilter is invoked, which detects that this is a resumed continuation and passed execution onto the next filter in the chain.
* This is the code path that triggers the exception
5c. (HTTP-T2) AsyncReceiveCommits.<init> calls ReceiveCommits.Factory.create which requires Guice to call RequestScopedReviewDbProvider.get. This in turn gets a hold of the current instance of RequestCleanup and calls add() on it. However because RequestContextFilter already invoked RequestCleanup.run in (step #3), the exception is thrown
5d. (HTTP-T2) ProjectQoSFilter.doFilter final block is invoked which signals "lock" object and returns.
5e. (SSH-T1) TaskThunk.run returns from waiting on "lock" object and completes, thus freeing up the ssh thread
Now onto solutions:
* The ProjectQoSFilter is bound in the Daemon.createWebInjector method only in case where ssh daemon is enabled. Therefore, one can solve this issue by passing the --disable-sshd switch when starting gerrit. Which is what I used in my case as I have no need for ssh.
As for proper fix, I would defer this to Gerrit experts, but if I had to do it I would either:
* Rearranged the order of filters so that ProjectQoSFilter would execute before the filter responsible for clean up, but this can get tricky as ProjectQoSFilter uses current user information to determine which queue to use, which means that authentication/session filters had to have been executed before.
Or a more practical approach would be to:
* Alter RequestContextFilter.doFilter clean up logic to prevent it from running cleanup in the thread where request was suspended.
} finally {
cleanup.get().run();
}
to something like:
} finally {
final Continuation cont = ContinuationSupport.getContinuation(request);
if (!cont.isSuspended()) {
cleanup.get().run();
}
}
Hope this helps,
Vladimir