While tracking down a problem I had with HTTP keepalive connections, I uncovered a problem with Netty's MultithreadEventExecutorGroup (via DefaultEventExecutorGroup).
First, let me describe the problem I was having with HTTP keepalive. When using keepalive connections, I discovered that script execution would hang once the number of requests reached the value of gremlinPool. To easily reproduce the problem, just set gremlinPool to 2 in gremlin-server-rest-modern.yaml and then run this curl command:
$ curl http://localhost:8182?gremlin=5 http://localhost:8182?gremlin=5
The second request will hang for 30 seconds and then return an error.
What happens is that GremlinExecutor.eval() calls CompletableFuture.supplyAsync() with a DefaultEventExecutorGroup as the executorService, and then HttpGremlinEndpointHandler blocks waiting for the returned future to complete. If you dig into how DefaultEventExecutorGroup works, you'll see that its parent, MultithreadEventExecutorGroup, contains an array of SingleThreadEventExecutors, and the gremlinPool value determines the size of that array. When MultithreadEventExecutorGroup's execute() method is called, it grabs the next SingleThreadEventExecutor from the array in round-robin fashion and invokes its execute() method. However, one of those SingleThreadEventExecutors is the one currently running HttpGremlinEndpointHandler.

In the example above, where gremlinPool = 2, the array holds 2 EventExecutors and HttpGremlinEndpointHandler is running in one of them. When the first request comes in, GremlinExecutor executes on the other EventExecutor and everything returns fine. But when the next request comes in, MultithreadEventExecutorGroup blindly calls execute() on the next EventExecutor in the rotation -- the one already occupied by HttpGremlinEndpointHandler -- so everything halts while HttpGremlinEndpointHandler waits for the future to time out. With a larger gremlinPool value it takes several more requests for the rotation to come back around to HttpGremlinEndpointHandler's EventExecutor, but it does eventually happen.
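The pattern is easy to reproduce outside of Netty. Here's a rough, JDK-only analogue -- no Netty classes involved; the round-robin next() and the blocking "handler" task are hypothetical stand-ins for MultithreadEventExecutorGroup and HttpGremlinEndpointHandler -- that shows the same self-submission deadlock:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobinDeadlockDemo {
    // Stand-in for MultithreadEventExecutorGroup: round-robin over single-thread executors.
    static final ExecutorService[] pool = {
            Executors.newSingleThreadExecutor(),
            Executors.newSingleThreadExecutor()
    };
    // Start at 1 so the first eval lands on pool[1], as in the two-request example above.
    static final AtomicInteger idx = new AtomicInteger(1);

    static ExecutorService next() {
        return pool[Math.abs(idx.getAndIncrement() % pool.length)];
    }

    static String handler() throws Exception {
        // The "handler" occupies pool[0] and blocks on futures that the round-robin
        // selection may schedule right back onto pool[0].
        return pool[0].submit(() -> {
            CompletableFuture<String> first = CompletableFuture.supplyAsync(() -> "ok", next());
            first.get(2, TimeUnit.SECONDS);            // lands on pool[1]: completes fine
            CompletableFuture<String> second = CompletableFuture.supplyAsync(() -> "ok", next());
            try {
                second.get(2, TimeUnit.SECONDS);       // lands on pool[0], which we occupy
                return "completed";
            } catch (TimeoutException e) {
                return "timed out";                    // the eval can never run while we wait on it
            }
        }).get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(handler());                 // prints "timed out"
        for (ExecutorService e : pool) e.shutdownNow();
    }
}
```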
The way I've gotten around the problem is to avoid keepalive connections, which is not exactly optimal. Opening a new connection resets the HttpGremlinEndpointHandler's position in the round-robin array and prevents it from being selected as the next EventExecutor.
Offhand, I can suggest a few different ways to work around the problem within Gremlin's code base. One option is to use separate executorServices for GremlinExecutor and HttpGremlinEndpointHandler. (In fact, setting gremlinPool to 1 makes every request hang, since they're both trying to execute on the same thread.) Another option is to eval scripts in-line, without submitting them to the executorService -- but that makes it difficult to time out the scripts. Lastly, I experimented with subclassing DefaultEventExecutorGroup so that the round-robin selection skips any EventExecutor that the calling thread is currently running on. That seemed to work well enough in my [minimal] testing:
class GremlinEventExecutorGroup extends DefaultEventExecutorGroup {
    GremlinEventExecutorGroup(int nThreads, ThreadFactory threadFactory) {
        super(nThreads, threadFactory);
    }

    @Override
    public EventExecutor next() {
        EventExecutor firstChoice = super.next();
        // inEventLoop() is true when the calling thread IS this executor's thread,
        // i.e. we are about to hand back the executor we are currently running on.
        if (!firstChoice.inEventLoop()) {
            return firstChoice;
        }
        // Cycle through the rest of the group for an executor that doesn't
        // belong to the calling thread:
        while (true) {
            EventExecutor next = super.next();
            if (next == firstChoice) {
                // Came all the way around without finding another candidate.
                // Return this one and hope for the best:
                return next;
            }
            if (!next.inEventLoop()) {
                return next;
            }
        }
    }
}
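For what it's worth, the effect of that guard can be sketched without Netty too. In this JDK-only analogue (hypothetical names throughout; an owner-thread check stands in for Netty's EventExecutor.inEventLoop()), skipping the caller's own executor lets the blocked-handler scenario from earlier complete instead of timing out:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class GuardedRoundRobinDemo {
    static final int N = 2;
    static final Thread[] owner = new Thread[N];              // thread backing each executor
    static final ExecutorService[] pool = new ExecutorService[N];
    static {
        for (int i = 0; i < N; i++) {
            final int id = i;
            // Record each executor's thread as the factory creates it.
            pool[i] = Executors.newSingleThreadExecutor(r -> owner[id] = new Thread(r));
        }
    }
    static final AtomicInteger idx = new AtomicInteger();

    // Poor man's EventExecutor.inEventLoop(): is the calling thread this executor's thread?
    static boolean inEventLoop(int i) {
        return Thread.currentThread() == owner[i];
    }

    // Round-robin, but skip the executor the calling thread is running on.
    static ExecutorService next() {
        for (int tries = 0; tries < N; tries++) {
            int i = Math.abs(idx.getAndIncrement() % N);
            if (!inEventLoop(i)) {
                return pool[i];
            }
        }
        // No other choice; return the next one and hope for the best:
        return pool[Math.abs(idx.getAndIncrement() % N)];
    }

    static String handler() throws Exception {
        return pool[0].submit(() -> {
            // Both evals now avoid pool[0], the executor we occupy, so neither blocks.
            CompletableFuture.supplyAsync(() -> "ok", next()).get(2, TimeUnit.SECONDS);
            CompletableFuture.supplyAsync(() -> "ok", next()).get(2, TimeUnit.SECONDS);
            return "completed";
        }).get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(handler());                        // prints "completed"
        for (ExecutorService e : pool) e.shutdownNow();
    }
}
```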
Thanks,
- Mike