Problems with Netty's MultithreadEventExecutorGroup


Mike Hobbs

Jan 22, 2015, 2:49:52 PM
to gremli...@googlegroups.com, Andrew Guldman
While tracking down a problem I had with HTTP keepalive connections, I uncovered a problem with Netty's MultithreadEventExecutorGroup (via DefaultEventExecutorGroup).

First, let me describe the problem I was having with HTTP keepalive. When using keepalive connections, I discovered that script execution would hang once the number of requests reached the value of gremlinPool. To easily reproduce the problem, just set gremlinPool to 2 in gremlin-server-rest-modern.yaml and then run this curl command:
$ curl "http://localhost:8182?gremlin=5" "http://localhost:8182?gremlin=5"
The second request will hang for 30 seconds and then return an error.

What happens is that GremlinExecutor.eval() calls CompletableFuture.supplyAsync() with a DefaultEventExecutorGroup as the executorService, and then HttpGremlinEndpointHandler waits for the returned future to complete. If you dig into the details of how DefaultEventExecutorGroup works, you'll see that its parent, MultithreadEventExecutorGroup, contains an array of SingleThreadEventExecutors -- and the gremlinPool value is used to configure the size of that array. When the MultithreadEventExecutorGroup.execute() method is called, it grabs the next SingleThreadEventExecutor out of the array in a round-robin fashion and invokes its execute() method. However, one of those SingleThreadEventExecutors is the one currently running HttpGremlinEndpointHandler.

In the example above, where gremlinPool = 2, the array will have 2 EventExecutors and HttpGremlinEndpointHandler will be running in one of them. When the first request comes in, the GremlinExecutor executes on the next EventExecutor and everything returns fine. When the next request comes in, MultithreadEventExecutorGroup blindly calls execute() on the next EventExecutor, which is already running HttpGremlinEndpointHandler, so the script is queued behind the very handler that is waiting for it, and everything halts until the future times out. When using a larger gremlinPool value, it takes several more requests to get around to HttpGremlinEndpointHandler's EventExecutor, but it does eventually happen.
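The hang described above can be reproduced without Netty or Gremlin Server at all. The following is only a minimal sketch using plain java.util.concurrent: RoundRobinGroup is a hypothetical stand-in for MultithreadEventExecutorGroup (an array of single-thread executors picked round-robin), the "handler" is pinned to the first executor the way a keepalive connection stays on one event loop, and the 500 ms timeout stands in for the 30-second script timeout. None of these names come from the real code base.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

final class RoundRobinHangDemo {
    // Hypothetical stand-in for MultithreadEventExecutorGroup:
    // N single-thread executors, handed out round-robin.
    static final class RoundRobinGroup {
        final ExecutorService[] children;
        final AtomicInteger index = new AtomicInteger();

        RoundRobinGroup(int n) {
            children = new ExecutorService[n];
            for (int i = 0; i < n; i++) {
                children[i] = Executors.newSingleThreadExecutor();
            }
        }

        ExecutorService next() {
            return children[Math.abs(index.getAndIncrement() % children.length)];
        }

        void shutdown() {
            for (ExecutorService child : children) {
                child.shutdownNow();
            }
        }
    }

    // Simulates one keepalive connection: the handler stays on a single
    // executor and, for each request, submits an "eval" to group.next()
    // and blocks until the result arrives or the timeout expires.
    static String handle(int poolSize, int requests) throws Exception {
        RoundRobinGroup group = new RoundRobinGroup(poolSize);
        try {
            ExecutorService handlerThread = group.next(); // handler pinned here
            return CompletableFuture.supplyAsync(() -> {
                for (int i = 1; i <= requests; i++) {
                    CompletableFuture<Integer> eval =
                            CompletableFuture.supplyAsync(() -> 5, group.next());
                    try {
                        eval.get(500, TimeUnit.MILLISECONDS);
                    } catch (TimeoutException e) {
                        // The eval was queued behind this very handler.
                        return "request " + i + " timed out";
                    } catch (Exception e) {
                        return "request " + i + " failed: " + e;
                    }
                }
                return "all " + requests + " requests completed";
            }, handlerThread).get();
        } finally {
            group.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(handle(2, 2)); // prints "request 2 timed out"
        System.out.println(handle(1, 1)); // prints "request 1 timed out"
    }
}
```

With a pool of 2, the second request lands on the handler's own executor and times out; with a pool of 1, the very first request does.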

The way I've gotten around the problem is to avoid keepalive connections, which is not exactly optimal. Opening a new connection resets the HttpGremlinEndpointHandler's position in the round-robin array and prevents it from being selected as the next EventExecutor.

Offhand, I can suggest a few different ways to work around the problem within Gremlin's code base: One option is to use different executorServices for GremlinExecutor and HttpGremlinEndpointHandler. (In fact, setting gremlinPool to 1 makes every request hang, since they're both trying to execute on the same thread.) Another option is to eval scripts in-line, without submitting them to the executorService -- but that makes it difficult to time out the scripts. Lastly, I experimented with subclassing the DefaultEventExecutorGroup so that the round-robin selection first checks the EventExecutors to make sure that they are idle before trying to use them. That seemed to work well enough in my [minimal] testing:

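import io.netty.util.concurrent.DefaultEventExecutorGroup;
import io.netty.util.concurrent.EventExecutor;
import java.util.concurrent.ThreadFactory;

class GremlinEventExecutorGroup extends DefaultEventExecutorGroup {
    GremlinEventExecutorGroup(int nThreads, ThreadFactory threadFactory) {
        super(nThreads, threadFactory);
    }

    @Override
    public EventExecutor next() {
        EventExecutor firstChoice = super.next();
        // inEventLoop() is true when the calling thread is this executor's own
        // thread. Skipping such an executor keeps the caller from queueing work
        // behind itself:
        if (!firstChoice.inEventLoop()) {
            return firstChoice;
        }
        while (true) {
            EventExecutor next = super.next();
            if (next == firstChoice) {
                // Wrapped all the way around without finding another executor.
                // Return this one and hope for the best:
                return next;
            }
            if (!next.inEventLoop()) {
                return next;
            }
        }
    }
}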
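
For comparison, the first option (separate executorServices) might look roughly like the sketch below. It is again plain java.util.concurrent with made-up names (handlerThread, evalPool) and is not the fix that went into Gremlin Server. The point is only that a dedicated evaluation pool never contains the handler's thread, so even a pool of size 1 cannot queue a script behind the handler waiting for it, and the per-script timeout still works.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class SeparateEvalPoolDemo {
    // Simulates one keepalive connection whose handler blocks on each
    // script evaluation, with evaluation done on a dedicated pool.
    static String handle(int requests) throws Exception {
        // Stands in for the event loop that runs the HTTP handler.
        ExecutorService handlerThread = Executors.newSingleThreadExecutor();
        // Hypothetical dedicated pool for script evaluation (size 1 on purpose).
        ExecutorService evalPool = Executors.newSingleThreadExecutor();
        try {
            return CompletableFuture.supplyAsync(() -> {
                for (int i = 1; i <= requests; i++) {
                    try {
                        // The timeout still applies to each evaluation.
                        CompletableFuture.supplyAsync(() -> 5, evalPool)
                                .get(500, TimeUnit.MILLISECONDS);
                    } catch (Exception e) {
                        return "request " + i + " failed: " + e;
                    }
                }
                return "all " + requests + " requests completed";
            }, handlerThread).get();
        } finally {
            handlerThread.shutdownNow();
            evalPool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(handle(5)); // prints "all 5 requests completed"
    }
}
```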


Thanks,
- Mike




Stephen Mallette

Jan 23, 2015, 7:15:06 AM
to gremli...@googlegroups.com, Andrew Guldman
Thanks for the deep analysis there.  I started playing with this a little bit to see how things worked for myself.  I'll post back here when there is some solution to the problem in place.


Stephen Mallette

Jan 25, 2015, 5:21:47 PM
to gremli...@googlegroups.com, Andrew Guldman
Mike, I pushed a fix for this issue.  I totally reworked the HttpGremlinEndpointHandler and changed how the Netty pipeline was being initialized.  I have some more tests to do to see how my changes affected the websockets side, but I think the REST endpoint is better for this work.  If you get a chance to try it out, please let me know if it works better for you.

Thanks again for your detailed analysis of the problem you were facing - it put me on a much faster track to coming up with a solution.

Best regards,

Stephen

Andrew Guldman

Jan 26, 2015, 10:35:25 AM
to gremli...@googlegroups.com, agul...@fluid.com
Hi Stephen,

Thanks for the quick fix. I assume that it is available in the snapshot builds now, right? In which release will it be included? When is that release expected?

Cheers,
Andrew

Stephen Mallette

Jan 26, 2015, 10:48:52 AM
to gremli...@googlegroups.com
You would have to build it from source, though there should be a SNAPSHOT published, assuming the Travis build succeeded. This fix would be part of M8, and I'm not sure that we have a hard date for that at this point.

Mike Hobbs

Jan 26, 2015, 12:17:56 PM
to gremli...@googlegroups.com, agul...@fluid.com
Yep, I tried some basic testing and it's working much better. Thanks. (Our code is currently based off of M6 classes, so there will need to be some extra migration there before I'm able to do a full test, but I don't expect any issues.)

Thanks again,
- Mike