Gremlin server waits on "writeBufferHighWaterMark exceeded" forever


Tunay Gur

Mar 5, 2017, 5:44:42 PM
to gremli...@googlegroups.com
Hi Gremlin users, 

I have a Gremlin Server that serves a fairly dense graph, and I am running load tests against it with a query that produces fairly big results (it is computationally expensive and returns a large number of vertices and edges):

g.V().has("User", "uuid",<some_uuid>).repeat(__.bothE().subgraph("subGraph").otherV()).times(3).cap("subGraph").next()

After some time into the load test I see the following warning from all gremlin-server-exec threads and server becomes unresponsive:

05 Mar 2017 20:00:18,419 166886789 [gremlin-server-exec-20] WARN  org.apache.tinkerpop.gremlin.server.op.traversal.TraversalOpProcessor  - Pausing response writing as writeBufferHighWaterMark exceeded on RequestMessage{, requestId=2353be78-c157-4194-807f-5ec002f86138, op='bytecode', processor='traversal', args={gremlin=[[], [V(), has(User, uuid, 7331c498-f2c1-4d44-bf3a-2410e1640235)]], aliases={g=g}}} - writing will continue once client has caught up

 I believe they go into the following loop: 


My theory is that, for some reason, some of the clients drop the connection before consuming the response from the server. However, the server threads (Netty) still think these are active channels and keep busy-waiting forever on channels that no one consumes from.

Has anyone experienced something similar to this before? Any suggestions on what I might be doing wrong?


Thanks 
Tunay


Stephen Mallette

Mar 6, 2017, 6:55:46 AM
to Gremlin-users
I'm not sure how Gremlin Server would get into that state - I've not witnessed that myself. When I've seen that message in testing, the pause is eventually lifted and responses start to flow again. I think you'd have to come up with some way to reproduce the problem.

--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-users+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/CAJKwzMDPkrHhqKDBBSHLn-wWnwK7z5jHM3ck4A6OR13dgCOOhw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Mauricio Pradilla

Jun 5, 2017, 10:19:08 AM
to Gremlin-users
Hi Tunay,

I am running into this issue while querying the graph with gremlinpython. Do you have any suggestions for tackling it? I tried increasing the writeBufferHighWaterMark size in the server configuration, with no success...
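For reference, the buffer thresholds are top-level keys in the Gremlin Server YAML file. A minimal fragment might look like this (the values are illustrative defaults, not recommendations - verify the key names against the server reference docs for your version):

```yaml
# gremlin-server.yaml (fragment)
writeBufferHighWaterMark: 65536   # bytes pending; above this, writes pause (the warning in this thread)
writeBufferLowWaterMark: 32768    # bytes pending; below this, writes resume
```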

I appreciate any info!

thanks

Stephen Mallette

Jun 5, 2017, 10:33:03 AM
to Gremlin-users
What do you do to get the server into that state? Is it long-running traversals? Traversals that return big results? Lots of concurrent requests?


Mauricio Pradilla

Jun 5, 2017, 11:38:51 AM
to Gremlin-users
Hi Stephen,

the query I am running is the following:

g.V(xyz).in().has('isrc') \
    .groupCount().by("isrc") \
    .order(local).by(values, decr).limit(local, 10).toList()


- For one scenario with 150 vertices matching the group condition, gremlinpython works well. The trouble starts with larger groups - particularly one with 9,000 items.
- The response is always just the top 10, sorted descending.

- The python client using gremlin_python gets the following exception:

 File "/Users/.../venv/lib/python2.7/site-packages/gremlin_python/driver/driver_remote_connection.py", line 214, in receive
    recv_message = json.loads(recv_message.decode('utf-8'))
AttributeError: 'NoneType' object has no attribute 'decode'
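That AttributeError is the driver calling `.decode()` on a `None` frame, which is what some websocket clients return once the connection has been closed. A guard like the following would turn it into a clearer error (an illustrative sketch only - `decode_response` is a hypothetical name, not the actual gremlin_python code):

```python
import json

def decode_response(recv_message):
    # A None frame from the websocket read means the peer closed the
    # connection; calling .decode() on it raises the AttributeError above.
    if recv_message is None:
        raise ConnectionError("websocket closed before the response arrived")
    return json.loads(recv_message.decode('utf-8'))
```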


Maybe I am trying to solve my problem the wrong way...


thank you! 

David Brown

Jun 7, 2017, 11:37:41 AM
to Gremlin-users
I feel like I have seen this before sometime in the murky past... I wonder if this relates to limitations of the old client implementation, which was highly limited and really just served to help get gremlin-python on its feet. Have you tried building the tp32 branch from GitHub (soon to be the 3.2.5 release) and testing again? This may not be an option for you, but I would be curious whether you still experience this with the new implementation.

Best,

David

Stephen Mallette

Jun 8, 2017, 6:27:30 AM
to Gremlin-users
You might also try to develop a test with the Java driver to verify whether it is related just to Python or happens generally. I wonder if there's something in the Python driver that makes it look like once a client is considered "slow" the channel doesn't come back. That seems a bit unlikely, as there is really nothing a client has to do to tell the server it's ready for more requests. Just throwing out ideas for how this might be diagnosed a bit better.


Mauricio Pradilla

Jun 9, 2017, 10:26:08 AM
to Gremlin-users
Thank you @David and @Stephen for your answers. 

I have identified some issues in the Python driver related to websocket connection management. Something - either the Gremlin Server or the driver - is closing the connection before the response comes in. The crash is then produced by bad handling of the connection close. I am doing some tests changing the websocket configuration on the driver side.

- Is there a configuration variable on the Gremlin Server related to the websocket timeout?

Stephen Mallette

Jun 12, 2017, 7:53:10 AM
to Gremlin-users
The Java driver has a connectionPool.keepAliveInterval which will send a websocket ping to the server. I don't know if the Python driver supports that... if this is the problem, then perhaps that would fix it.
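For comparison, this is roughly what that setting looks like in a Java driver remote configuration YAML file (a sketch from memory - verify the key names against the driver reference docs; the interval is in milliseconds):

```yaml
# remote.yaml (fragment, Java driver)
hosts: [localhost]
port: 8182
connectionPool:
  keepAliveInterval: 180000   # send a websocket ping after 180s without a write on the connection
```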


an...@disruptek.com

Jul 10, 2017, 2:15:09 AM
to Gremlin-users
FWIW, the problem exists in both the Python and JavaScript -- https://github.com/jbmusso/gremlin-javascript -- libraries.  Has anyone done any work on this?  If someone's working on the Python side, perhaps we can compare notes and code against the JavaScript side, or vice-versa.

Stephen Mallette

Jul 10, 2017, 6:23:44 AM
to Gremlin-users
We are continually trying to get the drivers more in line with each other. I'm not sure if anyone is specifically working on this particular issue. David Brown is on vacation for a few more days - perhaps he will have something to say on the python piece when he gets back.


an...@disruptek.com

Jul 10, 2017, 2:04:51 PM
to Gremlin-users
The websocket protocol spec demands that special "ping" control frames be answered with "pong" responses.

I took a quick look through some of the relevant codebases:
  • The Java client apparently pings at regular intervals, but takes no action in response to pong replies (or the absence of same).
  • The JavaScript client doesn't implement the ping/pong spec at all.
  • The aiogremlin client similarly doesn't implement the spec.
  • The Java server doesn't implement the spec, which obviously makes the client implementations moot!
I'm not sure this is the actual source of the problem or that implementing these semantics will solve it, but I thought it was worth sharing.
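To make the pong requirement concrete, a minimal client-side control-frame dispatch might look like this (purely illustrative; the opcodes are from RFC 6455, but `handle_frame` and `send` are hypothetical names, not any driver's API):

```python
# Opcodes from RFC 6455, section 5.2
OP_TEXT, OP_PING, OP_PONG = 0x1, 0x9, 0xA

def handle_frame(opcode, payload, send):
    """Dispatch one incoming frame; send(opcode, payload) writes a frame out."""
    if opcode == OP_PING:
        # The spec requires echoing the ping's payload back in a pong.
        send(OP_PONG, payload)
        return None
    if opcode == OP_PONG:
        # Keepalive bookkeeping (e.g. resetting an idle timer) would hook in here.
        return None
    return payload  # data frame: hand off to the application
```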

Stephen Mallette

Jul 10, 2017, 2:29:57 PM
to Gremlin-users
The server implements ping/pong - it's handled directly by netty if i remember correctly. 

> The Java client apparently pings at regular intervals, but takes no action in response to pong replies

what action should it take when a pong arrives?


an...@disruptek.com

Jul 11, 2017, 2:24:53 PM
to Gremlin-users
The problem I'm seeing is that the server does not recognize when a client is rudely disconnected before a reply is sent, so the write stays buffered and (in my case, with 30-60 requests from the same client) the Gremlin Server is locked up indefinitely.

I don't know that there is any particular best-practice for handling ping/pong, but a simple implementation might be as follows...

If the client tells the server that it supports keepalive, any frame received by the server should reset an interval timer, whether that frame is a ping/pong or a standard request/reply transmission. If the timer runs out, we send a ping. If we don't receive any frame before the next timer runs out, we presume that the client is dead.
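The scheme above can be sketched as a small state machine (purely illustrative; `KeepaliveMonitor` is a hypothetical name, not TinkerPop code - the clock is injected so the logic is testable):

```python
import time

class KeepaliveMonitor:
    """Server-side idle tracking for one channel: any frame resets the idle
    clock; after `interval` idle seconds a ping is sent; after a second idle
    interval with no frame at all, the client is presumed dead."""

    def __init__(self, interval, send_ping, now=time.monotonic):
        self.interval = interval
        self.send_ping = send_ping      # callable that pings the client
        self.now = now
        self.last_frame = now()
        self.ping_sent = False

    def on_frame(self):
        """Call for every frame received: ping, pong, or normal request."""
        self.last_frame = self.now()
        self.ping_sent = False

    def check(self):
        """Call periodically; returns True once the client is presumed dead."""
        idle = self.now() - self.last_frame
        if idle < self.interval:
            return False
        if not self.ping_sent:
            self.send_ping()
            self.ping_sent = True
            return False
        # A ping went out and another full interval passed with no frame back.
        return idle >= 2 * self.interval
```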

My next step is probably to implement ping/pong on a client so we can at least verify ping receipt from the server and see if sending pongs (or more accurately, NOT sending pongs) actually works to provoke server disconnect.

Stephen Mallette

Jul 11, 2017, 2:43:41 PM
to Gremlin-users
> If the client tells the server that it supports keepalive, any frame receipt by the server should reset an interval timer, be that frame a ping/pong or a standard request/reply transmission.  If the timer runs out, we should send a ping.  If we don't get any frame before the next timer runs out, we presume that the client is dead.

i see what you're saying - you're talking about implementing the other side of the keepalive - the server pinging the client and the client returning a pong. fair enough.

> My next step is probably to implement ping/pong on a client so we can at least verify ping receipt from the server and see if sending pongs (or more accurately, NOT sending pongs) actually works to provoke server disconnect.

I verified that yesterday in the debugger through this test:


I got a pong on the client during that test. if you do happen to do work on this, please create a JIRA ticket and target your work for the tp32 branch. 

Thanks for your thoughts on the issue - it would be nice to see all TinkerPop clients behaving consistently in this regard.




an...@disruptek.com

Jul 12, 2017, 1:36:56 PM
to Gremlin-users
Right, but a pong received on the client is immaterial to the problem -- and we already know that the Java client sends pings correctly.  Do we happen to have a test to check for pings from the server?  If so, what does the server do if it does not receive a pong?

I'm probably not the right guy to hack on the Tinkerpop side because that's exactly what you'd get -- a hack -- as I haven't written any Java in literally decades.  ;-)

I should have some time to revisit this on the client side in another 2-3 weeks.  It's a simple task to implement the ping/pong, but it has to be prioritized behind innumerable other simple tasks.  :-(

Stephen Mallette

Jul 12, 2017, 1:41:17 PM
to Gremlin-users
i don't think pings are being sent from the server....unless netty is doing it for us (i'm guessing not). Sounds like a todo list item. Please feel free to create an issue in JIRA - perhaps someone can get around to digging into it further.


David Brown

Jul 17, 2017, 2:20:26 PM
to Gremlin-users
Couple things here:

- Does this have a JIRA yet...? Maybe it should if it doesn't.

- Do we know what client Tunay was using when he first reported? Was it gremlin-python?

- Mauricio, what version of gremlin-python are you using? That looks like old code. I believe it was changed before the 3.2.4 release [1].

A couple of things about ping/pong with the Python clients:

- Neither gremlin-python nor aiogremlin send pings to the server automatically.

   * The Tornado client used by gremlin-python can be configured to automatically ping the server at given intervals [2].
   * The aiohttp client used by aiogremlin does not automatically ping. Any client pings would need to be handled by aiogremlin driver application code.

- gremlin-python and aiogremlin have different behaviors upon receiving a ping from the server.

    * The Tornado client used by gremlin-python apparently does not respond to pings automatically [3], which is a bit surprising as this violates the WebSocket Protocol [4].
    * The aiohttp client used by aiogremlin by default responds to any server-side pings with a pong [5].

I am going to try to look into this issue this week, write some tests that try to reproduce and locate the error, and determine the desired behavior of client/server connection maintenance/closes/reconnects. Any info you guys can give me based on this response would be extremely helpful.

Best,

Dave


