Gremlin query via HTTP is extremely slow

Glennie Helles Sindholt

Sep 6, 2018, 10:10:47 AM
to JanusGraph users

I'm running JanusGraph 0.2.0 with a DynamoDB backend. I'm experiencing some performance issues that just do not make sense to me and I'm hoping someone can explain it to me. Here is my scenario:


I'm running two very simple gremlin queries through both the Gremlin Console and via an HTTP request (issued from the same machine as the Gremlin Server resides on, so no network issues). The queries look like this:


First query:


via console: g.V(127104, 1069144, 590016, 200864).out().count()
via http: curl -XPOST -Hcontent-type:application/json -d '{"gremlin":"g.V(127104, 1069144, 590016, 200864).out().count()"}' http://localhost:8182


Second query:


via console: g.V(127104, 1069144, 590016, 200864).out().in().dedup().count()
via http: curl -XPOST -Hcontent-type:application/json -d '{"gremlin":"g.V(127104, 1069144, 590016, 200864).out().in().dedup().count()"}' http://localhost:8182


It is by no means a huge graph - the first query returns 750 and the second query returns 9154. My problem is that I see huge performance differences between the queries run via HTTP compared to the console. For the first query, both the console and the HTTP request return immediately, and looking at the Gremlin Server log, I'm pleased to see that the query takes only 1-2 milliseconds in both cases. All is good.


Now for the second query, the picture changes. While the console continues to provide the answer immediately, it now takes between 4 and 5 seconds (!!) for the HTTP request to return the answer! The server log reports roughly the same execution time (some 50-60 ms) for both executions of the second query, so what is going on? I'm only doing a count(), so the slow HTTP response cannot be a serialization issue - it only needs to return a number, just as in the first query.


Can anyone explain this huge delay in HTTP response to me?

Florian Hockmann

Sep 6, 2018, 11:42:23 AM
to JanusGraph users
Have you tried to execute the second query multiple times in a row?
Sending the Gremlin traversals via HTTP as you did means that the server has to compile them as Groovy code, which can take a few seconds and probably accounts for a big part of the delay. Gremlin Server uses a cache for those traversals, so subsequent requests should be much faster when they are exactly the same.

Using script parameterization could solve your problem in this case, at least when you are sending the same queries multiple times with just different parameters. Only the first execution of a given query on the server would still be that slow.
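As a sketch of what that could look like (assuming the stock Gremlin Server HTTP endpoint, which accepts a top-level "bindings" map alongside "gremlin"; the variable name "ids" is my own choice, not anything reserved), the second query could be parameterized like this - built in Python here just to show the request body:

```python
import json

# Sketch of a parameterized request body for Gremlin Server's HTTP endpoint.
# The script text stays constant across requests (so the server's
# compiled-script cache can be reused); only the "bindings" values change.
payload = {
    "gremlin": "g.V(ids).out().in().dedup().count()",
    "bindings": {"ids": [127104, 1069144, 590016, 200864]},
}
body = json.dumps(payload)
print(body)

# The request itself would then look something like:
#   curl -XPOST -Hcontent-type:application/json -d "$body" http://localhost:8182
```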

Apart from script parameterization, you could use a GLV (Gremlin Language Variant) to execute Gremlin traversals directly from a programming language like Java, Python, JavaScript, or C#, if using HTTP is not a requirement for you. That should also be faster, as GLV traversals are sent as bytecode that is translated to Java on the server side, which is a lot faster than compiling Groovy. You can read more about this performance difference in this post from Marko Rodriguez. The summary of that post:

JavaTranslator is about 1000x faster than evaluating a String script and about 3x faster than evaluating a compiled script. JavaTranslator takes about 40 microseconds to translate the bytecode, while an uncached String script takes 40 milliseconds.

So, what did we learn?

1. Bytecode is slick in that we don't have to use Gremlin-Groovy to evaluate it (if there are no lambdas) and thus can do everything in Java, and fast!
2. It is very important to always use parameterized queries with Gremlin Server etc., as you can see how costly it is to evaluate a String script repeatedly.

Glennie Helles Sindholt

Sep 10, 2018, 4:04:16 AM
to JanusGraph users
Yes, I have tried to execute it many times in a row with no changes to the script at all - I am aware of the compile time required for new scripts, so I am specifically avoiding that. It's the exact same script that I'm sending again and again. I should also note that in my actual use case, I query the graph from an AWS Lambda function via the JavaScript client (and yes, I use script parameterization), but that was also rather slow (~2 seconds) compared to the execution times I saw in the console (60 ms), which was why I started experimenting with performance. I tried the HTTP request on the same machine to rule out network delays.

I should maybe also mention that I find it very easy to reproduce. I have created a new graph where I load a little bit of data, check the performance of the HTTP request, load more data, check performance again, etc., and the response times of the HTTP requests get slower and slower as more and more data is loaded into the graph. Of course, it's not so much that it gets slower on larger graphs than on smaller graphs - it's more the fact that the response times of the HTTP requests are orders of magnitude slower than in the console.

Glennie Helles Sindholt

Sep 10, 2018, 1:08:27 PM
to JanusGraph users
And a little update: I have tried to load the data into TinkerGraph instead of JanusGraph, and now the HTTP request returns as fast as in the console. So the slow response appears to be linked to JanusGraph...

Glennie Helles Sindholt

Sep 14, 2018, 7:16:31 AM
to JanusGraph users
I have posted the question on StackOverflow as well, with screenshots of the execution times obtained by running .profile() (as suggested by Stephen Mallette). It is evident that the execution times for the different steps are much, MUCH slower when calling via HTTP, which to me makes no sense at all. Is JanusGraph using different query implementations depending on how it is called??

Jason Plurad

Sep 14, 2018, 9:17:02 AM
to JanusGraph users

No, JanusGraph doesn't do anything different when you call it using HTTP. JanusGraph uses the Gremlin Server unchanged from TinkerPop. If you are able to use a :remote connection with the Gremlin Console, I'd expect the results from the profile() step to be about the same as from an HTTP request.

You could try a different storage backend, like embedded BerkeleyJE, to determine whether the connection to DynamoDB is part of the issue.
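For reference, a minimal JanusGraph properties file for an embedded BerkeleyJE backend could look something like this (a sketch - the storage directory is just an example path):

```properties
# Sketch: JanusGraph configured with embedded BerkeleyJE instead of DynamoDB.
gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=berkeleyje
storage.directory=db/berkeley
```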

Your Gremlin Server is configured with the WsAndHttpChannelizer? Any other changes in the gremlin-server.yaml? Are you able to share your data and code project?

Stephen Mallette

Sep 14, 2018, 9:19:20 AM
to JanusGraph users list
Might need a JFR (or some data with a traversal to reproduce reliably) to sort this. Can't think of anything in the HTTP endpoint code that would make a difference at this level on any graph.


Glennie Helles Sindholt

Sep 14, 2018, 11:23:26 AM
to JanusGraph users
Yes, Gremlin Server is configured with WsAndHttpChannelizer, but no other changes have been made to the gremlin-server.yaml file (I used this guide to set up JanusGraph with DynamoDB: https://bricaud.github.io/personal-blog/janusgraph-running-on-aws-with-dynamodb/).

I can't really provide the original data, but I will try to anonymize it, so I can share it.

Stephen Mallette

Sep 14, 2018, 11:36:34 AM
to JanusGraph users list
Just a suggestion, but if you have to hassle with big downloads and anonymization, a Java Flight Recording showing memory/CPU usage might be a faster/easier approach. Personally, I'd rather look at that than recreate your environment (last resort).


Glennie Helles Sindholt

Sep 17, 2018, 9:42:34 AM
to JanusGraph users
So, I have anonymized the data (it was actually not too much of a hassle ;) and created a simple GitHub project (https://github.com/gsindholt/public-graph-test) that will load this data into JanusGraph - the readme file includes the details. I have verified that the huge delay is still present with this anonymized data.

I'm not sure how to go about this Flight Recording - as far as I can see it requires a commercial license, which I do not have...

Stephen Mallette

Sep 17, 2018, 8:39:16 PM
to JanusGraph users list
> I'm not sure how to go about this Flight Recording - as far as I can see it requires a commercial license, which I do not have...

Java Flight Recorder is "free on developer desktops/laptops":


Nice that you got the anonymization done though - maybe when a profiler gets slapped on this thing we can find out where the bottleneck is.



Glennie Helles Sindholt

Sep 18, 2018, 3:46:52 AM
to JanusGraph users
Hmm... looks like I'm banging my head against another wall. I'm running OpenJDK, and according to this, "The OpenJDK does not have commercial features or Flight Recorder. These are part of the Oracle JDK only." *sigh*

Glennie Helles Sindholt

Sep 20, 2018, 7:32:12 AM
to JanusGraph users
I changed from OpenJDK to the Oracle JDK and have managed to run two flight recordings of the server: one where I executed the query via HTTP (from the Gremlin Server machine) and one where I queried through the console. Now, I'm no expert in reading these recordings, but the call stacks look surprisingly different to me. I have attached the .jfr files here for you to inspect - do they make any sense to you guys?

/Glennie
console.jfr
http_local.jfr

Stephen Mallette

Sep 20, 2018, 8:11:17 AM
to JanusGraph users list
Thanks for figuring that out. A few things:

1. Can you please explain what your test process was doing on the client side to trigger these? Did you just run the single traversal once for the console and once for HTTP, or something else? It may be necessary to extend your recording a bit - in other words, script your client side to repeatedly make the calls so that we get some more data to look at. It looks like you have about a minute of recording time for both (the total time the server was up was much longer, though), but not a lot of other activity on the server.
2. There seems to be a significant amount of time spent in deserialization with DynamoDB for HTTP, but I don't see that for the console. Hard to say why that is.
3. I'm not seeing a lot of TinkerPop classes in the "hot methods", which can mean that most of the time is being taken up elsewhere somehow, but I also wonder whether we have a long enough sample time (see item 1) to make a determination on that.
4. I think it would be nice to enable Allocation Profiling and Heap Statistics as part of this: http://isuru-perera.blogspot.com/2016/03/specifying-custom-event-settings-file.html - I guess they don't enable that by default because it might affect the very performance you're trying to measure, but it tends to provide another dimension to consider.
5. I think you need to change your default memory configuration for the server. It looks like you're currently running with "-Xms32m -Xmx512m" - I'm not sure how much memory your system has, but perhaps go with something like "-Xms2048m -Xmx4096m"?
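Items 4 and 5 could be combined into the server's JVM options - a sketch, assuming Oracle JDK 8 and the stock gremlin-server.sh (which respects JAVA_OPTIONS); the recording duration and filename are arbitrary, and the "profile" settings template is what typically turns on allocation profiling:

```sh
# Sketch: JVM options for Gremlin Server combining a larger heap (item 5)
# with a longer Flight Recorder session (items 1 and 4).
export JAVA_OPTIONS="-Xms2048m -Xmx4096m \
  -XX:+UnlockCommercialFeatures -XX:+FlightRecorder \
  -XX:StartFlightRecording=duration=120s,settings=profile,filename=recording.jfr"
# ...then start the server as usual, e.g.: bin/gremlin-server.sh
```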




Glennie Helles Sindholt

Sep 21, 2018, 9:17:37 AM
to JanusGraph users
I am noticing something that might be relevant. I was setting up scripts to run the query continuously, as you suggested, and I noticed that apart from the very first execution of the query from the console, none of the following executions were hitting DynamoDB at all - which suggests to me that the console has some sort of cache from which all subsequent executions simply retrieve the result. When calling via HTTP, every execution of the query hits DynamoDB. I assume this is because HTTP is sessionless while the console has a session. Of course, the very first execution of the query (both via console and via HTTP) should still show roughly the same execution time, so I guess my question is: is a query from a session executed via the same code as sessionless queries, or could that perhaps be the culprit??

Stephen Mallette

Sep 21, 2018, 9:28:54 AM
to JanusGraph users list
HTTP will auto-close a transaction at the end of the request. The Console connects with a session so that you can hold variables between requests. I guess you could confirm your findings if you initialized the console to auto-manage the transaction which you could do with:

:remote connect tinkerpop.server conf/remote.yaml session-managed

Or just add a rollback() after each execution of the traversal in the console. Sorry I didn't think of that one a long time ago - it never once crossed my mind.
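In console terms, the rollback variant might look something like this (a sketch using the ids from the earlier posts; the timing code is just for illustration):

```groovy
// Sketch: time repeated executions in the console with an explicit
// rollback between runs, so each run re-reads from the storage backend
// instead of serving results from the open transaction's cache
// (mirroring what the HTTP endpoint does automatically).
5.times {
    def start = System.currentTimeMillis()
    println g.V(127104, 1069144, 590016, 200864).out().in().dedup().count().next()
    println "took ${System.currentTimeMillis() - start} ms"
    g.tx().rollback()
}
```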



Glennie Helles Sindholt

Sep 24, 2018, 6:32:30 AM
to JanusGraph users
Thank you, Stephen, for that suggestion :) Everything makes sense now, and since adding the rollback(), query times via HTTP and via the console are comparable. Unfortunately, they are now both slow, which I guess is somewhat surprising to me. As I mentioned, the graph is currently not big, so the fact that a simple query like this one takes ~4 seconds to complete makes it almost useless to me - and I expect the graph to grow much, much bigger. I will experiment with a Cassandra backend to see if that improves performance - or do you have any other good ideas as to what can be done to improve query performance?

/Glennie

Stephen Mallette

Sep 24, 2018, 7:38:38 AM
to JanusGraph users list
The traversal itself doesn't seem like it can really be improved from a Gremlin perspective:

g.V(127104, 1069144, 590016, 200864).out().in().dedup().count()

I mean... if your requirement is to traverse all out edges to their adjacent vertices and then traverse back on in edges, and you expect duplicates, there's not much else you can do. I do think that 5 seconds for a count of about 9000 seems a bit slow, but it's not clear how many edges are being traversed - you could be ending up with 9000 after dedup'ing 100,000. I suppose you could dedup() after the out() if your graph structure is such that there would be duplicates there, since it doesn't seem like you care about those - there's no need to traverse their edges twice if you're throwing them away:

g.V(127104, 1069144, 590016, 200864).out().dedup().in().dedup().count()

You're probably on the right track with the idea of experimenting a bit. Maybe Cassandra will be faster. Also consider how much of your graph can be held in memory to answer these kinds of queries - you can see how fast it returns results when the graph is cached in memory, so maybe you can leverage that somehow even as your graph gets bigger, as you say. Or maybe you need to do some precalculations with Gremlin/Spark? Or perhaps something should change in your schema?

Anyway... at least we sorted out the weird performance discrepancy - that was really perplexing to me. Hopefully I'll remember that little difference between HTTP and the Console if I ever hear about this problem again.


