Slow Gremlin query performance.


Sabari Gandhi

Sep 22, 2014, 4:46:29 PM
to aureliu...@googlegroups.com
Hi,

I am working with Cassandra 2.0.8 and Titan 0.5, and I am seeing a performance issue with a Gremlin query. I have indexes on both city and type. The city vertex is a super node in my graph.

The following query is slow:

g.V('city', 'Boston').in('cityKey').has('type', 'SOFTWARE').count()

Without the has() step, the result is instantaneous:

g.V('city', 'Boston').in('cityKey').count()

I have read that has() sometimes does not use indexes, so I also rewrote the query to use filter, but that doesn't help.

With my data, g.V('city', 'Boston').in('cityKey').count() returns 44605, and the count for type SOFTWARE is greater than 44500. The timings below show that both the has() version and the filter version are very slow:
 
gremlin> s=System.currentTimeMillis();g.V('city', 'Boston').in('cityKey').has('type', 'SOFTWARE').count();System.currentTimeMillis()-s
==>91545
gremlin> s=System.currentTimeMillis();g.V('city', 'Boston').in('cityKey').type.filter{it == 'SOFTWARE'}.count();System.currentTimeMillis()-s
==>92028
gremlin>

Is there any way I can rewrite this query so that it performs well? Also, if filtering can be made fast, I will likely need several filters at once, for example type SOFTWARE and language java. Any help is greatly appreciated.

Thanks,
Sabari

Daniel Kuppitz

Oct 6, 2014, 10:31:39 AM
to aureliu...@googlegroups.com
You can make use of vertex-centric indices. Create the index, then move or copy the filter property (type) to the cityKey edges, and then run:

g.V('city', 'Boston').inE('cityKey').has('type','SOFTWARE').count()

For more use cases (more filter criteria), create more indices.
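
To show why this helps on a super node, here is a toy in-memory model in Java. It is only a sketch: `SuperNodeSketch` and its methods are names made up for the example, and it does not mirror how Titan stores edges. The point it tries to make is that the vertex-based filter has to look up every neighbour to read its type, while a count over edges grouped by the copied property only touches the matching group.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: hypothetical class, not Titan code and not Titan's storage layout.
final class SuperNodeSketch {
    // Incident edges of one super node, each identified by the neighbouring vertex id.
    private final List<Long> neighbours = new ArrayList<>();
    // Property stored on the neighbouring vertices (one lookup per neighbour).
    private final Map<Long, String> typeOfVertex = new HashMap<>();
    // Stand-in for a vertex-centric index: edges grouped by the type copied onto the edge.
    private final TreeMap<String, List<Long>> edgesByType = new TreeMap<>();
    // Counts how many neighbour vertices had to be inspected.
    int vertexLookups = 0;

    void addNeighbour(long id, String type) {
        neighbours.add(id);
        typeOfVertex.put(id, type);
        edgesByType.computeIfAbsent(type, k -> new ArrayList<>()).add(id);
    }

    // Roughly in('cityKey').has('type', t): inspects every neighbour vertex.
    int countViaVertices(String type) {
        int n = 0;
        for (long id : neighbours) {
            vertexLookups++;
            if (type.equals(typeOfVertex.get(id))) n++;
        }
        return n;
    }

    // Roughly inE('cityKey').has('type', t) with an index: reads only the matching group.
    int countViaEdgeIndex(String type) {
        List<Long> hits = edgesByType.get(type);
        return hits == null ? 0 : hits.size();
    }

    public static void main(String[] args) {
        SuperNodeSketch boston = new SuperNodeSketch();
        for (long i = 0; i < 1000; i++) {
            boston.addNeighbour(i, i % 10 == 0 ? "HARDWARE" : "SOFTWARE");
        }
        System.out.println("via vertices:   " + boston.countViaVertices("SOFTWARE"));
        System.out.println("via edge index: " + boston.countViaEdgeIndex("SOFTWARE"));
        System.out.println("vertex lookups: " + boston.vertexLookups);
    }
}
```

Both counts agree; only the vertex-based one pays a lookup per neighbour, which is the part that hurts when almost every neighbour matches the filter.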

Cheers,
Daniel

Sabari Gandhi

Oct 7, 2014, 3:28:17 PM
to aureliu...@googlegroups.com
Hi Daniel,

Thanks a lot for the response. I have created a vertex-centric index on the edge, but the response is still slow. Please see the query below:

 s=System.currentTimeMillis();g.V('city', 'Boston').has('type', 'software').as('a').outE('programmingLanguage').has('entityId', 'java').count();System.currentTimeMillis()-s

I have the index on the programmingLanguage edge as you suggested, but the query still takes 2.4 seconds to return a count of 120. Below is the code used to add the vertex-centric index.

// Create the vertex-centric index on the edge label only if it does not exist yet.
if (!graphMgmt.containsRelationIndex(progDependsEdge, INDEX_BY_PROG_LANG)) {
    graphMgmt.buildEdgeIndex(progDependsEdge, INDEX_BY_PROG_LANG, Direction.BOTH, entityIdKey);
    logger.info("Created edge index on {} edge for entityId", EDGE_PROG_LANG);
}

Can you please confirm whether my approach is right and, if not, let me know the optimal approach? Also, is there a way to confirm that the query uses the index I created?

Thanks,
Sabari

Daniel Kuppitz

Oct 7, 2014, 6:08:12 PM
to aureliu...@googlegroups.com
Hi Sabari,

I just tried the following code snippet:

g = TitanFactory.open("conf/titan-cassandra.properties")
m = g.getManagementSystem()
entityId = m.makePropertyKey('entityId').dataType(String.class).make()
programmingLanguage = m.makeEdgeLabel('programmingLanguage').make()
m.buildEdgeIndex(programmingLanguage, 'programmingLanguageByEntityId', Direction.BOTH, Order.ASC, entityId)
m.commit()

languages = ["java","c","c++","c#","f#","groovy","scala","python","ruby","php","assembler","javascript","coffeescript","basic","clojure","go","erlang","perl","pascal"]
nl = languages.size()
random = new Random()

root = g.addVertex(null)

(1..100000).each {
  root.addEdge("programmingLanguage", g.addVertex(null), ["entityId":languages[random.nextInt(nl)]])
  if (it%10000 == 0) { println it; g.commit() }
}; g.commit()

t = System.currentTimeMillis(); println root.outE("programmingLanguage").has("entityId","java").count(); System.currentTimeMillis() - t

In my test it created 5,200 java edges, and the last line (the actual count()) took
  • 660 ms without database cache and
  • 90 ms with database cache
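
Because the first run pays for a cold cache, it can be worth timing the same query several times instead of once. Below is a minimal Java sketch of such a helper. `QueryTimer` and `timeRuns` are names invented for this example, and the `Supplier` is a placeholder for whatever executes the real traversal.

```java
import java.util.function.Supplier;
import java.util.stream.LongStream;

// Hypothetical helper, not part of Titan: times repeated runs of any query.
final class QueryTimer {
    // Runs the query `runs` times and returns the elapsed milliseconds of each run.
    static long[] timeRuns(Supplier<?> query, int runs) {
        long[] millis = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            query.get();
            millis[i] = (System.nanoTime() - start) / 1_000_000;
        }
        return millis;
    }

    public static void main(String[] args) {
        // Stand-in workload; replace the lambda with the real traversal call.
        long[] t = timeRuns(() -> LongStream.range(0, 5_000_000).sum(), 5);
        System.out.println("first run: " + t[0] + " ms");
        for (int i = 1; i < t.length; i++) {
            System.out.println("run " + (i + 1) + ": " + t[i] + " ms");
        }
    }
}
```

Comparing the first entry with the later ones separates the cold-cache cost from the steady-state cost of the query.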

Cheers,
Daniel


Sabari Gandhi

Oct 8, 2014, 1:29:06 PM
to aureliu...@googlegroups.com
Hi Daniel:

Thanks a lot for the immediate response with the test results. Please see below my use case with questions:

g.V('city', 'Boston').has('type', 'software').as('a').outE('programmingLanguage').has('entityId', 'java').count()
==>120


g.V('city', 'Boston').has('type', 'software').count()
==>3000



We are implementing the same thing in Java.
  • In my scenario, the first part of the query, "g.V('city', 'Boston').has('type', 'software')", returns more than one vertex (around 3000 in this case). I then traverse from those vertices over the edge that has the vertex-centric index to get the final result (120 in this case). Each of the vertices returned by the first half of the query has 8 edges.
  • Is a vertex-centric index helpful in my scenario? If yes, can you suggest ways to optimize?
  • Is there a way to confirm that the query uses the index I created?

Thanks Again,
Sabari

Daniel Kuppitz

Oct 8, 2014, 1:53:18 PM
to aureliu...@googlegroups.com
Then you should try MultiQuery; that would definitely make more sense with a large number of starting vertices. I'm currently not aware of any way to see whether an index is really used, but usually you should see huge differences in query times when you compare indexed and non-indexed queries.

Cheers,
Daniel

Bob B

Oct 8, 2014, 2:03:10 PM
to aureliu...@googlegroups.com

Praveen Peddi

Oct 8, 2014, 2:47:59 PM
to aureliu...@googlegroups.com
Hi Daniel,
Does your response also mean that a vertex-centric index is not helpful when there are only 5 to 10 edges per vertex but a lot of vertices to start with?

Daniel Kuppitz

Oct 8, 2014, 5:14:35 PM
to aureliu...@googlegroups.com
Not at all. I didn't say anything against vertex centric indices, I just added MultiQuery into the mix. MultiQuery will start at multiple (N) vertices at once and will then use N vertex centric indices simultaneously.
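
As a rough picture of that idea, here is a toy Java model. It does not use or reproduce Titan's MultiQuery API; `MultiStartSketch` and its methods are hypothetical names. It assumes each start vertex has its own small index from entityId to matching edges, and it answers one batched request for all start vertices in parallel.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical class, not Titan's MultiQuery implementation.
final class MultiStartSketch {
    // Per-vertex index: start vertex id -> (entityId value -> ids of matching out-edges).
    private final Map<Long, Map<String, List<Long>>> indexByVertex = new HashMap<>();

    void addEdge(long from, String entityId, long edgeId) {
        indexByVertex.computeIfAbsent(from, k -> new HashMap<>())
                     .computeIfAbsent(entityId, k -> new ArrayList<>())
                     .add(edgeId);
    }

    // One batched request: each start vertex is answered from its own index.
    // Reading the maps in parallel is safe here because nothing writes concurrently.
    long countMatching(List<Long> starts, String entityId) {
        return starts.parallelStream()
                .mapToLong(v -> indexByVertex
                        .getOrDefault(v, Collections.emptyMap())
                        .getOrDefault(entityId, Collections.emptyList())
                        .size())
                .sum();
    }

    public static void main(String[] args) {
        MultiStartSketch g = new MultiStartSketch();
        g.addEdge(1L, "java", 100L);
        g.addEdge(1L, "scala", 101L);
        g.addEdge(2L, "java", 102L);
        g.addEdge(3L, "ruby", 103L);
        System.out.println(g.countMatching(Arrays.asList(1L, 2L, 3L), "java"));
    }
}
```

With only a handful of edges per vertex, each individual lookup is cheap either way; in this model the gain comes from issuing the lookups for all start vertices together rather than one after another.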

Cheers,
Daniel
