Improve performance on Properties Step

Claire
May 8, 2020, 9:19:17 PM
to JanusGraph users
Hello,

I am trying to optimize a simple recommendation query, and am somewhat stuck.

*Our Setup*
- JanusGraph 0.5.1
- Storage Backend: Scylla DB 3.2.4

*The Graph*
Our graph contains millions of vertices and edges. The relevant part looks as follows:

Vertex: user (with several properties)
Vertex: query (with several properties, one being "title")

user is linked to query by an edge "searched".

Each user can have multiple searches, and a user may have different searches with the same title (in which case other properties differ).

*The Scenario*

I know that a user searched for something, let's say "Snowboard", and I want to present him with related search terms by doing an "other users searching for Snowboard also searched for" query.

Originally I started with the following query:

g.V().has('query', 'title', 'snowboard').in('searched').out('searched').has('query', 'title', neq('snowboard')).has('title').dedup().as("related").select("related").by('title').groupCount().order(Scope.local).by(Column.values, Order.desc).profile()

But the query time was beyond acceptable, so I decided to do the grouping and counting in code (Java) rather than in the Gremlin query. (Adding a timeLimit did not bring the hoped-for improvement; it produced really bad results.)
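
A minimal sketch of what such client-side grouping and counting could look like, assuming the traversal returns a flat list of title strings. The class and method names (RelatedSearches, topRelated) are hypothetical and not part of the original code; folding titles to lower case is also an assumption, made so that 'snowboard' and 'Snowboard' are treated as the same term.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: groups and counts titles on the client side.
final class RelatedSearches {

    // Counts titles case-insensitively, drops the seed term,
    // and returns the n most frequent titles (ties broken alphabetically).
    static List<Map.Entry<String, Long>> topRelated(List<String> titles, String seed, int n) {
        String normalizedSeed = seed.toLowerCase(Locale.ROOT);
        Map<String, Long> counts = titles.stream()
            .map(t -> t.toLowerCase(Locale.ROOT))
            .filter(t -> !t.equals(normalizedSeed))
            .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .limit(n)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in for the titles returned by the traversal.
        List<String> titles = List.of("Ski", "ski", "Snowboard", "boots", "ski");
        System.out.println(topRelated(titles, "snowboard", 2));
    }
}
```

This only moves the aggregation out of the database; every title still has to be fetched and serialized first.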

The simplified gremlin query now looks as follows

g.V().has('search', 'title', within('snowboard', 'Snowboard')).in('searched').dedup().out('searched').values('title')



Profiling that query, I see that the "values" step costs a lot of time. I already tried the query-fast option, but that didn't help.

The profile step returns the following:

==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
JanusGraphStep([],[~label.eq(search), sear...                 30246       30246         290.388     3.72
    \_condition=(~label = search AND (title = snowboard OR title = Snowboard))
    \_orders=[]
    \_isFitted=true
    \_isOrdered=true
    \_query=multiKSQ[2]@4000
    \_index=bySearchTitle
  optimization                                                                                 0.030
  optimization                                                                                14.736
  backend-query                                                                                0.000
    \_query=bySearchTitle:multiKSQ[2]@4000
    \_limit=4000
JanusGraphVertexStep(IN,[searched],...                 30245       30245        1811.447    23.22
    \_condition=type[searched]
    \_orders=[]
    \_isFitted=true
    \_isOrdered=true
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@a4661abd
    \_multi=true
    \_vertices=30246
  optimization                                                                                 4.640
  backend-query                                                    30245                    1592.684
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@a4661abd
DedupGlobalStep                                                    10296       10296          11.328     0.15
JanusGraphVertexStep(OUT,[searched]...                 79241       79241        1293.578    16.58
    \_condition=type[searched]
    \_orders=[]
    \_isFitted=true
    \_isOrdered=true
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@a46616dd
    \_multi=true
    \_vertices=10296
  optimization                                                                                 0.174
  backend-query                                                    79241                     557.447
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@a46616dd
NoOpBarrierStep(2500)                                              79241       79241          52.760     0.68
JanusGraphPropertiesStep([title],value)                      47709       47709        4322.742    55.42
    \_condition=type[title]
    \_orders=[]
    \_isFitted=true
    \_isOrdered=true
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@8121f1dd
    \_multi=true
    \_vertices=79241
  optimization                                                                                 2.133
  backend-query                                                    47709                    3969.057
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@8121f1dd
NoOpBarrierStep(2500)                                              47709        2293          18.347     0.24
                                            >TOTAL                     -           -        7800.592        -


What am I missing? Where is some room for improvement?

Gladly looking forward to any hint.

Regards
Claire

Stephen Mallette

May 12, 2020, 7:47:47 AM
to janusgra...@googlegroups.com
I'm not sure I have a complete answer for you, but your original traversal could have been improved and simplified a bit:

g.V().has('query', 'title', 'snowboard').
  in('searched').
  out('searched').has('query', 'title', neq('snowboard')).
  dedup().
  groupCount().
    by('title').
  order(Scope.local).
    by(Column.values, Order.desc).
  limit(Scope.local, 10)

I've left dedup() there, but I'm not sure I understand its use in this case. Won't that make all your counts go to one? I got rid of as().select(), which should turn off path tracking and reduce the resources required to run the traversal. I also tacked on a limit of 10, which was arbitrary, but if you return fewer of those results you will have far lower serialization costs, which can make a big difference in performance.

You might also try a full barrier() to take greater advantage of bulking, assuming dedup() should have gone after in('searched') as shown in your second traversal:

g.V().has('query', 'title', 'snowboard').
  in('searched').
  barrier().
  out('searched').has('query', 'title', neq('snowboard')).
  groupCount().
    by('title').
  order(Scope.local).
    by(Column.values, Order.desc).
  limit(Scope.local, 10)

You don't show your whole profile() but it seems you are just gathering a lot of data. You may need to find ways to better limit the paths you have to traverse in order to get a reasonable answer. For example, perhaps your recommendation can be based on the most recent data rather than all of it. Could you store a timestamp on the "searched" edges and then do:

g.V().has('query', 'title', 'snowboard').
  inE('searched').has('timestamp', gt(oneWeekAgo)).outV().
  barrier().
  outE('searched').has('timestamp', gt(oneMonthAgo)).inV().has('query', 'title', neq('snowboard')).
  groupCount().
    by('title').
  order(Scope.local).
    by(Column.values, Order.desc).
  limit(Scope.local, 10)

With a vertex-centric index on that timestamp you could probably get some fast results that way. Perhaps you could even write a more complex limit that is timestamp and limit() oriented somehow depending on how your data is structured. 
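
The oneWeekAgo and oneMonthAgo bounds in the traversal above are not defined in the thread. As a rough sketch, assuming the "timestamp" edge property holds epoch milliseconds as a long (an assumption about the schema), they could be computed in Java like this. SearchWindow and cutoffMillis are hypothetical names, and 30 days is used as an approximation of one month.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for computing the timestamp bounds used in the traversal.
final class SearchWindow {

    // Returns the epoch-millisecond cutoff that lies `window` before `now`.
    static long cutoffMillis(Instant now, Duration window) {
        return now.minus(window).toEpochMilli();
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        long oneWeekAgo = cutoffMillis(now, Duration.ofDays(7));
        long oneMonthAgo = cutoffMillis(now, Duration.ofDays(30)); // approximation of one month
        System.out.println("oneWeekAgo=" + oneWeekAgo + ", oneMonthAgo=" + oneMonthAgo);
    }
}
```

The resulting values would then be passed into the traversal as the arguments to gt(...).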


