SimplePath query is slower in 6 node vs 3 node Cassandra cluster


Varun Ganesh

Nov 24, 2020, 4:35:22 PM
to JanusGraph users
Hello,

I am currently using JanusGraph version 0.5.2. I have a graph with about 18 million vertices and 25 million edges.

I have two versions of this graph: one backed by a 3-node Cassandra cluster and another backed by a 6-node Cassandra cluster (both with a replication factor of 3).

I am running the below query on both of them:

g.V().hasLabel('label_A').has('some_id', 123).has('data.name', 'value1').repeat(both('sample_edge').simplePath()).until(has('data.name', 'value2')).path().by('data.name').next()

The issue is that this query takes ~130ms on the 3 node cluster whereas it takes ~400ms on the 6 node cluster.

I have tried running ".profile()" on both versions and the outputs are almost identical in terms of the steps and time taken.

g.V().hasLabel('label_A').has('some_id', 123).has('data.name', 'value1').repeat(both('sample_edge').simplePath()).until(has('data.name', 'value2')).path().by('data.name').limit(1).profile()

==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
JanusGraphStep([],[~label.eq(label_A), o...                            1           1           4.582     0.39
    \_condition=(~label = label_A AND some_id = 123 AND data.name = value1)
    \_orders=[]
    \_isFitted=true
    \_isOrdered=true
    \_query=multiKSQ[1]@8000
    \_index=someVertexByNameComposite
  optimization                                                                                 0.028
  optimization                                                                                 0.907
  backend-query                                                        1                       3.012
    \_query=someVertexByNameComposite:multiKSQ[1]@8000
    \_limit=8000
RepeatStep([JanusGraphVertexStep(BOTH,[...                             2           2        1167.493    99.45
  HasStep([data.name.eq(...                                                                  803.247
  JanusGraphVertexStep(BOTH,[...                                   12934       12934         334.095
    \_condition=type[sample_edge]
    \_orders=[]
    \_isFitted=true
    \_isOrdered=true
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@812d311c
    \_multi=true
    \_vertices=264
    optimization                                                                               0.073
    backend-query                                                    266                       5.640
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@812d311c
    optimization                                                                               0.028
    backend-query                                                  12689                     312.544
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@812d311c
  PathFilterStep(simple)                                           12441       12441          10.980
  JanusGraphMultiQueryStep(RepeatEndStep)                           1187        1187          11.825
  RepeatEndStep                                                        2           2         810.468
RangeGlobalStep(0,1)                                                   1           1           0.419     0.04
PathStep([value(data.name)])                                           1           1           1.474     0.13
                                            >TOTAL                     -           -        1173.969        -

I'd really appreciate some input on figuring out why the query is 3x slower on 6 nodes.

I realise that you may require more context. Happy to provide more information as required!

 Thank you!
 

Varun Ganesh

Nov 24, 2020, 5:07:58 PM
to JanusGraph users
One additional note: you may have noticed that the profile output above shows a total time of more than 1000 ms. I do not know why this is the case.

When run on the console without profile, it reflects the true time taken:
 gremlin> clockWithResult(10) { graph.tx().rollback(); g.V().hasLabel('label_A').has('some_id', 123).has('data.name', 'value1').repeat(both('sample_edge').simplePath()).until(has('data.name', 'value2')).path().by('data.name').limit(1).next() }
 ==>130.9545608

Thanks!

BO XUAN LI

Nov 26, 2020, 11:19:32 AM
to janusgra...@googlegroups.com
Hi,

> why the query is 3x slower on 6 nodes

Did you check for hardware differences? Perhaps the 6-node cluster has a slower network, less memory, slower disks, etc.
Another possibility I can think of is that the data involved in your query is probably distributed across nodes. Since your 3-node Cassandra cluster has a replication factor of 3, I would presume that all of your data is available on every node. Then there would be fewer round-trips happening within the 3-node cluster.
Generally, it makes sense to me that a small cluster has lower latency than a large cluster, as long as neither cluster is fully loaded. Of course, with a larger cluster you can achieve higher throughput.
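
To make the replica-locality point concrete, here is a minimal sketch of the arithmetic. It is an illustrative model, not a measurement: it assumes keys are spread evenly across the nodes and that the coordinator for each read is chosen without regard to where the replicas live (no token-aware routing). It also ignores the read consistency level and says nothing about actual latency. The function name `local_replica_fraction` is made up for this example.

```python
from fractions import Fraction


def local_replica_fraction(nodes: int, replication_factor: int) -> Fraction:
    """Fraction of keys for which a randomly chosen coordinator holds a replica.

    Assumption: keys are evenly distributed and the coordinator is picked
    independently of the key. Token-aware drivers behave differently.
    """
    if nodes <= 0 or replication_factor <= 0:
        raise ValueError("nodes and replication_factor must be positive")
    # A key cannot have more replicas than there are nodes.
    return Fraction(min(replication_factor, nodes), nodes)


for n in (3, 6):
    frac = local_replica_fraction(n, 3)
    print(f"{n} nodes, RF=3: coordinator holds the data for {float(frac):.0%} of keys")
# 3 nodes, RF=3: coordinator holds the data for 100% of keys
# 6 nodes, RF=3: coordinator holds the data for 50% of keys
```

Under these assumptions, every node in the 3-node cluster can serve any read from its own copy, while in the 6-node cluster roughly half of the reads land on a coordinator that has to forward them to another node.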

> the profile step above shows a time taken of >1000ms

This could be a bug in profiling. If you can provide a minimal example to reproduce it, that would be very helpful.

Best regards,
Boxuan



Varun Ganesh

Nov 30, 2020, 2:23:59 PM
to JanusGraph users
Hi Boxuan,

Thank you for getting back to me. Please find my responses below:


> Did you check the hardware differences? 
Yes, I can confirm that the two clusters are identical except for the number of nodes.

> the data involved in your query is probably distributed across nodes
This was our initial guess as well. However, if that were the case, we should observe this slowness for all the queries that we try. But we only observe it for "path" queries.

For instance, here's an example of another traversal query where we observe the SAME latency across the 3 and 6 node clusters:
g.V().hasLabel('label_B').has('some_id', 123).has('data.name', 1234567).both('sample_edge').valueMap('data.field1', 'data.field2').next(10)


> Then there would be fewer round-trips happening within the 3-node cluster
I also want to point out that we are not running JanusGraph in embedded mode (where it is colocated with Cassandra). Instead, it is running separately on its own server nodes.

> Of course with larger cluster you can achieve higher throughput
Interestingly, we are not observing any difference in throughput (i.e. the maximum queries per second that can be handled without seeing timeouts) between the two clusters.

Would appreciate any input on where/how we could possibly investigate further.

Thank you!
Varun

Varun Ganesh

Dec 9, 2020, 8:46:34 AM
to janusgra...@googlegroups.com
(I previously posted this on the forum: https://groups.google.com/g/janusgraph-users/c/nkNFaFzdr4I. I was hoping that I might get a bit more traction through the mailing list.)

 Thank you!