Find connected component in JanusGraph with ~150M vertices and ~350M edges


Harshit Sharma

Jul 19, 2021, 8:20:06 AM
to Gremlin-users
Hi Team,

I'm using JanusGraph with Cassandra as the storage backend and Elasticsearch for indexing.
My graph consists of around 150M vertices and 350M edges. I'm trying to find the connected component of a given vertex with the following query, but it times out:

g.V().has("name", "person1").repeat(__.where(without("a")).store("a").both().simplePath()).emit().hasLabel("person").dedup().count().fold()

This query works fine for data smaller than 100M but times out on larger data.
Is there any optimization I can apply to this query or on the JanusGraph side?

HadoopMarc

Jul 20, 2021, 8:39:22 AM
to Gremlin-users
Hi,

A lot of information regarding connected components with gremlin and its scaling can be found in:

I did not check whether your query has all the optimizations presented in the recipes.

Best wishes,    Marc

On Monday, July 19, 2021 at 14:20:06 UTC+2, harshit.s...@gmail.com wrote:

Harshit Sharma

Jul 20, 2021, 9:36:15 AM
to gremli...@googlegroups.com
Is there a way I can optimize this connected component query?


--
Regards,

Harshit Sharma

HadoopMarc

Jul 21, 2021, 4:02:12 AM
to Gremlin-users
Hi Harshit,

There are two ways to run queries for connected components on larger graphs:

1. Prevent the timeout you mention, but which timeout is it? If it is the default 30,000 ms evaluationTimeout of Gremlin Server, you can simply increase the value in its YAML configuration file.
2. Use the TinkerPop connectedComponent() step in conjunction with withComputer() or withComputer(SparkGraphComputer)

You will have to experiment which one works best for JanusGraph.

Best wishes,     Marc

On Tuesday, July 20, 2021 at 15:36:15 UTC+2, harshit.s...@gmail.com wrote:

Harshit Sharma

Jul 21, 2021, 10:12:23 AM
to Gremlin-users
Thanks for the quick response.
Actually, we are not looking for an OLAP query, as we do not need to find all the connected components.
In our use case, we get a vertex id as input and have to find all the nodes in that vertex's connected component, along with their count.
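
To make the use case concrete, here is a minimal, language-neutral sketch in Python of a breadth-first search from a single start vertex over an in-memory adjacency map. The `adjacency` dict, the vertex names, and the `component_of` helper are hypothetical stand-ins and not JanusGraph or TinkerPop APIs. It only illustrates the idea: each vertex is expanded once thanks to a visited set, so the work depends on the size of the component rather than on the size of the whole graph.

```python
from collections import deque


def component_of(adjacency, start):
    """Return the set of vertices reachable from `start` over undirected edges."""
    visited = {start}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        # Expand each vertex exactly once; skip neighbours already seen.
        for neighbour in adjacency.get(vertex, ()):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return visited


# Hypothetical toy graph with two separate components.
adjacency = {
    "driver1": ["driver2", "car1"],
    "driver2": ["driver1"],
    "car1": ["driver1", "driver3"],
    "driver3": ["car1"],
    "driver9": ["car9"],
    "car9": ["driver9"],
}

members = component_of(adjacency, "driver1")
print(sorted(members), len(members))
```

The visited set plays the same role as the store("a") / where(without("a")) pair in the Gremlin query.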

HadoopMarc

Jul 21, 2021, 2:57:08 PM
to Gremlin-users
Yes, you are right, OLAP does not make much sense for your use case.
Regarding the timeout you mention, can you give more details?
It is also not clear to me why your query on the 300M graph should take much longer than on the 100M graph, unless the "name" property is not indexed or the connected component in the 300M graph is much larger. Did you give this some thought?

Marc

On Wednesday, July 21, 2021 at 16:12:23 UTC+2, harshit.s...@gmail.com wrote:

Harshit Sharma

Jul 22, 2021, 1:10:00 AM
to Gremlin-users

I'm trying to find all the nodes in a connected component of a graph that contains ~130M vertices and ~350M edges.

This is the query I'm using to count the nodes in the connected component:

Input - starting vertex id/name

Output - count of nodes in the connected component.

Query - g.V().has("name", "driver1").repeat(where(without("a")).store("a").both().simplePath().dedup()).emit().hasLabel("driver").count().fold()


The above query takes about 52 s.

RepeatStep alone takes about 29 s.

Is there a way to optimize the linear traversal in RepeatStep or the lookup in WherePredicateStep?

Profile output of the above query:

"dur": 29008.200345,
"counts": {"traverserCount": 13809, "elementCount": 13809},
"name": "RepeatStep([WherePredicateStep(without([a])), ProfileStep, StoreStep(a), ProfileStep, JanusGraphVertexStep(BOTH,vertex), ProfileStep, PathFilterStep(simple), ProfileStep, RepeatEndStep, ProfileStep],until(false),emit(true))",
"annotations": {"percentDur": 52.557919400750215},
"id": "2.0.0()",
"metrics": [
  {
    "dur": 38.137699,
    "counts": {"traverserCount": 13810, "elementCount": 13810},
    "name": "WherePredicateStep(without([a]))",
    "id": "0.1.0(2.0.0())"
  },
  {
    "dur": 28628.594393,
    "counts": {"traverserCount": 252428, "elementCount": 252428},
    "name": "JanusGraphVertexStep(BOTH,vertex)",
    "annotations"
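
A rough reading of these figures, as a small Python calculation. The numbers are copied from the profile output; the interpretation is an assumption, namely that "dur" is in milliseconds and that the 13810 traversers leaving WherePredicateStep are the vertices that go on to be expanded by JanusGraphVertexStep.

```python
# Figures copied from the profile output (durations assumed to be milliseconds).
repeat_dur_ms = 29008.200345
repeat_percent_dur = 52.557919400750215
where_dur_ms = 38.137699
where_traversers = 13810          # assumed: vertices that get expanded
vertex_step_dur_ms = 28628.594393
vertex_step_traversers = 252428   # neighbours returned by both()

# Share of the RepeatStep time spent in each child step.
vertex_share = vertex_step_dur_ms / repeat_dur_ms
where_share = where_dur_ms / repeat_dur_ms

# Average number of neighbours fetched per expanded vertex, and cost per neighbour.
fan_out = vertex_step_traversers / where_traversers
ms_per_neighbour = vertex_step_dur_ms / vertex_step_traversers

# Total profiled time implied by RepeatStep's percentDur.
implied_total_ms = repeat_dur_ms / (repeat_percent_dur / 100)

print(f"JanusGraphVertexStep share of RepeatStep: {vertex_share:.1%}")
print(f"WherePredicateStep share of RepeatStep:   {where_share:.2%}")
print(f"Neighbours per expanded vertex:           {fan_out:.1f}")
print(f"Time per neighbour fetched:               {ms_per_neighbour:.3f} ms")
print(f"Implied total profiled time:              {implied_total_ms / 1000:.1f} s")
```

Under those assumptions, almost all of the RepeatStep time goes to fetching neighbours from the storage backend, and the without("a") lookup is a negligible part of it.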

HadoopMarc

Jul 24, 2021, 9:13:35 AM
to Gremlin-users
The last follow-up question was reposted as a separate thread (good idea!)

On Thursday, July 22, 2021 at 07:10:00 UTC+2, harshit.s...@gmail.com wrote: