Aggregating edges based on the source & target vertex attributes

vishnu gajendran

Dec 17, 2020, 2:57:38 AM
to JanusGraph users
Hello,

I request your help regarding a JanusGraph query which I am trying to construct. Let's consider the following graph, where each vertex denotes a person and an edge between any two vertices denotes collaboration between them.

Vertices:
p1 = graph.addVertex('person')
p1.property('personId', 1)
p1.property('organization', "engineering")

p2 = graph.addVertex('person')
p2.property('personId', 2)
p2.property('organization', "sales")

p3 = graph.addVertex('person')
p3.property('personId', 3)
p3.property('organization', "marketing")

p4 = graph.addVertex('person')
p4.property('personId', 4)
p4.property('organization', "engineering")

Edges:
p1.addEdge('collaboration', p2, 'collaborationHours', 1)
p1.addEdge('collaboration', p3, 'collaborationHours', 2)

p2.addEdge('collaboration', p3, 'collaborationHours', 2)

p3.addEdge('collaboration', p4, ' collaborationHours', 2)

p4.addEdge('collaboration', p2, ' collaborationHours', 2)

Expected Result is the following table:

Organization1    Organization2    Total Collaboration Hours
Engineering      Sales            4
Engineering      Marketing        2
Sales            Marketing        2
Marketing        Engineering      2

Here, I am trying to aggregate the "person to person" graph into an "organization to organization" graph. Does JanusGraph support such aggregation queries? If yes, can you please help me with the query for the same?

Thanks

Kevin Schmidt

Dec 17, 2020, 8:50:11 AM
to JanusGraph users list
Vishnu,

This may not be optimal, but seems to work:

g.E().hasLabel('collaboration').as('e').outV().values('organization').as('1').select('e').inV().values('organization').as('2').select('e').group().by(select('1', '2')).by(values('collaborationHours').sum()).unfold();

==>{1=engineering, 2=marketing}=2
==>{1=marketing, 2=engineering}=2
==>{1=engineering, 2=sales}=3
==>{1=sales, 2=marketing}=2

Note, you have some leading spaces on 'collaborationHours' in your Gremlin that I had to remove, and with the data you provided the engineering/sales total is 3, not 4.

Kevin 


HadoopMarc

Dec 17, 2020, 9:10:57 AM
to JanusGraph users
And here is a small variation without the keys and with some code formatting:

g.V().as('a').outE().as('e').inV().as('b').
    group().by(
        union(select('a').values('organization'), select('b').values('organization')).fold()
    ).by(
        select('e').by('collaborationHours').sum()
    ).unfold()
==>[marketing, engineering]=2
==>[sales, marketing]=2
==>[engineering, sales]=3
==>[engineering, marketing]=2
Marc




Kevin Schmidt

Dec 17, 2020, 9:21:56 AM
to JanusGraph users list
Thanks for improving it!  Always good to learn more.

vishnu gajendran

Dec 21, 2020, 3:16:08 AM
to JanusGraph users

Thank you Kevin and Marc for the quick response. I tried both queries and they are working as expected. My use case demands running such a query on a bigger dataset. I ran the query for 1 lakh (100,000) vertices and 5 million edges on my desktop using the in-memory backend (assuming that in-memory will be faster compared to other external data stores), and it took roughly 2 minutes to execute. My desktop has 8 logical cores and 64 GB RAM. A few questions regarding the same:

1. Is this the expected performance for such aggregation queries in JanusGraph?
2. Will increasing the number of cores (i.e. processing power) improve the performance of the query?

The dataset I am dealing with can be as big as 1.5 lakh (150,000) vertices and 20 million edges, and I would like to support the above aggregation query in real time (i.e. in a few seconds, not minutes). Can we achieve the same using JanusGraph?

HadoopMarc

Dec 21, 2020, 6:49:56 AM
to JanusGraph users
Hi Vishnu,

The processing time does not really surprise me; JanusGraph has to do everything in Java. For the typical JanusGraph use case, the storage backend is the limiting factor and the Java processing does not really matter. If you want to do this query fast in memory with multiple cores, you are better off with Python Dask or the like, and do the aggregation on a single dataframe holding, per edge, the organization of the outV and the inV plus the collaborationHours (see the sketch below). I would not be surprised if pandas, using a single core, already does this within a second.
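
A rough, untested sketch of getting the data out of JanusGraph into such a dataframe: project the two organizations and the hours per edge from the Gremlin console and write them to a CSV file. The file name collab_edges.csv and the CSV format are just placeholders for illustration:

// Untested sketch: dump (outV organization, inV organization, hours) per edge
// to CSV so that pandas/Dask can do the group-by outside JanusGraph.
new File('collab_edges.csv').withWriter { w ->
    w.writeLine('org1,org2,hours')
    g.E().hasLabel('collaboration').
        project('org1', 'org2', 'hours').
          by(outV().values('organization')).
          by(inV().values('organization')).
          by('collaborationHours').
        toList().each { row ->
            w.writeLine("${row['org1']},${row['org2']},${row['hours']}")
        }
}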

For the queries given above I believe only a single core is used when they are run as an OLTP query. Because this N x N query is not easy to parallelize for TinkerPop, you have to take care how you run it as an OLAP query. I would guess that withComputer(SparkGraphComputer) with a single Spark executor with 8 cores will work best, because then the Spark cores share the memory. This is automatically true for spark.master=local[*].
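
As a rough, untested sketch, the OLAP setup in the Gremlin console could look like the following. The properties file path is a placeholder: it has to configure HadoopGraph with an input format for your storage backend and set spark.master=local[*], and the spark-gremlin plugin must be activated in the console. Also note that the aggregation traversal itself may need rework for OLAP, because an OLAP traversal only sees the local star graph of each vertex:

// Untested sketch of an OLAP traversal source backed by SparkGraphComputer.
// 'conf/hadoop-graph/read-backend.properties' is a placeholder file; among
// other things it would set spark.master=local[*] so one executor uses all 8 cores.
graph = GraphFactory.open('conf/hadoop-graph/read-backend.properties')
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().count()   // quick sanity check before trying the aggregation query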

Best wishes,    Marc

PS Thanks for introducing me to the Indian numbering system. Happily, you do not have 1.5 crore vertices!



vishnu gajendran

Dec 24, 2020, 5:23:38 AM
to JanusGraph users
Thank you Marc. As you mentioned, I might be able to execute the above-mentioned aggregation query faster if I use other tools/datastores. But I was exploring JanusGraph primarily for OLAP use cases: for example, running graph algorithms like PageRank, BFS, etc. on the fly, and for graph visualization, where I would like to aggregate the nodes & edges on the fly based on user-selected node/edge attributes. I was thinking that JanusGraph might be optimal for these use cases. Correct me if I am wrong.

I will also try the SparkGraphComputer as you suggested.
