Handling a high number of edges on a single node


vincent2...@gmail.com

Jan 17, 2020, 2:15:18 AM
to JanusGraph users
Hi All,
We are currently facing an issue where a single node has too many edges (~1M), so queries either take a long time or never return, and the UI cannot handle the load (nor does it need to). We are looking at filtering on an edge property that holds the count of edges to another node (there may be many edges to the same node), sorting the adjacent nodes by the maximum edge count from the primary node, and limiting the result to 50-100 nodes. We are seeing performance issues because the filter is on the outgoing node rather than on the edges themselves. Is there an easier way to handle this?

marc.de...@gmail.com

Jan 17, 2020, 3:11:04 AM
to JanusGraph users
Hi Vincent,

First look into:

HTH,    Marc

On Friday, January 17, 2020 at 08:15:18 UTC+1, vincent2...@gmail.com wrote:

marc.de...@gmail.com

Jan 17, 2020, 3:19:00 AM
to JanusGraph users
Hi Vincent,

I have a hard time understanding your issue. Could you provide a drawing to illustrate where the high edge counts are, what your current query is, and what result you expect from the query?

Cheers,   Marc

On Friday, January 17, 2020 at 09:11:04 UTC+1, HadoopMarc wrote:

Stephen Mallette

Jan 17, 2020, 6:43:19 AM
to janusgra...@googlegroups.com
>  the filter is on the out going node and not the edges itself. 

It sounds as though you might need to denormalize your data a bit and duplicate that vertex data to your edges so that you can index those edges appropriately. 
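In plain-Python terms (a toy sketch with hypothetical in-memory data, not JanusGraph API), the denormalization idea looks like this: a property that lives on the adjacent vertex forces a vertex lookup per edge, while a copy of it written onto the edge lets the filter (and an edge index) work on the edge records alone.

```python
# Toy illustration of denormalizing vertex data onto edges.
vertices = {"u1": {"username": "alice"}, "u2": {"username": "bob"}}

# Normalized: each edge only references its out-vertex, so filtering
# by username needs one vertex lookup per edge.
edges = [{"out": "u1", "label": "post"}, {"out": "u2", "label": "post"}]
hits = [e for e in edges if vertices[e["out"]]["username"] == "alice"]

# Denormalized: username is duplicated onto each edge at write time,
# so the filter touches edge records only.
edges_denorm = [dict(e, username=vertices[e["out"]]["username"]) for e in edges]
hits_denorm = [e for e in edges_denorm if e["username"] == "alice"]

# Both approaches select the same edges; only the access pattern differs.
assert [e["out"] for e in hits] == [e["out"] for e in hits_denorm] == ["u1"]
```

The trade-off is the usual one: writes must keep the duplicated property in sync, in exchange for reads that never leave the edge rows.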


--
You received this message because you are subscribed to the Google Groups "JanusGraph users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to janusgraph-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/janusgraph-users/ed6ab203-7edc-4ce9-acd6-d332b9cd33dd%40googlegroups.com.

amiyakr...@gmail.com

Jan 17, 2020, 8:39:05 AM
to JanusGraph users
Hi Marc,
Thanks for response.

The entities in the graph: groups and users are vertices. Users post in a group very frequently; the total number of posts per day varies from 1M to 2M.

Each post is an edge from a user to a group.

A user can send messages to other users. Some users are admins or moderators of a group; that information is stored as a relationship edge.

We are trying to find out who posts the most in a group over a date range that varies from 6 months to 1 year, i.e. show the top X users with the highest number of posts.

Using a vertex-centric index on the time property helps filter the edges for that period, but to get the highest post counts we still have to process millions of edges: group by out and in vertex, sum the number of posts, order by the aggregate value in descending order, and fetch the top X.
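The aggregation described above (filter by date range, group by user, order descending, take the top X) can be written out in plain Python as a reference for what the traversal has to compute (hypothetical in-memory data, not a JanusGraph optimization):

```python
from collections import Counter
from datetime import date

# Hypothetical 'post' edges: (user_id, group_id, post_date).
posts = [
    ("u1", "g1", date(2019, 7, 1)),
    ("u2", "g1", date(2019, 7, 2)),
    ("u1", "g1", date(2019, 8, 3)),
    ("u1", "g1", date(2018, 1, 1)),  # outside the date range below
]

def top_posters(posts, group, start, end, x):
    # Count only the posts to this group that fall inside [start, end].
    counts = Counter(
        user for user, grp, d in posts
        if grp == group and start <= d <= end
    )
    # most_common returns (user, count) pairs sorted by count, descending.
    return counts.most_common(x)

print(top_posters(posts, "g1", date(2019, 6, 1), date(2019, 12, 31), 2))
# [('u1', 2), ('u2', 1)]
```

The expensive part in the graph is exactly the generator line: every edge in the range must be touched before the counts exist, which is why the later suggestions focus on making each touched edge as cheap as possible.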


What is the recommended solution for this?

marc.de...@gmail.com

Jan 17, 2020, 10:05:41 AM
to JanusGraph users

To add to Stephen's suggestion, you might do something like:

g.V().has('group', 'group42').in('post').group().by('username').by(count())

This would retrieve all connected user vertices and put a heavy load on both JanusGraph and the backend. If you stored the username on the post edge, the query would become something like:

g.V().has('group', 'group42').inE('post').group().by('username').by(count()) 

This would only retrieve the group's edge properties and still allow you to do the group counts.

What does your query look like?

Cheers,     Marc

On Friday, January 17, 2020 at 14:39:05 UTC+1, Amiya wrote:

marc.de...@gmail.com

Jan 17, 2020, 12:17:23 PM
to JanusGraph users
I realized later on that the group-by on the in-edges can be made even more efficient in JanusGraph, since JG stores RelationIdentifiers as edge ids and a RelationIdentifier carries the out-vertex id directly. The Titan people who devised this must have had your use case in mind :-)

So we get:

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: janusgraph.imports
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.hadoop
plugin activated: tinkerpop.spark
plugin activated: tinkerpop.tinkergraph
gremlin> graph = JanusGraphFactory.open("inmemory")
==>standardjanusgraph[inmemory:[127.0.0.1]]
gremlin> GraphOfTheGodsFactory.loadWithoutMixedIndex(graph, true)
==>null
gremlin> g = graph.traversal()
==>graphtraversalsource[standardjanusgraph[inmemory:[127.0.0.1]], standard]

gremlin> a = g.V().bothE().id().next()
18:03:37 WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices [()]. For better performance, use indexes
==>35u-368-6c5-38w
gremlin> a.getClass()
==>class org.janusgraph.graphdb.relations.RelationIdentifier
gremlin> a.getInVertexId()
==>4208
gremlin> a.getOutVertexId()
==>4112

gremlin> g.V().inE().group().by(id().map{it -> it.get().getOutVertexId()}).by(count()).unfold()
18:09:38 WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices [()]. For better performance, use indexes
==>4240=3
==>8208=1
==>4112=4
==>4120=5
==>4232=4
gremlin>



Cheers,   Marc

On Friday, January 17, 2020 at 16:05:41 UTC+1, HadoopMarc wrote:

Peter Corless

Jan 17, 2020, 3:27:24 PM
to janusgra...@googlegroups.com
I know I am somewhat biased since I work here, but have you tried testing on ScyllaDB as a backing store? It might be able to scale JanusGraph better.

What's your backing store right now?

Also, can you describe the hardware you're running on? CPUs/cores per node, RAM and Storage per node? (i.e., to rule out things like running from HDDs [shudder] vs. SSDs, too little CPU horsepower, RAM limits, etc.)


Amiya

Jan 21, 2020, 11:04:08 AM
to JanusGraph users
Hi Marc,

Thanks. Preliminary testing against sample data shows that using RelationIdentifier improves the performance greatly. I will check against our exact failure data.

This is similar to what we use currently:

g.V().has('groupId', 'grp14').as('p').
    inE('post').has('date', between(fromDate, endDate)).outV().as('ip').
    dedup('p', 'ip').
    store('paths').by(
        union(
            select('p').valueMap('groupName', 'groupType').project('from'),
            valueMap('userId', 'userName').project('to'),
            select('p').inE('post').has('date', between(fromDate, endDate)).as('e').outV().where(eq('ip')).select('e').
                union(
                    values('date').dedup().fold().project('dates'),
                    values('text').count().project('count')
                ).fold().project('postLink')
        ).fold()
    )


I will rewrite this query to get the required output using a group-by with RelationIdentifier; let's see the performance.

Thanks for the reference, BTW.

Cheers,
Amiya

Amiya

Jan 21, 2020, 11:08:31 AM
to JanusGraph users
Hi Peter,

Currently we are using HBase. We have not tested with ScyllaDB, but yes, it's in our plan to test with Cassandra/ScyllaDB.

About the configuration: let me check and get back to you.

Thanks,
Amiya

