OLAP graph filter to execute query on a neighborhood of a vertex

77 views
Skip to first unread message

Yevgeniy Ignatyev

unread,
Mar 21, 2018, 7:13:03 AM3/21/18
to Gremlin-users
Hello.

The problem I am trying to solve is that we have a large graph, but need to execute OLAP only on a 2-hop network of a selected node, e.g. PageRank.
Currently, as each vertex has not very large surrounding network, we select all the vertices and edges from the 2-hop network, build in-memory TinkerGraph and run .pageRank step on it. While trying to rewrite the query to be handled by OLAP on the full graph we found that, if we ran PageRank calculation on the in-memory graph built from "edges" and "vertices", we get results different from the case using SubgraphStrategy as a GraphFilter with the OLAP on the whole graph. Discrepancies are both with the absolute ranking of vertices and their relative rank order.

Here is the example of executing both of the queries on the synthetic graph (based on our graph structure), that outputs vertices numbers and their ranks to console: https://github.com/YevIgn/tinkerpop-subgraph-pagerank

Could you, please, explain why the results are different and considering the small size of the subgraphs, we usually get, will there be any benefit of calculating page rank on OLAP traversal with filter on the whole graph?

Also, can the query with graph filtering be rewritten in a way that it doesn't require materialization of vertices collection in-memory? Up to the current time I wasn't able to figure out how to circumvent the restrictions on vertices/edges filters, does it require custom TraversalStrategy implementation?

Best regards,
Evgeniy Ignatiev.

Daniel Kuppitz

unread,
Mar 21, 2018, 1:15:07 PM3/21/18
to gremli...@googlegroups.com
The PageRankVertexProgram and the SubGraphStrategy do not work together. PageRankVertexProgram ignores (or rather is not aware of) the SubgraphStrategy filter when it sends its message, hence you'll end up having a completely different result. What you really need, are graph filters (push-down predicates) that filter the elements of an OLAP graph before any other operation is applied. I tweaked your code a bit and created a Gist that shows how to use a graph filter:


Cheers,
Daniel


--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-users+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/7cd0d7ab-963a-4a78-be32-04d2a4f5840e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Yevgeniy Ignatyev

unread,
Mar 22, 2018, 8:14:36 AM3/22/18
to Gremlin-users
Thanks a lot!

Also, could you please, explain why the following query to get only vertex/edge ids returns incomplete results? - i.e. not all edges that actually are in a subgraph:

g
 
.V()
 
.has(NUMBER, 204984)
 
.emit()
 
.repeat(__.bothE().dedup().store(EDGES).otherV())
 
.times(2)
 
.dedup()
 
.aggregate(VERTICES).by(T.id)
 
.bothE()
 
.where(P.without(EDGES))
 
.as(EDGE)
 
.otherV()
 
.where(__.hasId(P.within(VERTICES)))
 
.select(EDGE)
 
.store(EDGES)
 
.cap(VERTICES, EDGES)
 
.next();

What changes does "by" modulator cause that lead to this behavior? The goal of the query is to pull only ids to application, as vertices/edges have a number of properties that will unnecessarily consume memory.

The query correctly returns vertices ids, but as far as I understand the documentation, graph filter does not impose restriction on edges that only an edge to which both incident vertices pass vertex filter is selected, so I see results different from the calculation on the subgraph.

Best regards,
Evgeniy Ignatiev.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.

Daniel Kuppitz

unread,
Mar 22, 2018, 12:53:55 PM3/22/18
to gremli...@googlegroups.com
I'm not sure where you see a difference. I get the same result with and without the by() modulator. In both cases 4678 vertices and 6881 edges. Also, the PR values seem to be exactly the same.

Cheers,
Daniel


To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-users+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/17a17f00-7308-493e-bd40-41fd643f2986%40googlegroups.com.

Evgeniy Ignatiev

unread,
Mar 23, 2018, 3:59:17 AM3/23/18
to gremli...@googlegroups.com

Actually I see 7077 edges without by() modulator.

Best regards,
Evgeniy Ignatiev.

Evgeniy Ignatiev

unread,
Mar 23, 2018, 10:59:29 AM3/23/18
to gremli...@googlegroups.com

I pushed a commit to the https://github.com/YevIgn/tinkerpop-subgraph-pagerank which adds code from your gist and also printing results for the case when ids of vertices and edges are extracted.

Attached console log of what I get from running it - from the first two runs I get 7077 edges, from the third one (ids loaded only) - 6881.

Do I miss something that gives these discrepancy?

Best regards,
Evgeniy Ignatiev.


On 3/22/2018 8:53 PM, Daniel Kuppitz wrote:
main.log

Daniel Kuppitz

unread,
Mar 23, 2018, 11:59:43 AM3/23/18
to gremli...@googlegroups.com
Oh right, I forgot that I fixed that. Your id query is not quite right, here's the patch:

diff --git a/src/main/java/test/tinkerpop/subgraph/pagerank/Main.java b/src/main/java/test/tinkerpop/subgraph/pagerank/Main.java
index 4f25d22..2f56a84 100644
--- a/src/main/java/test/tinkerpop/subgraph/pagerank/Main.java
+++ b/src/main/java/test/tinkerpop/subgraph/pagerank/Main.java
@@ -150,10 +150,10 @@ public class Main {
                 .dedup()
                 .aggregate(VERTICES).by(T.id)
                 .bothE()
-                .where(P.without(EDGES))
+                .where(P.without(EDGES)).by(T.id).by()
                 .as(EDGE)
                 .otherV()
-                .where(__.hasId(P.within(VERTICES)))
+                .where(P.within(VERTICES)).by(T.id).by()
                 .select(EDGE)
                 .store(EDGES).by(T.id)
                 .cap(VERTICES, EDGES)

Cheers,
Daniel


--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-users+unsubscribe@googlegroups.com.

Evgeniy Ignatiev

unread,
Mar 26, 2018, 7:09:34 AM3/26/18
to gremli...@googlegroups.com

Thank you! Now I see the same results for all queries. Also could you, please, explain why the results are different while using hasId? I have not found any information on this point in Where step documentation.

Best regards,
Evgeniy Ignatiev.

Daniel Kuppitz

unread,
Mar 26, 2018, 9:11:57 AM3/26/18
to gremli...@googlegroups.com
hasId() will interpret VERTICES as what it is - a String.

Cheers,
Daniel

--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-users+unsubscribe@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages