"connected components" on a directed graph

Jen

unread,

Jul 18, 2016, 5:30:55 PM7/18/16

to Gremlin-users

Hi,

I am trying to create an example query that returns two "connected components". I know they're not true connected components because it's a directed property graph, but I would like to return two connected clusters. I've created a simple example with two clusters:

graph = TinkerGraph.open()

client1 = graph.addVertex(label,'client','name','client1')

client2 = graph.addVertex(label,'client','name','client2')

client3 = graph.addVertex(label,'client','name','client3')

client4 = graph.addVertex(label,'client','name','client4')

order1 = graph.addVertex(label,'order','name','order1')

order2 = graph.addVertex(label,'order','name','order2')

order3 = graph.addVertex(label,'order','name','order3')

order4 = graph.addVertex(label,'order','name','order4')

order5 = graph.addVertex(label,'order','name','order5')

order6 = graph.addVertex(label,'order','name','order6')

order7 = graph.addVertex(label,'order','name','order7')

payment1 = graph.addVertex(label,'payment','name','payment1')

payment2 = graph.addVertex(label,'payment','name','payment2')

payment3 = graph.addVertex(label,'payment','name','payment3')

payment4 = graph.addVertex(label,'payment','name','payment4')

payment5 = graph.addVertex(label,'payment','name','payment5')

payment6 = graph.addVertex(label,'payment','name','payment6')

payment7 = graph.addVertex(label,'payment','name','payment7')

item1 = graph.addVertex(label,'item','name','item1')

item2 = graph.addVertex(label,'item','name','item2')

item3 = graph.addVertex(label,'item','name','item3')

item4 = graph.addVertex(label,'item','name','item4')

order1.addEdge('order_client',client1)

order1.addEdge('order_item',item1)

order2.addEdge('order_item',item1)

order2.addEdge('order_client',client2)

order3.addEdge('order_client',client2)

order3.addEdge('order_item',item2)

order4.addEdge('order_item',item3)

order4.addEdge('order_client',client3)

order5.addEdge('order_client',client3)

order5.addEdge('order_item',item4)

order6.addEdge('order_client',client3)

order6.addEdge('order_item',item4)

order7.addEdge('order_item',item4)

order7.addEdge('order_client',client4)

order1.addEdge('order_payment',payment1)

order2.addEdge('order_payment',payment2)

order3.addEdge('order_payment',payment3)

order4.addEdge('order_payment',payment4)

order5.addEdge('order_payment',payment5)

order6.addEdge('order_payment',payment6)

order7.addEdge('order_payment',payment7)

g = graph.traversal()

I would like to return the clients and items from each cluster. So (client1, client2, item1, item2) would be a cluster and (client2, client4, item3, item4) would be the other cluster.

I tried the connected components algorithm Daniel posted but that didn't return the two clusters I wanted. It returned:

==>[client1, order1, item1, payment1, order2]

==>[payment2, client2]

==>[order3, item2, payment3]

==>[payment6, order6, item4, client3, order5, order7, order4, payment5, payment7, client4, item3, payment4]

I also tried:

g.V().hasLabel('order').repeat(both().has(label,within('client','item')).both().hasLabel('order')).until(cyclicPath()).dedup().tree().by('name')

==>[order4:[client3:[order5:[client3:[order6:[:]], item4:[order6:[item4:[order7:[:]]], order5:[:]]]], item3:[order4:[:]]],

order1:[item1:[order2:[item1:[order2:[:]], client2:[order3:[item2:[order3:[:]]]]], order1:[:]]]]

But the dedup step drops off some of the nodes (e.g. client1 and client4) for some reason, and without the dedup, too many clusters are returned.

Any help would be appreciated. Let me know if you need more information.

-Jen

Daniel Kuppitz

unread,

Jul 19, 2016, 4:28:11 AM7/19/16

to gremli...@googlegroups.com

Hi Jen,

not sure what your expected result is, but I guess this one is close:

gremlin> g.V().where(without("x")).emit(cyclicPath()).
           repeat(store("x").both()).until(cyclicPath()).path().
           group().by(limit(local, 1)).
                   by(unfold().has(label, within("client","item")).dedup().values("name").fold()).
           select(values).unfold()
==>[client1, item1, client2, item2]
==>[item4, client3, client4, item3]

Cheers,

Daniel

--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/c02850de-f013-42f9-bb34-ca93051e54f9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Daniel Kuppitz

unread,

Jul 19, 2016, 4:33:35 AM7/19/16

to gremli...@googlegroups.com

Ah, you've actually mentioned your expected result, so this should be exactly what you were looking for.

However, what I forgot to mention: The main problem for this solution is that it only works in OLTP mode. I really don't know how to do it in OLAP, but I would probably write a custom vertex program.

Cheers,

Daniel

Jen

unread,

Jul 19, 2016, 9:41:13 AM7/19/16

to Gremlin-users

Thanks Daniel!

Does this not work in OLAP mode because of the OLAP local star graph limitation, or because of the limitation mixing OTLP and OLAP traversals in Tinkerpop 3.1.1? I need to do this in OLAP mode on Tinkerpop 3.1.1 so I will have to figure out how to adapt this for that.

-Jen

Daniel Kuppitz

unread,

Jul 19, 2016, 9:51:08 AM7/19/16

to gremli...@googlegroups.com

It doesn't work in because of BFS. store() in OLAP is like aggregate() in OLTP.

I will play with it a little more as part of my manual testing for the new releases; maybe I can find a query that is unreadable for everybody, but ultimately does the job :).

Cheers,

Daniel

To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/78bd05f0-5eb6-4286-b616-44a4e3b4b162%40googlegroups.com.

Daniel Kuppitz

unread,

Jul 19, 2016, 11:20:50 AM7/19/16

to gremli...@googlegroups.com

Crazy, found an OLAP solution that is able to determine all clusters. Due to star-graph limitations you won't have access to properties and labels though. However, you can determine clusters via OLAP and do the rest via OLTP (should be blazing fast as you access vertices directly by their ids).

gremlin> a = graph.traversal().withComputer()
==>graphtraversalsource[tinkergraph[vertices:22 edges:21], graphcomputer]
gremlin> g = graph.traversal()
==>graphtraversalsource[tinkergraph[vertices:22 edges:21], standard]

gremlin> clusters = a.V().emit(cyclicPath()).repeat(both()).until(cyclicPath()).path().aggregate("p").cap("p").unfold().limit(local, 1).map(__.as("v").select("p").unfold().filter(unfold().where(eq("v"))).unfold().dedup().order().by(id).fold()).toSet()
==>[v[0], v[2], v[8], v[10], v[12], v[22], v[24], v[26], v[36], v[38]]
==>[v[4], v[6], v[14], v[16], v[18], v[20], v[28], v[30], v[32], v[34], v[40], v[42]]
gremlin> clusters.collect { g.V(it).has(label, within("client", "item")).values("name").toList() }
==>[client1, client2, item1, item2]
==>[client3, client4, item3, item4]

Cheers,

Daniel

Jen

unread,

Jul 19, 2016, 1:22:20 PM7/19/16

to Gremlin-users

Fantastic! I look forward to trying out your OLAP solution. Thanks Daniel!

-Jen

Daniel Kuppitz

unread,

Jul 20, 2016, 5:43:45 AM7/20/16

to gremli...@googlegroups.com

Here's my general solution to the problem (a cluster can also be a single disconnected vertex, my previous query didn't catch that case):

g.V().emit(cyclicPath().or().not(both())).repeat(both()).until(cyclicPath()).
aggregate("p").by(path()).cap("p").unfold().limit(local, 1).dedup().map(
    __.as("v").select("p").unfold().
    filter(unfold().where(eq("v"))).
    unfold().dedup().order().by(id).fold()
).dedup()

Sample usage:

gremlin> g = TinkerGraph.open().traversal()
==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
gremlin> g.addV().addV().as("a").addV().addE("link").from("a")
==>e[3][1-link->2]
gremlin>
gremlin> g.V().emit(cyclicPath().or().not(both())).repeat(both()).until(cyclicPath()).aggregate("p").by(path()).cap("p").unfold().limit(local, 1).dedup().map(__.as("v").select("p").unfold().filter(unfold().where(eq("v"))).unfold().dedup().order().by(id).fold()).dedup()
==>[v[0]]
==>[v[1], v[2]]
gremlin> g.V().emit(cyclicPath().or().not(both())).repeat(both()).until(cyclicPath()).
gremlin>   aggregate("p").by(path()).cap("p").unfold().limit(local, 1).dedup().map(
gremlin>     __.as("v").select("p").unfold().
gremlin>     filter(unfold().where(eq("v"))).
gremlin>     unfold().dedup().order().by(id).fold()
gremlin>   ).dedup()
==>[v[0]]
==>[v[1], v[2]]

Note that the traversal uses dedup() (3 times!). Especially in higher connected graphs these dedup()'s should significantly lower the memory consumption and thus also improve the query performance.
Here's the problem though: the dedup()'s are causing trouble in OLAP:

gremlin> g.withComputer().V().emit(cyclicPath().or().not(both())).repeat(both()).until(cyclicPath()).
gremlin>   aggregate("p").by(path()).cap("p").unfold().limit(local, 1).dedup().map(
gremlin>     __.as("v").select("p").unfold().
gremlin>     filter(unfold().where(eq("v"))).
gremlin>     unfold().dedup().order().by(id).fold()
gremlin>   ).dedup()
java.lang.RuntimeException: java.lang.IllegalStateException: java.lang.IllegalArgumentException: The memory can only be set() during vertex program setup and terminate: p
Display stack trace? [yN]

The workaround is simple (but inefficient): just get rid of the first and the last dedup() and use toSet() in order to get only unique clusters:

gremlin> g.withComputer().V().emit(cyclicPath().or().not(both())).repeat(both()).until(cyclicPath()).
gremlin>   aggregate("p").by(path()).cap("p").unfold().limit(local, 1).map(
gremlin>     __.as("v").select("p").unfold().
gremlin>     filter(unfold().where(eq("v"))).
gremlin>     unfold().dedup().order().by(id).fold()
gremlin>   ).toSet()
==>[v[1], v[2]]
==>[v[0]]

The graph computer now has to process way more paths than necessary and thus it will likely require a ton of memory. I filed a ticket for this issue:

https://issues.apache.org/jira/browse/TINKERPOP-1380

Keep an eye on it if you're going to use the OLAP traversal and once the ticket is solved, use the first traversal, which is a lot more efficient.

Cheers,

Daniel

To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/e31304c5-c1e0-4d5c-8193-1ea8bf5b0291%40googlegroups.com.

Daniel Kuppitz

unread,

Jul 21, 2016, 6:20:42 AM7/21/16

to gremli...@googlegroups.com

I just figured that if we get rid of cap() altogether, everything works perfectly right:

g.withComputer().V().emit(cyclicPath().or().not(both())).repeat(both()).until(cyclicPath()).
aggregate("p").by(path()).limit(1).select("p").unfold().limit(local, 1).dedup().map(

    __.as("v").select("p").unfold().
    filter(unfold().where(eq("v"))).
    unfold().dedup().order().by(id).fold()
).dedup()

All the dedup()'s are in place and not causing any issues. I've also tested it over Jen's larger sample graph and everything looks pretty good:

gremlin> g.withComputer().V().emit(cyclicPath().or().not(both())).repeat(both()).until(cyclicPath()).
gremlin> aggregate("p").by(path()).limit(1).select("p").unfold().limit(local, 1).dedup().map(

gremlin>     __.as("v").select("p").unfold().
gremlin>     filter(unfold().where(eq("v"))).
gremlin>     unfold().dedup().order().by(id).fold()

gremlin> ).dedup().collect { g.V(it).has(label, within("client", "item")).values("name").toList() }

==>[client1, client2, item1, item2]
==>[client3, client4, item3, item4]

Now it's also worth showing how important each dedup() step is:

gremlin> g.withComputer().V().emit(cyclicPath().or().not(both())).repeat(both()).until(cyclicPath()).
gremlin> aggregate("p").by(path()).limit(1).select("p").unfold().limit(local, 1).dedup().map(

gremlin>     __.as("v").select("p").unfold().
gremlin>     filter(unfold().where(eq("v"))).
gremlin>     unfold().dedup().order().by(id).fold()

gremlin>   ).dedup().profile()
==>Traversal Metrics
Step                                                               Count Traversers       Time (ms)    % Dur
=============================================================================================================
GraphStep(vertex,[])                                                  22          22           1.441     0.66
RepeatStep(emit([OrStep([[CyclicPathStep, Profi...                   350         350          36.434    16.76
CyclicPathStep                                                                               0.000
OrStep([[CyclicPathStep, ProfileStep], [NotSt...                                             0.000
    CyclicPathStep                                                                             0.000
    NotStep(![VertexStep(BOTH,edge), ProfileStep])                                             0.000
      VertexStep(BOTH,edge)                                                                    0.000
VertexStep(BOTH,vertex)                                            676         676          27.842
RepeatEndStep (profiling ignored)                                                            0.000
AggregateStep(p,[PathStep, ProfileStep])                             350         350          34.606    15.92
PathStep                                                           350         350           4.717
RangeGlobalStep(0,1)                                                   1           1           0.025     0.01
SelectOneStep(p)                                                       1           1           0.173     0.08
UnfoldStep                                                           350         350          25.711    11.83
RangeLocalStep(0,1)                                                  350         350          65.097    29.95
DedupGlobalStep                                                       22          22          30.955    14.24
TraversalMapStep([StartStep@[v], ProfileStep, S...                    22          22          22.832    10.50
StartStep@[v]                                                       22          22           3.107
SelectOneStep(p)                                                    22          22          15.578
UnfoldStep                                                        7700        7700         384.085
TraversalFilterStep([UnfoldStep, ProfileStep,...                  1572        1572        8937.464
    UnfoldStep                                                     37564       37564        2906.001
    WherePredicateStep(eq(v))                                                                  0.000
UnfoldStep                                                        9440        9440        9342.961
DedupGlobalStep                                                    244         244        9575.249
OrderGlobalStep([[id, incr]])                                      244         244        9927.685
FoldStep                                                            22          22        9952.254
DedupGlobalStep                                                        2           2           0.108     0.05
                                            >TOTAL                     -           -         217.384        -

Cheers,

Daniel

Jen

unread,

Aug 16, 2016, 12:25:48 PM8/16/16

to Gremlin-users

Thanks Daniel for the updated query! I finally have some time now to look at the Gremlin. Most of the Gremlin makes sense to me, except the part in the "map" step. I get that that part is taking all of the cyclic paths and turning them into connected components, but I'm not sure how. Any explanation you can provide would be appreciated.

Jen

Daniel Kuppitz

unread,

Aug 16, 2016, 1:20:13 PM8/16/16

to gremli...@googlegroups.com

Hi Jen,

"p" contains all cyclic paths (and vertices w/o incident edges)
for each distinct start vertex (taken from all paths) do the map()'ing:

label the start vertex "v"
select all cyclic paths
filter those paths that go through vertex "v"
unfold those paths and collect the vertices
those vertices form a cluster / a component
the vertices are then deduplicated and folded into an array and emitted by the map() step

HTH.

Cheers,

Daniel

To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-users+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/ca068a3a-a077-4b25-b0d9-3e4459fa3701%40googlegroups.com.

Jen

unread,

Aug 17, 2016, 5:05:06 AM8/17/16

to Gremlin-users

Yes that makes sense, thanks!

Shubhangi Agarwal

unread,

Jul 21, 2019, 2:33:58 PM7/21/19

to Gremlin-users

Hey Daniel,

I'm a complete novice to gremlin and only understand few very basic functions (in, out etc). I needed to group vertex-ids by connected components in a very large graph (>11 billion edges), i.e., all vertices that are in the same connected component listed together, one such list for each of the components. It would be great if you could help. Also, withComputer() functionality is not working on my system.

Thanks

Shubhangi

On Tuesday, August 16, 2016 at 10:50:13 PM UTC+5:30, Daniel Kuppitz wrote:

Hi Jen,
"p" contains all cyclic paths (and vertices w/o incident edges)
for each distinct start vertex (taken from all paths) do the map()'ing:
label the start vertex "v"
select all cyclic paths
filter those paths that go through vertex "v"
unfold those paths and collect the vertices
those vertices form a cluster / a component
the vertices are then deduplicated and folded into an array and emitted by the map() step
HTH.

Cheers,
Daniel

To unsubscribe from this group and stop receiving emails from it, send an email to gremli...@googlegroups.com.

Reply all

Reply to author

Forward