Dedup a list (TinkerPop 3.2.1)

Jen

Oct 7, 2016, 11:12:02 AM
to Gremlin-users
Hi,

I am working on a query where I need to dedup() the result of a group() step. When I add a dedup() at the end, it does nothing. For example, on the TinkerPop modern graph (3.2.1, OLAP Gremlin with Spark):
g.V().as("v").both().both().both().as("bothV").select("v","bothV").group().by(select("v")).by(select("bothV").unfold().dedup().order().by(id).fold()).unfold().select(values)
==>[v[1], v[2], v[3], v[4], v[5], v[6]]
==>[v[1], v[2], v[3], v[4], v[5], v[6]]
==>[v[1], v[2], v[3], v[4], v[5], v[6]]
==>[v[1], v[3], v[4], v[5], v[6]]
==>[v[1], v[2], v[3], v[4], v[6]]
==>[v[1], v[2], v[3], v[4], v[5]]

I thought that if I added a dedup() at the end, it would return:
g.V().as("v").both().both().both().as("bothV").select("v","bothV").group().by(select("v")).by(select("bothV").unfold().dedup().order().by(id).fold()).unfold().select(values).dedup()
==>[v[1], v[2], v[3], v[4], v[5], v[6]]
==>[v[1], v[3], v[4], v[5], v[6]]
==>[v[1], v[2], v[3], v[4], v[6]]
==>[v[1], v[2], v[3], v[4], v[5]]


But it returns the same as without the dedup() above. Am I missing something? I don't really want to do a toSet() without deduping first, because the related query I'm working on returns quite a number of duplicated results and I don't want to overwhelm the Spark driver with a huge list in memory.

-Jen

HadoopMarc

Oct 8, 2016, 5:40:08 AM
to Gremlin-users
Hi Jen,

I think dedup(), as a filter step, only gets executed at the Spark executor level, not globally. So, try the query again with one executor. In a multi-executor setup, the final dedup() may work better when you precede it with an order().by(range(local, 0, 1)). Not tested (you may need unfold().id() if vertices are not comparable); see also:

http://tinkerpop.apache.org/docs/current/reference/#distributed-gremlin-gotchas

Cheers,    Marc


HadoopMarc

Oct 8, 2016, 2:18:59 PM
to Gremlin-users
Hi Jen,

I am afraid my previous answer was partly incorrect (read "partitions" instead of "executors"), although it still holds that the dedup() step does not work globally in OLAP. A better way to simulate a dedup over all partitions/traversers is probably to replace it with the following:

  ............group().by().by(count()).select(keys).unfold()

If this works properly (not tested), it might even be nice to have a shorthand for it, something like dedup(Scope.allTraversers).


Cheers,   Marc


Jen

Oct 10, 2016, 4:52:31 AM
to Gremlin-users
Thanks Marc! I didn't realize the dedup() step didn't work globally for OLAP. Your suggested group().by() alternative works for me.

Jen

Daniel Kuppitz

Oct 10, 2016, 6:45:35 AM
to gremli...@googlegroups.com
Yea, groupCount() seems to be a "good" workaround.

gremlin> g.V().as("v").both().both().both().as("bothV").select("v","bothV").
           group().by(select("v")).by(select("bothV").dedup().fold()).select(values).unfold().
           groupCount().select(keys).unfold()

==>[v[1],v[4],v[5],v[3],v[6]]
==>[v[1],v[2],v[4],v[5],v[3]]
==>[v[1],v[2],v[4],v[3],v[6]]
==>[v[1],v[2],v[4],v[5],v[3],v[6]]

However, OLAP's .dedup() looks buggy to me.

Cheers,
Daniel


--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-users+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/4feac9d6-e507-4f24-ae42-8baa41d52a84%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Marko Rodriguez

Oct 10, 2016, 11:56:22 AM
to gremli...@googlegroups.com
Hi,

So DedupGlobalStep is a Barrier step (thus a global barrier, not a local one). Looking at the code, DedupGlobalStep has a very complex implementation due to its ability to dedup() on labels, e.g. dedup("a","b"). I'm wondering if, in doing that work, we introduced some buggy behavior for the simpler case of a plain dedup().
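
For readers unfamiliar with label-based dedup, here is a minimal sketch of what dedup("a","b") does. It assumes a Gremlin Console session with the modern toy graph loaded into a plain TinkerGraph (OLTP, not Spark); the traversal and the labels a, b and c are illustrative and do not come from this thread.

```groovy
// Assumption: plain TinkerGraph (OLTP) with the modern toy graph.
graph = TinkerFactory.createModern()
g = graph.traversal()

// Walk person -> software -> co-creator, labelling each position in the path.
// dedup('a', 'b') keeps only the first traverser seen for each distinct
// (a, b) pair, regardless of what 'c' is.
g.V().as('a').out('created').as('b').in('created').as('c').
  dedup('a', 'b').
  select('a', 'b', 'c')
```

A plain dedup() with no labels compares the current object of each traverser instead of values taken from its path.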

If someone could submit a JIRA issue with the simplest possible query that produces the bug over one of our toy graphs (modern, grateful, crew, etc.), that would be great.

Thank you,
Marko.

Marko Rodriguez

Oct 10, 2016, 11:59:02 AM
to gremli...@googlegroups.com
Hi,

Also, as an aside: I just realized you are trying to dedup a list. Do you know about dedup(local), which dedups local data structures such as lists, sets, and maps?
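
A minimal sketch of dedup(local), assuming a Gremlin Console session with the modern toy graph in a plain TinkerGraph (OLTP); the traversal is illustrative and not taken from the original question.

```groovy
// Assumption: plain TinkerGraph (OLTP) with the modern toy graph.
graph = TinkerFactory.createModern()
g = graph.traversal()

// fold() collects every neighbour into a single list, repeats included;
// dedup(local) then removes the repeated elements inside that list.
g.V().both().fold().dedup(local)
```

Note that dedup(local) removes duplicates within each list. Dropping duplicate lists across traversers, which is what the original question asks for, still needs the global dedup().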

Marko.

Daniel Kuppitz

Oct 11, 2016, 5:03:13 AM
to gremli...@googlegroups.com
> If someone could submit a JIRA issue with the simplest possible query that produces the bug over one of our toy graphs (modern, grateful, crew, etc.), that would be great.

Unfortunately I haven't found a simpler traversal.

Cheers,
Daniel



Marko Rodriguez

Oct 11, 2016, 6:41:10 AM
to gremli...@googlegroups.com
Eek. That is a nasty traversal. There must be a simpler traversal.

It's going to be hard to find the root of the problem with so many steps interacting like this.

:|

Marko.

HadoopMarc

Oct 11, 2016, 9:06:11 AM
to Gremlin-users

This one looks a lot simpler:

gremlin> graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
==>hadoopgraph[gryoinputformat->gryooutputformat]
gremlin> g = graph.traversal(computer(SparkGraphComputer))
==>graphtraversalsource[hadoopgraph[gryoinputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().groupCount().select(values).unfold().dedup()
==>1
==>1
==>1
==>1
==>1
==>1

In TinkerGraph this gives a single "1". OLAP via remote connection also fails.

I will also put this in the comments on JIRA.
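
For comparison, the groupCount() workaround from earlier in the thread could be applied to this reproducer roughly as follows. This is a sketch only: it reuses the setup above, assumes that conf/hadoop/hadoop-gryo.properties points at the modern graph, and its output on affected versions has not been verified here.

```groovy
// Same setup as the reproducer above (assumed to load the modern toy graph).
graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
g = graph.traversal(computer(SparkGraphComputer))

// The trailing dedup() is replaced by groupCount().select(keys).unfold():
// each distinct value becomes a map key, and map keys are unique by construction.
g.V().groupCount().select(values).unfold().
  groupCount().select(keys).unfold()
```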

Cheers,    Marc


Marko Rodriguez

Oct 11, 2016, 11:14:58 AM
to gremli...@googlegroups.com
Hi,

I figured out the problem and have a PR that fixes it.


This should get into TinkerPop 3.2.3.

Thanks for helping isolate the issue,
Marko.