gremlin transitive closure 2nd

282 views
Skip to first unread message

Daniel

unread,
Jul 5, 2011, 3:59:12 AM7/5/11
to Gremlin-users
Hi all,

I'm working on a software analysis project where I have a graph which
models source code files and their static dependencies, but also
change sets from a source control system and also the developers
working on that source base. Just as an intro.

And then I found Gremlin and Blueprint and was amazed about the whole
goodness! Thanks for the effort!

With a first query I want to calculate the impact of changing a source
file (the transitive closure). A similar question was already
discussed on this list [0]. The difference is, that I need to loop
until I collected all reachable vertices and not only while (it.loops
< X).

This is my current query:

//results is a Set which I put into scope via jsr223
visited = [] as Set
g.idx('I')
[['name':'A']].as('X').aggregate(results).back(1).outE('link').except(visited).aggregate(visited).back(1).inV.loop('X')
{true}


1. Is this correct? Are there more elegant ways to do this?
2. In my first approach I tracked whether the visited set has changed
to phrase a loop termination condition. But the loop terminates in the
above query because the loop step is not reachable anymore (guarded by
except(visited)). Is this reasoning correct?
3. Although 'aggregate' is described as 'emits input, but adds input
to a collection' I need the back(1) to get the expected behaviour.
Does this relate to https://github.com/tinkerpop/gremlin/issues/207 ?


Again many thanks for this great project! Expect some more questions
in the near future.

Daniel

[0] http://groups.google.com/group/gremlin-users/browse_thread/thread/af9c3dce9bf9920

Marko Rodriguez

unread,
Jul 5, 2011, 2:18:27 PM7/5/11
to gremli...@googlegroups.com
Hi Daniel,

> With a first query I want to calculate the impact of changing a source
> file (the transitive closure). A similar question was already
> discussed on this list [0]. The difference is, that I need to loop
> until I collected all reachable vertices and not only while (it.loops
> < X).
>
> This is my current query:
>
> //results is a Set which I put into scope via jsr223
> visited = [] as Set
> g.idx('I')
> [['name':'A']].as('X').aggregate(results).back(1).outE('link').except(visited).aggregate(visited).back(1).inV.loop('X')
> {true}

> 1. Is this correct? Are there more elegant ways to do this?

Looping and aggregate are not going to behave well together. More on that later.

> 2. In my first approach I tracked whether the visited set has changed
> to phrase a loop termination condition. But the loop terminates in the
> above query because the loop step is not reachable anymore (guarded by
> except(visited)). Is this reasoning correct?

Its dies cause of aggregrate() locking. More on that later.

> 3. Although 'aggregate' is described as 'emits input, but adds input
> to a collection' I need the back(1) to get the expected behaviour.
> Does this relate to https://github.com/tinkerpop/gremlin/issues/207 ?

Yea. I need to rectify this issue. I believe I know how to do it, but haven't gotten around to it. In short, aggregate() will "while(next)" to fill the provided collection (e.g. results). Once its drained the while(next), it will lock and only provide an iterator to the provided collection. As such, when looping, the aggregate() step is locked from the previous loop and thus, this causes problems. ... .or, behavior that is not generally desired.

SIDENOTE: In your example query, you don't need to go back(1) as aggregate just streams the collection provided.

Let me work on that ticket as it has been a problem for others. Keep an eye to the list to see how it develops.

> Again many thanks for this great project! Expect some more questions
> in the near future.

No problem. Keep them coming.

Thanks Daniel,
Marko.

http://markorodriguez.com

Daniel

unread,
Jul 5, 2011, 7:53:41 PM7/5/11
to Gremlin-users
Hi Marko,
many thanks for your explanations. After having a look at the
implementation of AggregatorPipe [0] I think I can see now what
happens. 'aggregate' is much more powerful and different from what I
expected after reading its step description [1] on the github wiki. 

You may want to adjust the description from:
'emits input, but adds input to a collection'
to something along the line:
'adds input to collection, emits collection'

Regarding your side note. It definitely matters if I put the 'back(1)'
after 'aggregate' and now it even seems to make sense to me somehow.
Although 'aggregate' is locked from the second iteration I don't care
by just jumping one step back and go on with the previous iterator..
Okay, I have no clue how back(1) and loop works :) but I will
definitely have a look at it.

Here is a sample session with rexter (cloned today) on the
gratefulgraph:

// using back(1)
visited = [] as Set
result = [] as Set
g.V[['type':'song', 'name':'TELL
MAMA']].as('X').aggregate(result).back(1).outE('followed_by').except(visited).aggregate(visited).back(1).inV.loop('X')
{true}
result.size()
==>338

// without back(1)
visited = [] as Set
result = [] as Set
g.V[['type':'song', 'name':'TELL
MAMA']].as('X').aggregate(result).outE('followed_by').except(visited).aggregate(visited).back(1).inV.loop('X')
{true}
result.size()
==>1


Now after having realized that aggregate behaves different I tried
this:

visited = [] as Set
result = [] as Set
oldCnt = 0
g.V[['type':'song', 'name':'TELL
MAMA']].as('X').sideEffect{result.add(it)}.outE('followed_by').except(visited).sideEffect{visited.add(it)}.inV.loop('X')
{r = (oldCnt != visited.size()); oldCnt = visited.size(); r}
result.size()
==>338

Does it look like a reasonable approach?

Thanks
Daniel


[0] https://github.com/tinkerpop/pipes/blob/master/src/main/java/com/tinkerpop/pipes/sideeffect/AggregatorPipe.java
[1] https://github.com/tinkerpop/gremlin/wiki/Gremlin-Steps

Marko Rodriguez

unread,
Jul 5, 2011, 8:00:05 PM7/5/11
to gremli...@googlegroups.com
Hey,

> many thanks for your explanations. After having a look at the
> implementation of AggregatorPipe [0] I think I can see now what
> happens. 'aggregate' is much more powerful and different from what I
> expected after reading its step description [1] on the github wiki.

Yea. AggregatorPipe is very much a different kind of pipe than the rest. It truly is what makes the Gremlin's depth-first architecture into a breadth-first one.

> You may want to adjust the description from:
> 'emits input, but adds input to a collection'
> to something along the line:
> 'adds input to collection, emits collection'

I'm actually working on a rewrite of AggregatorPipe right now. In short, it will always emit what came through it and, in the background, fill the aggregation collection.

Given the nature of this situation --- you've touched a very critical aspect of Gremlin --- perhaps we can IM and go back and forth on things as I'm doing some work that may be of use to you and would like to iterate on it with you.

My Jabber/XMPP is okram...@gmail.com or skyparko on Skype.

Thanks,
Marko.

http://markorodriguez.com

Marko Rodriguez

unread,
Jul 5, 2011, 8:22:43 PM7/5/11
to gremli...@googlegroups.com
Hi,

So, I've pushed/deployed a new version of AggregatorPipe to GitHub/TinkerPop Maven2. This version is truly a SideEffectPipe in that the incoming objects are the same as the outgoing objects with not alteration to their order, number, etc. As such, the internal aggregation collection is NOT the iterator of the pipe. There are not two internal collections--the user provided collection (aggregate) and a queue that is drained on next(). Moreover, getPath() works correctly as there is yet another queue for the paths.

Finally, AggregatorPipe no longer "locks" when its incoming objects have been drained. The reason this is good is so that when a loop() throws an object back to the beginning, there are still more incoming objects. This works great now save one small caveat that I haven't found a solution to: it.loops in the closure of loop() does not behave correctly. I need to think on this still to find a solution.

As Daniel (and others) have realized, AggregatorPipe is a different kind of beast as it turns a depth-first engine into a breadth-first engine. Moreover, it really breaks from the Pipes model and causes a whole lot of problems that make them self apparent in various situations -- looping, back tracking, paths, etc. We are close to getting it squared away.

I've pushed what I have to Gremlin 1.2-SNAPSHOT. Please feel free to git pull and play.

Thanks,
Marko.

Pierre De Wilde

unread,
Jul 6, 2011, 7:56:00 AM7/6/11
to gremli...@googlegroups.com
Hi Marko,

Thank you for this update.

I modified your example a little bit  (using HashMap instead of Set) to see exactly how the collection was filled:

gremlin> x = []
gremlin> c = 0
==>0
gremlin> g.v(1).out.aggregate(x).loop(2){c++ < 3} >> -1
gremlin> x
==>v[2]
==>v[3]
==>v[4]
==>v[5]
==>v[3]

Breadth-first or depth-first? Cannot figure out. So, rewrote the example like this:

gremlin> x = []                                        
gremlin> c = 0                                         
==>0
gremlin> g.v(1).out('knows','created').aggregate(x).loop(2){c++ < 3} >> -1
gremlin> x
==>v[2]
==>v[4]
==>v[3]
==>v[5]
==>v[3]

OK, it's breadth-first traversal. So, what about depth-first traversal? Tried this:

gremlin> x = []                                                           
gremlin> c = 0                                                            
==>0
gremlin> g.v(1).out('knows','created').sideEffect{x.add(it)}.loop(2){c++ < 3} >> -1
gremlin> x
==>v[2]
==>v[4]
==>v[3]
==>v[5]
==>v[3]

Still breadth-first traversal. Tried this:

gremlin> x = []                                                                    
gremlin> g.v(1).out('knows','created').sideEffect{x.add(it)}.out('knows','created').sideEffect{x.add(it)} >> -1
gremlin> x
==>v[2]
==>v[4]
==>v[5]
==>v[3]
==>v[3]

Now depth-first traversal.

? I don't understand the difference between the last 2 examples.

Thanks,
Pierre


2011/7/6 Marko Rodriguez <okram...@gmail.com>

Marko Rodriguez

unread,
Jul 6, 2011, 10:20:41 AM7/6/11
to gremli...@googlegroups.com
Hi Pierre,

You have stumbled upon another dirty secret. LoopPipe, when complete has an ordering of depth-first, but when in loop the ordering is breadth-first. Here is why.

Loop maintains two iterators --- the incoming pipe and an internal "loop queue." A random example below:

gremlin> g.v(1).both.in.out.loop(2){true}.toString() ==>[BothPipe, LoopPipe[[InPipe, OutPipe]]]

The BothPipe iterator and a Queue (LinkedList) are the Iterators going into LoopPipe. When the last OutPipe emits its value and the loop closure says "loop it" (i.e. returns true), the object is taken from the OutPipe and put into the Queue. The Queue has priority over the BothPipe (a desire to keep it lazy/memory efficient). However, that queued object simply can't be the next thing operated on by InPipe/OutPipe as InPipe/OutPipe still have some elements in them. So the LoopPipe segment needs to flush the remaining objects (filling the queue as appropriate). Finally, at which point, the Queue is popped for its next object (not BothPipe). In essence, this creates breadth-first behavior in the loop line, but depth-first outside the loop.

gremlin> g.v(1).out.loop(1){it.loops < 3} == g.v(1).out.out ==>true
 
What would be the more natural, but (at the time) hard way to code it would be to have the queue immediately run without a "flush" of the LoopPipe. ... Code-wise, it just got overly complicated and as such, LoopPipe is the way it is.

For traversals where there is no state saved internal to the LoopPipe, all is good. But when you use sideEffect{x.add(it}} within a LoopPipe, then this behavior becomes apparent.

In short, there is an impurity in LoopPipe. AggregatorPipe has been rectified, but still does not work correctly with LoopPipe because of this stateful-ness aspect you bring up. There is more to the story with LoopPipe "bundles" (metadata saved in the Queue with the object -- it.object, it.loops, it.path) that complicate things further.

Anywho, a purer (depth-first all the way) LoopPipe that is efficient would be an excellent contribution to Gremlin.

Thanks for the honest eyes Pierre,
Marko.

Daniel

unread,
Jul 25, 2011, 4:05:28 PM7/25/11
to Gremlin-users
Hi all,

in the meantime I tried to implement LoopPipe myself. I omitted
everything related to paths. Here is a version which passes 3 of the 4
LoopPipeTests (all except the path related test).

My "solution" is inspired by
http://lingpipe-blog.com/2009/01/27/quiz-answer-continuation-passing-style-for-converting-visitors-to-iterators/
. I'm also sure I could save some lines by using AbstractPipe. Here is
the code:

public class LoopPipe2<S> implements Pipe<S, S>, MetaPipe {

private final Closure closure;
private final Pipe<S, S> pipe;
protected Iterator<S> starts;

private S nextResult;
private boolean searched = false;
private final Stack<LoopContext> remainingInputs = new
Stack<LoopContext>();

public LoopPipe2(final Pipe<S, S> pipe, final Closure closure) {
this.pipe = pipe;
this.closure = closure;
this.closure.setDelegate(this);
}

private void checkSearched() {
if(!searched) {
nextResult = findNext();
searched = true;
}
}

public boolean hasNext() {
checkSearched();
return nextResult != null;
}

public S next() {
if (!hasNext()) throw new NoSuchElementException();
searched = false;
return nextResult;
}

/* Applies the given pipe as a funtion. I need to drain the output
iterator
to be allowed to reuse it in deeper iterations. */
private Iterator<S> applyPipe(Pipe<S,S> p, S elem) {
p.reset();
p.setStarts(elemAsIterator(elem)); // feed pipe with first
element
ArrayList<S> kids = new ArrayList<S>(); // results
for(S s: p){ kids.add(s); } //drain pipe
return kids.iterator(); // result iterator
}

/* I'm sure this can be done more efficient =) */
private Iterator<S> elemAsIterator(S s) {
return Arrays.asList(s).iterator();
}

/* As I wrote, I didn't care about the path. */
private List FAKE_PATH() { return new LinkedList(Arrays.asList(new
Object()));}; // needs at least one element

private S findNext() {
while(!remainingInputs.isEmpty()) {
LoopContext c = remainingInputs.peek();
if(c.elems.hasNext()) {
S nextElem = c.elems.next();
if (!(Boolean) closure.call(new Bundle<S>(nextElem,
FAKE_PATH(), c.loops))) {
return nextElem; // return path to nextElem as
well?
}
Iterator<S> kidsIt = applyPipe(pipe, nextElem); // do
one step (remember path somehow)
remainingInputs.push(new LoopContext(kidsIt, c.loops +
1)); // add path to context
} else {
remainingInputs.pop();
}
}
return null;
}

private class LoopContext {
// here would be the place to save the path
final int loops;
final Iterator<S> elems;
public LoopContext(Iterator<S> elems, int loops) { this.elems =
elems; this.loops = loops;}
}

public String toString() { return super.toString() + "[" +
this.pipe + "]"; }

public List<Pipe> getPipes() { return (List) Arrays.asList(pipe); }
public Iterator<S> iterator() { return this; }

public void setStarts(final Iterator<S> starts) { this.starts =
starts; remainingInputs.push(new LoopContext(starts, 1)); }
public void setStarts(final Pipe<?, S> starts)
{ this.setStarts(starts.iterator()); }
public void setStarts(final Iterable<S> starts)
{ this.setStarts(starts.iterator()); }

public List getPath() { return Collections.emptyList(); } // Not
working

public void reset() { throw new UnsupportedOperationException("No
idea what to do here"); }
public void remove() { throw new UnsupportedOperationException(); }
}

And here a few observations:
1. It's not easy to compose Pipes because of the stateful Iterator in/
output.
2. I didn't really understand when I'm allowed to reset a pipe.
3. I copied "public Iterator<S> iterator() { return this; }" from
AbstractPipe but I think the intended behaviour would be to return a
new Iterator each time it is called.

Anyway since I prefer scala over groovy we started to implement a
gremlin like graph traversal dsl embedded in scala. Maybe the biggest
difference is that we convert the Vertex#getOutEdges Iterable to a
Stream (immutable lazy Sequence). Because of this we can freely write
and compose pure functions over this datastructure. I have to check
with our customer if we are allowed to opensource it.

Here is an example snippet:
val res = g.v(3).outE("knows").inV.where(_.outE("knows").inV("name" →
"caro"))

One advantage is, that code like the following is rejected by the
compiler
g.v(3).outE("knows").outE

We will see in which direction this all goes. Let me know what you
think about all that.

Thanks Daniel
> > gremlin> g.v(1).out('knows','created').sideEffect{x.add(it)}.out('knows','created'). sideEffect{x.add(it)} >> -1
> > gremlin> x
> > ==>v[2]
> > ==>v[4]
> > ==>v[5]
> > ==>v[3]
> > ==>v[3]
>
> > Now depth-first traversal.
>
> > ? I don't understand the difference between the last 2 examples.
>
> > Thanks,
> > Pierre
>
> > 2011/7/6 Marko Rodriguez <okramma...@gmail.com>
> >> MAMA']].as('X').aggregate(result).back(1).outE('followed_by').except(visite d).aggregate(visited).back(1).inV.loop('X')
> >> {true}
> >> result.size()
> >> ==>338
>
> >> // without back(1)
> >> visited = [] as Set
> >> result = [] as Set
> >> g.V[['type':'song', 'name':'TELL
> >> MAMA']].as('X').aggregate(result).outE('followed_by').except(visited).aggre gate(visited).back(1).inV.loop('X')
> >> {true}
> >> result.size()
> >> ==>1
>
> >> Now after having realized that aggregate behaves different I tried
> >> this:
>
> >> visited = [] as Set
> >> result = [] as Set
> >> oldCnt = 0
> >> g.V[['type':'song', 'name':'TELL
> >> MAMA']].as('X').sideEffect{result.add(it)}.outE('followed_by').except(visit ed).sideEffect{visited.add(it)}.inV.loop('X')
> >> {r = (oldCnt != visited.size()); oldCnt = visited.size(); r}
> >> result.size()
> >> ==>338
>
> >> Does it look like a reasonable approach?
>
> >> Thanks
> >> Daniel
>
> >> [0]https://github.com/tinkerpop/pipes/blob/master/src/main/java/com/tink...
> >> [1]https://github.com/tinkerpop/gremlin/wiki/Gremlin-Steps
Reply all
Reply to author
Forward
0 new messages