shortest path with query-specified dynamic path costs

shapeshifter

unread,

Apr 24, 2019, 12:49:18 PM4/24/19

to Gremlin-users

I'm trying to do a shortest path query utilising path cost rather than edge count (something like A* I guess). The shortest path recipe found at http://tinkerpop.apache.org/docs/current/recipes/#shortest-path would be good enough normally, but for my task the individual costs need to be determined at run-time by the query itself. To elaborate...

Certain edges have a property we'll call 'direction', and its value will be one of 'north', 'south', 'east', 'west'. It's important this property value is kept as a string, but if it helps the query, I'm happy to add a shadow numeric and normalised property like direction_v=1

Each query specifies the starting node, edge type to follow (only one type will have this property anyway), target node and the query-specific cost values (eg. north=3, south=2, east=1 and west=2). The returned result needs to contain the shortest path and total cost.

I've searched for similar problems/solutions, and the closest I've found is https://stackoverflow.com/questions/45914862/how-to-sum-the-weight-with-coefficient-in-janusgraph

I don't yet know enough gremlin (and certainly not groovy where that was given) to know if I can adapt any of those answers to my problem.

As an added complication, given that I can do the above, I really need to run (or wrap) the query with(in) multiple starting nodes (otherwise I will be running the same query with a different starting node multiple times and choosing the overall shortest path).

Thanks in advance

Richard

Daniel Kuppitz

unread,

Apr 24, 2019, 1:26:54 PM4/24/19

to gremli...@googlegroups.com

Let's just take the graph from the short path recipe and add some direction values.

gremlin> r = new Random()
==>java.util.Random@35d5ac51
gremlin> directions = ['north','south','east','west']
==>north
==>south
==>east
==>west

gremlin> g = TinkerGraph.open().traversal()==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
gremlin> g.addV().property(id, 1).as('1').
......1> addV().property(id, 2).as('2').
......2> addV().property(id, 3).as('3').
......3> addV().property(id, 4).as('4').
......4> addV().property(id, 5).as('5').
......5> addE('connection').from('1').to('2').property('direction', directions[r.nextInt(directions.size())]).
......6> addE('connection').from('2').to('4').property('direction', directions[r.nextInt(directions.size())]).
......7> addE('connection').from('4').to('5').property('direction', directions[r.nextInt(directions.size())]).
......8> addE('connection').from('2').to('3').property('direction', directions[r.nextInt(directions.size())]).
......9> addE('connection').from('3').to('4').property('direction', directions[r.nextInt(directions.size())]).iterate()

Let's see the paths from V(1) to V(5) (of course, they are the same as in the recipe).

gremlin> g.V(1).
......1> repeat(outE().inV().simplePath()).
......2> until(hasId(5)).
......3> path()
==>[v[1],e[0][1-connection->2],v[2],e[1][2-connection->4],v[4],e[2][4-connection->5],v[5]]
==>[v[1],e[0][1-connection->2],v[2],e[3][2-connection->3],v[3],e[4][3-connection->4],v[4],e[2][4-connection->5],v[5]]

gremlin> g.V(1).
......1> repeat(outE().inV().simplePath()).
......2> until(hasId(5)).
......3> path().
......4> by().
......5> by('direction')
==>[v[1],north,v[2],east,v[4],east,v[5]]
==>[v[1],north,v[2],east,v[3],west,v[4],east,v[5]]

Now here comes the interesting part for you: adding the cost value for each direction:

gremlin> cost = ['north': 3, 'south': 2, 'east': 1, 'west': 2]
==>north=3
==>south=2
==>east=1
==>west=2
gremlin> g.withSideEffect('cost', cost).V(1).
......1> repeat(outE().inV().simplePath()).
......2> until(hasId(5)).
......3> path().
......4> by().
......5> by(values('direction').as('d').
......6> union(identity(),
......7> select('cost').select(select('d'))).fold())
==>[v[1],[north,3],v[2],[east,1],v[4],[east,1],v[5]]
==>[v[1],[north,3],v[2],[east,1],v[3],[west,2],v[4],[east,1],v[5]]

Now that you know how to map a direction to its respective cost value, it's probably already clear to you how to track the path's overall cost. If not, sack() is the key:

gremlin> g.withSack(0).
......1> withSideEffect('cost', cost).V(1).
......2> repeat(outE().sack(sum).
......3> by(values('direction').as('d').
......4> select('cost').select(select('d'))).
......5> inV().simplePath()).
......6> until(hasId(5)).
......7> path().
......8> by().
......9> by(values('direction').as('d').
.....10> union(identity(),
.....11> select('cost').select(select('d'))).fold()).
.....12> project('path','cost').
.....13> by().
.....14> by(sack())
==>[path:[v[1],[north,3],v[2],[east,1],v[4],[east,1],v[5]],cost:5]
==>[path:[v[1],[north,3],v[2],[east,1],v[3],[west,2],v[4],[east,1],v[5]],cost:7]

As you can see, the final sack value contains the overall costs, hence the shortest path will be the one with the lowest sack value:

gremlin> g.withSack(0).
......1> withSideEffect('cost', cost).V(1).
......2> repeat(outE().sack(sum).
......3> by(values('direction').as('d').
......4> select('cost').select(select('d'))).
......5> inV().simplePath()).
......6> until(hasId(5)).
......7> path().
......8> by().
......9> by(values('direction').as('d').
.....10> union(identity(),
.....11> select('cost').select(select('d'))).fold()).
.....12> order().
.....13> by(sack(), incr).
.....14> limit(1)
==>[v[1],[north,3],v[2],[east,1],v[4],[east,1],v[5]]

gremlin> g.withSack(0).
......1> withSideEffect('cost', cost).V(1).
......2> repeat(outE().sack(sum).
......3> by(values('direction').as('d').
......4> select('cost').select(select('d'))).
......5> inV().simplePath()).
......6> until(hasId(5)).
......7> path().
......8> order().
......9> by(sack(), incr).
.....10> limit(1)
==>[v[1],e[0][1-connection->2],v[2],e[1][2-connection->4],v[4],e[2][4-connection->5],v[5]]

gremlin> g.withSack(0).
......1> withSideEffect('cost', cost).V(1).
......2> repeat(flatMap(outE().sack(sum).
......3> by(values('direction').as('d').
......4> select('cost').select(select('d'))).
......5> inV()).simplePath()).
......6> until(hasId(5)).
......7> path().
......8> order().
......9> by(sack(), incr).
.....10> limit(1)
==>[v[1],v[2],v[4],v[5]]

I really need to run (or wrap) the query with(in) multiple starting nodes

Adding multiple start vertices is as simple as changing V(1) to V(1,2,3) or whatever.

Cheers,

Daniel

--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/8a88fdea-75a8-4d15-aa90-4997afc42965%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

shapeshifter

unread,

May 2, 2019, 7:00:53 AM5/2/19

to Gremlin-users

Thank you so much, Daniel! A fantastically detailed answer that I've learnt so much from by dissecting each part.

To unsubscribe from this group and stop receiving emails from it, send an email to gremli...@googlegroups.com.

shapeshifter

unread,

May 8, 2019, 12:52:34 PM5/8/19

to Gremlin-users

Mind if I get a little extra help on this please?

The traversal below solves the original query, but I quickly realised that it was iterating every path in full even when it had already exceeded the best (minimum) path cost already found.

The "obvious" way to optimise seemed to be to modify the until() to also terminate if the current path cost had exceeded the best found so far. So, (taking out the existing hasId() check for clarity), something like..

until(sack().tail(1).is(gte(sack().min())))

but I get "..DefaultGraphTraversal cannot be cast to java.lang.Integer" errors, so how exactly do I express this?

I've spent 2 full days pulling my hair out trying to do this with no joy, please put me out of my misery!

Daniel Kuppitz

unread,

May 9, 2019, 9:20:30 AM5/9/19

to gremli...@googlegroups.com

Untested:

g.withSack(0).
withSideEffect('cost', cost).V(1).
repeat(flatMap(outE().sack(sum).

by(values('direction').as('d').

select('cost').select(select('d'))).

or(__.not(select('m').unfold()),
select('m').select('cost').
project('current','min').
by(sack()).
by().
where('current', lt('min'))).
inV()).simplePath()).
until(hasId(5).group('m').by(constant('cost')).by(sack().min())).
path().
limit(1)

You need to keep track of the minimum cost value and then compare the current cost value against the current minimum cost value (if one exists, yet) inside repeat(). If the current cost value is >= the current minimum, let the traverser die. Note, that this can only work for non-negative cost values. If any cost value can ever be negative, you'll have to evaluate all paths.

Cheers,

Daniel

To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/2ff4ec55-fd2c-4b58-ba6a-49a136c36f8b%40googlegroups.com.

shapeshifter

unread,

May 23, 2019, 10:57:44 AM5/23/19

to Gremlin-users

Hi Daniel,

Thanks for your solution. I understand the principle (although the syntax took a lot more puzzling about), however, this gives an error:

java.util.LinkedHashMap cannot be cast to java.lang.Integer

It would seem in the where clause that 'min' is just the cost map itself. I've tried many ways to try and correct this, but have drawn a blank (even hard-coding an int inside lt() gives an 'int can't be case to string' error)

Would you able to amend this please?

I have a situation now with my own graph where my queries are timing out (3 mins on my system). Disregarding path COST and just considering hop count for a moment, I have found a path from a given A to B in about 15 secs using trial and error (i.e. crippling the query with pre-selected values in a times() step). Anything more than about 18 steps and timeouts occur, so I really need to be able to cut the walking time by not continuing with any paths already too long

As an aside, I've changed the original query to avoid using path() as

a) I do not want the edges returned and cannot work out how to skip them (I also hear path() is a cpu hog)

b) where multiple paths are joint-shortest I need all of them, not just one

// this just uses "modern" graph with added direction properties on all of the edges

gm.withSack(0).withSideEffect('cost', cost).V().as('a').

repeat(outE().sack(sum).by(values('direction').as('dt').select('cost').select(select('dt'))).

inV().as('a').simplePath()).

until(hasId(within(3,4,5))).

select(all, 'a').map(unfold().id().fold()).group().by(sack()).unfold().

order().by(keys, incr) //.limit(1)

==>1=[[4, 5], [4, 3]]

==>100=[[1, 4]]

==>10000=[[1, 3], [6, 3]]

// but because I want ALL the shortest paths and don't really care about the actual cost itself

order().by(keys, incr).limit(1).select(values).unfold()

==>[4,5]

==>[4,3]

Does that look ok? It works fine on my own graph, I'm just crippled by having to evaluate every possible path in full

Thanks in advance

p.s. is it possible to do an initialising group('m') step before going into the repeat loop to avoid doing that or()...unfold() test on every iteration?

Daniel Kuppitz

unread,

May 23, 2019, 1:50:26 PM5/23/19

to gremli...@googlegroups.com

I do not want the edges returned and cannot work out how to skip them

What you did is the easiest way to get rid of the edges.

I also hear path() is a cpu hog

It's not about the path() step, it's more about having path tracking enabled (which you have anyway, as you're using simplePath()).

Does that look ok? It works fine on my own graph, I'm just crippled by having to evaluate every possible path in full

Yeah, looks good. There's no way around the evaluation of every path that when you use a cost-based algorithm. The longest path can be found much earlier than the shortest one.

is it possible to do an initialising group('m') step before going into the repeat loop to avoid doing that or()...unfold() test on every iteration?

Not sure why I didn't do that in my initial example, but something like this should work (make the unfold() check unnecessary):

.withSideEffect("m", ["cost":Integer.MAX_VALUE])

Cheers,

Daniel

To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/9e4396bf-54b7-418a-8970-68b547df224c%40googlegroups.com.

Richard Moss

unread,

May 23, 2019, 8:10:06 PM5/23/19

to gremli...@googlegroups.com

Thanks for your reply, Daniel- it's good to know I'm on the right track.

One thing you didn't respond on though is my biggest problem- the error caused when the where clause is tested:

Daniel Kuppitz

unread,

May 24, 2019, 12:39:32 AM5/24/19

to gremli...@googlegroups.com

I'll need a sample graph and the exact query to reproduce the issue. The queries that I posted didn't throw any exceptions during my tests.

Cheers,

Daniel

--

You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/CAD4-su%2BwaBwJeruLStkm8Wkxj04kkC_jej-oG88GFugxELSySg%40mail.gmail.com.

Daniel Kuppitz

unread,

May 24, 2019, 12:41:38 AM5/24/19

to gremli...@googlegroups.com

Sorry, strike that, I just realized that my last query was untested.

Checking....

Daniel Kuppitz

unread,

May 24, 2019, 1:07:32 AM5/24/19

to gremli...@googlegroups.com

Tested:

g.withSack(0).
withSideEffect('m', ['cost': Integer.MAX_VALUE]).

withSideEffect('cost', cost).V().as('a').
repeat(outE().sack(sum).

by(values('direction').as('d').
select('cost').select(select('d'))).

filter(select('m').select('cost').

project('current','min').
by(sack()).
by().
where('current', lt('min'))).

inV().simplePath().as('a')).
until(hasId(5).group('m').by(constant('min')).by(sack().min())).
group().
by(sack()).
by(select(all, 'a').fold()).
select(select('m').select('min')).unfold()

Cheers,

Daniel

shapeshifter

unread,

May 24, 2019, 12:47:24 PM5/24/19

to Gremlin-users

Hi Daniel,

I didn't declare that I was using Janusgraph (itself using tinkerpop 3.3.3). I've just run it very quickly in a new, local TP 3.4.1 installation and can confirm it runs without the hashmap error (what has changed in between versions?). Anyway, as I still have a question or 2 about how and what it's doing, here's the graph data and a run using your latest traversal.

graphmod = TinkerFactory.createModern()

gm = graphmod.traversal()

gm.E(7).property("direction", "above")

gm.E(8).property("direction", "side")

gm.E(9).property("direction", "below")

gm.E(10).property("direction", "above")

gm.E(11).property("direction", "side")

gm.E(12).property("direction", "below")

cost = ["above":10000,"side":100,"below":1]

This is virtually the same traversal you provided in your last post:

gm.withSack(0).withSideEffect('m', ['cost': Integer.MAX_VALUE]).withSideEffect('cost', cost).V().as('a').

repeat(outE().sack(sum).by(values('direction').as('d').select('cost').select(select('d'))).

filter(select('m').select('cost').

project('current','min').

by(sack()).

by().

where('current', lte('min'))). // changed to lte as I want ALL the shortest paths matching the first one found, not just that first one itself

inV().simplePath().as('a')).

until(hasId(3).group('m').by(constant('min')).by(sack().min())).

union(path(), sack())

==>[v[1],e[9][1-created->3],v[3]]

==>1

==>[v[1],e[8][1-knows->4],v[4],e[11][4-created->3],v[3]]

==>200

==>[v[4],e[11][4-created->3],v[3]]

==>100

==>[v[6],e[12][6-created->3],v[3]]

==>1

I expected (hoped) that I would only see the first and last entry here as the second and third would not have passed the where() test. However, that's assuming these traversals are calculated linearly, and this may be a lack of understanding on my part about how traversers work- in parallel or serially? - I'm guessing they would all advance one step at a time until all have finished rather than one whole traversal at a time? In that case, perhaps this graph is just too small to see the result I want? But the bottom line is, I can only work on my larger graph if paths cease to be expanded when they exceed the current minimum (i.e. least-cost so far). I won't complain too much if this doesn't actually happen until the next step/loop, as evaluating one edge too many on every path is better than trying to exhaust each path for a target it cannot find in any reasonable time.

Thanks so far

Daniel Kuppitz

unread,

May 24, 2019, 2:02:27 PM5/24/19

to gremli...@googlegroups.com

what has changed in between versions?

My bet was the nested select() step, but this should be in 3.3.3: https://issues.apache.org/jira/browse/TINKERPOP-1628

I expected (hoped) that I would only see the first and last entry here as the second and third would not have passed the where() test.

Ugh, I swear, this time I'm double-checking that I post the correct query:

gm.withSack(0).
withSideEffect('m', ['min': Integer.MAX_VALUE]).

withSideEffect('cost', cost).V().as('a').

repeat(outE().sack(sum).
by(values('direction').as('d').

select('cost').select(select('d'))).

filter(select('m').select('min').

project('current','min').
by(sack()).
by().
where('current', lte('min'))).

inV().simplePath().as('a')).
until(hasId(3).group('m').by(constant('min')).by(sack().min())).

map(union(path(), sack()).fold())

Output:

gremlin> gm.withSack(0).
......1> withSideEffect('m', ['min': Integer.MAX_VALUE]).
......2> withSideEffect('cost', cost).V().as('a').
......3> repeat(outE().sack(sum).
......4> by(values('direction').as('d').
......5> select('cost').select(select('d'))).
......6> filter(select('m').select('min').
......7> project('current','min').
......8> by(sack()).
......9> by().
.....10> where('current', lte('min'))).
.....11> inV().simplePath().as('a')).
.....12> until(hasId(3).group('m').by(constant('min')).by(sack().min())).
.....13> map(union(path(), sack()).fold())
==>[[v[1],e[9][1-created->3],v[3]],1]
==>[[v[6],e[12][6-created->3],v[3]],1]

I had this query yesterday but somehow ended up posting my old clipboard content here :/...

Cheers,

Daniel

--

You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/dd8bb9ed-3a64-4f2a-845c-0d0ac8fc119b%40googlegroups.com.

shapeshifter

unread,

May 29, 2019, 8:13:37 AM5/29/19

to Gremlin-users

Hi Daniel,

Thanks for your help again. The good news is that this works and I am able to find solutions that were previously timing out. However, the time taken on my graph ranges from a second when solution(s) are found around 9 edges out to nearly a minute when a solution is at least 18 edges out. I'm looking for something that is closer to a second each time! There are only around 4000 vertices in this particular graph, so for other graphs it could be a huge problem!

Here's the tail end of a profile showing where the time was lost in that slower traversal (the values for the easier query are approximately 1/100th the value for each major step):

SackValueStep(sum,[JanusGraphPropertiesStep([... 1582816 1582816 18466.441

JanusGraphPropertiesStep([DependencyType],v... 1582816 1582816 7648.156

SelectOneStep(last,cost) 1582816 1582816 1891.864

TraversalSelectStep([SelectOneStep(last,d)]) 1582816 1582816 4942.446

TraversalFilterStep([SelectOneStep(last,m), P... 404393 404393 20234.997

SelectOneStep(last,m) 1582816 1582816 3234.110

SelectOneStep(last,min) 1582816 1582816 3052.630

ProjectStep([current, min],[[SackStep, Prof... 1582816 1582816 7969.661

SackStep 1582816 1582816 1200.949

WherePredicateStep(current,lte(min)) 3135.707

EdgeVertexStep(IN) 404393 404393 613.888

PathFilterStep(simple)@[a] 247777 247777 2279.732

RepeatEndStep 2 2 1904.180

TraversalMapStep([UnionStep([[PathStep, Profile... 2 2 15.625 0.03

UnionStep([[PathStep, ProfileStep, EndStep, P... 4 4 0.076

PathStep 2 2 0.013

EndStep 2 2 0.012

SackStep 2 2 0.003

EndStep 2 2 0.004

FoldStep 2 2 5.346

>TOTAL - - 53267.709

Looking at the graph as a tree, I guess the problem is that the traversal is effectively doing a breadth-first search, so when it's hit a target at the 18th edge on the given path, every other possible path has already been expanded up to at least the 17th edge(?). At that point we set the 'min' constant and short-circuit other paths that exceed this, but most of the damage has been done in that most if not all of those paths exceeded this min cost some time before.

I hoped to optimise this with some kind of pre-query 'greedy search' that would set a reasonable (possibly optimal) initial 'min' value, but I don't think it could work. Shame, as I re-ran the 52 sec query replacing Integer.MAX_VALUE with a value I knew was only slighter bigger than the optimal one, and it brings the time down to a reasonable 2 secs.

So that leaves me looking to implement something like dijkstra's shortest path algorithm. I wouldn't know where to start in Gremlin. I think it would require data structures that could have entries removed/popped, not just added/pushed as sack() and store() seem to be. Sad thing is, I could do this quite easily in something like C# by quickly sucking in the 4000 nodes, but then what is the point in having a graph database in the first place?

Any ideas, please?

Daniel Kuppitz

unread,

May 29, 2019, 4:19:52 PM5/29/19

to gremli...@googlegroups.com

Looking at the graph as a tree

If it's a tree, then your traversal should start at the leaf nodes, because then the maximum number of paths equals the number of start vertices.

I guess the problem is that the traversal is effectively doing a breadth-first search

You can run into the same problem if it would be DFS (if a target is only found in any of the last possible paths).

so when it's hit a target at the 18th edge on the given path, every other possible path has already been expanded up to at least the 17th edge(?).

That's right, but who guarantees that you would hit earlier using DFS?

data structures that could have entries removed/popped, not just added/pushed as sack() and store() seem to be.

Yeah, that's a big thing missing in Gremlin.

Any ideas, please?Well, if it's really a tree, then starting from the opposite side should give you your desired sub-second results.

Cheers,

Daniel

--

You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/4ade6e91-40a1-4159-b4a7-746c1bb30834%40googlegroups.com.

Reply all

Reply to author

Forward