Using Gremlin on Azure Cosmos DB, how do you efficiently get edges excluding a previously stored list of edges?

649 views
Skip to first unread message

Harry O'Connor

unread,
Jul 25, 2019, 4:10:35 PM7/25/19
to Gremlin-users
From a vertex I want to be able to get its edges excluding ones I have previously stored in a list.


Using .where() in Cosmos is inefficient, for instance when I examine queries using .executionProfile() :

g
.V().has("id", "1").outE("Link").where(values("Num").is(gt(100))).inV()
"name": "GetEdges","resultCount": 20
"name": "GetNeighborVertices", "resultCount": 5


g
.V().has("id", "1").outE("Link").has("Num",gt(100)).inV()
"name": "GetEdges", "resultCount": 5
"name": "GetNeighborVertices" "resultCount": 5

In this case .where() results in 15 unnecessary GetEdge steps.


In my case I have used store("idList") for id's "1","2" and "3". Now I am on a state with edges "1","2","3" and "4".

Using has("id",without("idList")) fails to filter the list and returns "1","2","3","4".

However if I manually type has("id", without("1","2","3"))  it returns "4" efficiently :

"name": "GetEdges", "resultCount": 1
"name": "GetNeighborVertices", "resultCount": 1

Alternatively I can store an edge list using store("edgeList"), and use .where(without("edgeList")) but this is inefficient :

"name": "GetEdges", "resultCount": 4
"name": "GetNeighborVertices", "resultCount": 1

with 3 unnecessary GetEdge steps.


Here is an example of has(id, without("idList")) not working using Modern graph.
gremlin>  graph = TinkerFactory.createModern()
==>tinkergraph[vertices:6 edges:6]
gremlin
> g = graph.traversal()
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]

gremlin
> g.V().has(id,1).outE("created").store("edgeList").store("idList").by(id).inV().inE().where(without("edgeList"))
==>e[11][4-created->3]
==>e[12][6-created->3]

gremlin
> g.V().has(id,1).outE("created").store("edgeList").store("idList").by(id).inV().inE().has(id, without("idList"))
==>e[9][1-created->3]
==>e[11][4-created->3]
==>e[12][6-created->3]

gremlin
> g.V().has(id,1).outE("created").store("edgeList").store("idList").by(id).inV().inE().has(id, without(9,11))
==>e[12][6-created->3]


Is there a way to efficiently get edges excluding a previously stored list of edges using Azure Cosmos DB?

Oliver Towers

unread,
Jul 26, 2019, 4:30:04 PM7/26/19
to Gremlin-users
Hi Harry,

The difference in behavior between has() and where() using without() predicate is expected.

For step:

   has(<property>, without(<value_0>, <value_1>, ...))

<value_i> are a treated as a scalar values to filter <property> on. In Cosmos DB, this resolves to an index-lookup filter: <properykey> not in (<value_0>, <value_1>, ...).

So has(id, without('idList'))  won't apply the filter as you are expecting, as it won't resolve 'idList' as a variable reference.


For step:

    where(without(<key>))


the <key> parameter is treated as a reference to a side-effect, map, path label, so in this case it will resolve 'idList'

In Cosmos DB, where(...) expressions compile to runtime evaluated filters in most cases. A few cases can optimize to index filters, but not for the examples you provided.
In the case of runtime evaluation, results from the preceding operators are streamed to the where filter for evaluation, so additional intermediate GetEdge results will be enumerated as shown in the execution profile.

Given this, there isn't an alternative traversal that will perform more efficiently in Cosmos DB.

However, you could improve the efficiency by breaking the operation into multiple requests.

So for this traversal:

    g.V().has(id, 1).outE("created").store("edgeList").inV().inE().where(without("edgeList"))


Instead of reading streaming inE() results to apply the where() filter, this could be split into two traversals:
  1. Resolves the outE() edges
  2. Resolves the set of inE() edges not included in the outE() results.

For example:

gremlin> g.V().has(id, '1').outE('created')

==>e[9][1-created->3]



Take vertex '3' and edge id '9'

gremlin> g.V('3').inE().has(id, without('9'))

==>e[11][4-created->3]
==>e[12][6-created->3]


If there are multiple edges then the second query can be expressed as:

    g.V(vId0, vId1, ...).inE().has(id, without(eId0, eId1, ...))

Harry O'Connor

unread,
Jul 27, 2019, 6:40:17 AM7/27/19
to Gremlin-users

Hi Oliver, thanks for your answer! It was just what I needed to see.

Your suggestion to break the query into multiple steps was very interesting, but due to reading the vertices twice does not seem to improve efficiency in my case. I tend to have 1-3 in edges per vertex, with around 1 edge in a list not to read. So your suggestion would save reading 1-2 edges each time but at the cost of reading an additional vertex, which in my case is not efficient.

Looking at the JSON edge output I noticed it has an "outV" property, so to avoid multiple vertex reads, I tried doing g.E().has("Link","outV","V1") but this results in no output. Doing this search based on other properties works, g.E().has("Link","num",100), works efficiently. Is there something special about the "outV" property that prevents this from working?

Harry O'Connor

unread,
Jul 27, 2019, 11:44:54 AM7/27/19
to Gremlin-users
Also a quick clarification, you say where(...) can optimize to index filters in some cases, is there a list of such cases?
It has come up a couple of times in my project.
Thanks, Harry.

Luis Bósquez

unread,
Jul 28, 2019, 7:49:41 PM7/28/19
to Gremlin-users
Hello Harry,

To address your question about using this query:
g.E().has("Link","outV","V1")

The edges don't have a property called "outV", this value is assigned to the response. The values embedded in the edges that represent the vertex they're pointing to are named "_sink". So this query should work:

g.E().has("Link","_sink","V1")

Hope this helps!

Luis
Reply all
Reply to author
Forward
0 new messages