match() / select() discrepancy?

68 views
Skip to first unread message

jimsk...@gmail.com

unread,
Sep 4, 2020, 4:15:22 AM9/4/20
to Gremlin-users
In attempting to perform OR-like operations in a where() step applied to a match(), I stumbled on some discrepancies between match() and select().

The following traversals work as expected:
> g.inject([1,1], [1,2]).as("list").
        match(
            __.as("list").unfold().limit(1).as("head"),
            __.as("list").unfold().tail().as("tail")
        )
==>[head:1, tail:1, list:[1,1]]
==>[head:1, tail:2, list:[1,2]]

> g.inject([1,1], [1,2]).as("list").
        match(
            __.as("list").unfold().limit(1).as("head"),
            __.as("list").unfold().tail().as("tail")
        ).where("head", eq("tail"))
==>[head:1, tail:1, list:[1,1]]

If I insert an or() step around the where() then the returned map is missing the key "list":
> g.inject([1,1], [1,2]).as("list").
        match(
            __.as("list").unfold().limit(1).as("head"),
            __.as("list").unfold().tail().as("tail")
        ).or(where("head", eq("tail")))
==>[head:1, tail:1]

Perhaps even more bizarre is the effect of identity():
> g.inject([1,1], [1,2]).as("list").
        match(
            __.as("list").unfold().limit(1).as("head"),
            __.as("list").unfold().tail().as("tail")
        ).identity()
==>[]
==>[]

If or() is a filter step and identity is a nop, then shouldn't both leave the traversers that pass unaltered?

If I do a select() after the match(), I get back to the expected result, whether or not I use an or()/identity():

> g.inject([1,1], [1,2]).as("list").
        match(
            __.as("list").unfold().limit(1).as("head"),
            __.as("list").unfold().tail().as("tail")
     ).select("list", "head", "tail").or(where("head", eq("tail")))
==>[list:[1,1], head:1, tail:1]

It is my understanding that match() and select() should both produce a Map<String, Object>, and in the example above those key/value pairs should be identical in both. The fact that identity()/or(where(…)) behaves differently in each case seems like there may be a bug, else I have fundamentally misunderstood something.

I am using Tinkerpop 3.4.8.

Any insight would be much appreciated.

Jim

Stephen Mallette

unread,
Sep 4, 2020, 8:49:35 AM9/4/20
to gremli...@googlegroups.com
Nice questions - looks like you're really digging into Gremlin. I'll walk you through my thinking as I go through your questions, as I didn't immediately have exact answers to provide. My thinking can be found inline below:

In attempting to perform OR-like operations in a where() step applied to a match(), I stumbled on some discrepancies between match() and select().

oh boy, match() step - the step i know the least about....

The following traversals work as expected:
> g.inject([1,1], [1,2]).as("list").
        match(
            __.as("list").unfold().limit(1).as("head"),
            __.as("list").unfold().tail().as("tail")
        )
==>[head:1, tail:1, list:[1,1]]
==>[head:1, tail:2, list:[1,2]]

> g.inject([1,1], [1,2]).as("list").
        match(
            __.as("list").unfold().limit(1).as("head"),
            __.as("list").unfold().tail().as("tail")
        ).where("head", eq("tail"))
==>[head:1, tail:1, list:[1,1]]

ok - good simple example
 
If I insert an or() step around the where() then the returned map is missing the key "list":
> g.inject([1,1], [1,2]).as("list").
        match(
            __.as("list").unfold().limit(1).as("head"),
            __.as("list").unfold().tail().as("tail")
        ).or(where("head", eq("tail")))
==>[head:1, tail:1]

that's interesting but I think i know what's kinda happening there, but i'll need to verify. match().where() is a special case:


From that link about: "Using MatchPredicateStrategy, where()-clauses are automatically folded into match() and thus, subject to the query optimizer within match()-step"

and when you inject or() into the mix, the where() ceases to act as a modulator on match() and it just behaves as a more regular filter. I'll verify with profile():

gremlin> g.inject([1,1], [1,2]).as("list").
......1>         match(
......2>             __.as("list").unfold().limit(1).as("head"),
......3>             __.as("list").unfold().tail().as("tail")
......4>         ).where("head", eq("tail")).profile()
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
InjectStep([[1, 1], [1, 2]])@[list]                                    2           2           0.042     4.10
MatchStep(AND,[[MatchStartStep(list), ProfileSt...                     1           1           0.996    95.90
  MatchStartStep(list)                                                 2           2           0.020
  TraversalFlatMapStep([UnfoldStep, RangeGlobal...                     2           2           0.063
  MatchEndStep(head)                                                   2           2           0.110
  MatchStartStep(list)                                                 2           2           0.017
  TraversalFlatMapStep([UnfoldStep, ProfileStep...                     2           2           0.192
    UnfoldStep                                                         4           4           0.043
    TailGlobalStep(1)                                                  2           2           0.100
  MatchEndStep(tail)                                                   2           2           0.094
  MatchStartStep(head)                                                 2           2           0.028
  WherePredicateStep(eq(tail))                                         1           1           0.074
  MatchEndStep                                                         1           1           0.030
                                            >TOTAL                     -           -           1.038        -
gremlin> g.inject([1,1], [1,2]).as("list").
......1>         match(
......2>             __.as("list").unfold().limit(1).as("head"),
......3>             __.as("list").unfold().tail().as("tail")
......4>         ).or(where("head", eq("tail"))).profile()
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
InjectStep([[1, 1], [1, 2]])@[list]                                    2           2           0.037     4.85
MatchStep(AND,[[MatchStartStep(list), ProfileSt...                     2           2           0.591    76.86
  MatchStartStep(list)                                                 2           2           0.020
  TraversalFlatMapStep([UnfoldStep, RangeGlobal...                     2           2           0.073
  MatchEndStep(head)                                                   2           2           0.069
  MatchStartStep(list)                                                 2           2           0.017
  TraversalFlatMapStep([UnfoldStep, ProfileStep...                     2           2           0.187
    UnfoldStep                                                         4           4           0.020
    TailGlobalStep(1)                                                  2           2           0.100
  MatchEndStep(tail)                                                   2           2           0.064
OrStep([[WherePredicateStep(head,eq(tail)), Pro...                     1           1           0.140    18.29
  WherePredicateStep(head,eq(tail))                                                            0.032
                                            >TOTAL                     -           -           0.770        -


yep, two different queries when you place an or() in the mix. That only explains part of the "why this happens" though in that we know we have two separate queries but what is causing the behavior in each? Well, there is an interesting optimization that Gremlin has which is implemented within PathRetractionStrategy (another one of the few places where I find Gremlin code a bit mysterious). Before discussing it, I will demonstrate what happens when we don't use it:

gremlin> g.inject([1,1], [1,2]).as("list").
......1>   match(__.as("list").unfold().limit(1).as("head"),
......2>         __.as("list").unfold().tail().as("tail")).
......3>     or(where("head", eq("tail")))
==>[head:1,tail:1]

gremlin> g.inject([1,1], [1,2]).as("list").
......1>   match(__.as("list").unfold().limit(1).as("head"),
......2>         __.as("list").unfold().tail().as("tail")).
......3>     where("head", eq("tail"))

==>[head:1,tail:1,list:[1,1]]
gremlin> g.withoutStrategies(PathRetractionStrategy).
......1>   inject([1,1], [1,2]).as("list").
......2>   match(__.as("list").unfold().limit(1).as("head"),
......3>         __.as("list").unfold().tail().as("tail")).
......4>     or(where("head", eq("tail")))
==>[head:1,tail:1,list:[1,1]]


The idea of PathRetractionStrategy is that it tries to kill path history that it can detect is no longer needed in the traversal. With the strategy enabled it believes that it only needs "head" and "tail" to satisfy the or(where()) so it kills "list" to save some memory. When you use match().where() (without the or()) the where() gets folded up into the match() and PathRetractionStrategy has to assume "list" might yet be needed in the output.
 
Perhaps even more bizarre is the effect of identity():
> g.inject([1,1], [1,2]).as("list").
        match(
            __.as("list").unfold().limit(1).as("head"),
            __.as("list").unfold().tail().as("tail")
        ).identity()
==>[]
==>[]

With the functionality of PathRetractionStrategy in mind we can figure out why identity() behaves as it does:

gremlin> g.inject([1,1], [1,2]).as("list").
......1>   match(__.as("list").unfold().limit(1).as("head"),
......2>         __.as("list").unfold().tail().as("tail")).
......3>   identity()
==>[]
==>[]
gremlin> g.withoutStrategies(PathRetractionStrategy).
......1>   inject([1,1], [1,2]).as("list").
......2>   match(__.as("list").unfold().limit(1).as("head"),
......3>         __.as("list").unfold().tail().as("tail")).
......4>   identity()
==>[head:1,tail:1,list:[1,1]]
==>[head:1,tail:2,list:[1,2]]


If or() is a filter step and identity is a nop, then shouldn't both leave the traversers that pass unaltered?

yes and in essence it does leave it unaltered. the two traversers are still alive (i.e. you get two results, but each lacks path history). 
 
If I do a select() after the match(), I get back to the expected result, whether or not I use an or()/identity():

> g.inject([1,1], [1,2]).as("list").
        match(
            __.as("list").unfold().limit(1).as("head"),
            __.as("list").unfold().tail().as("tail")
     ).select("list", "head", "tail").or(where("head", eq("tail")))
==>[list:[1,1], head:1, tail:1]

I think it's clear now that PathRetractionStrategy gets informed by select() that we need to keep "list", "head" and "tail" so it leaves those alone. 
 
It is my understanding that match() and select() should both produce a Map<String, Object>, and in the example above those key/value pairs should be identical in both. The fact that identity()/or(where(…)) behaves differently in each case seems like there may be a bug, else I have fundamentally misunderstood something.

I'd say that the best practice should be to use select() to be explicit about what is returned. The same goes for graph elements and any other results (i.e. avoid SELECT * types of results). 
 
I am using Tinkerpop 3.4.8.

nice - latest and greatest
 
Any insight would be much appreciated.

 Thanks for presenting a nice, insightful question here with some good easy to follow examples. I'm going to make a note to do something with this thread - blog post, documentation update, etc - so that it's not just information trapped in the mailing list.

jimsk...@gmail.com

unread,
Sep 4, 2020, 11:27:58 AM9/4/20
to Gremlin-users
Hello Stephen,

Yes, I'm beginning to dig about a bit at the ragged edges.

Thank you very much for the detailed reply. I could see from the docs that the folding in of where() was causing the difference but not explain why it manifested as it did. Now I have a better understanding, so thanks.

There is a sentence in the match section of the docs that I think summarises where my confusion arises:
"Once the traverser has "survived" all the patterns (or at least one for or()), match()-step analyzes the traverser’s path history and emits a Map<String,Object> of the variable bindings to the next step in the traversal."

This implies to me that the match() step "concretises" the path information into a map traversal that it then emits in place of the input traverser (i.e. there is an implicit select()-like operation at the end of matching). From that viewpoint it is strange that the PathRetractionStrategy should decide any part of the associated path information is not required because it has become data. I guess that this could be a hinderance to certain optimisations, hence, as you suggest, make explicit select() calls.

I must admit that I don't quite get the identity() thing. I'm still of the opinion that the bare match() and the one with the trailing identity()-step should return the same thing; I'm now just less sure what that thing should be!

I like your simple proofs with the withoutStrategies() step. I can use that trick to try bug out things in the future.

Glad to provide an interesting question.

Stephen Mallette

unread,
Sep 4, 2020, 11:49:14 AM9/4/20
to gremli...@googlegroups.com
>  I must admit that I don't quite get the identity() thing. I'm still of the opinion that the bare match() and the one with the trailing identity()-step should return the same thing; I'm now just less sure what that thing should be!

yeah - I can see why you would think that way if you think in terms of the actual output of match() and not in terms of the traverser carrying the output. i suppose it could be argued that identity() should form an exception to PathRetractionStrategy....if you'd like to make a ticket in JIRA to consider the implications of such a change, feel free to do so.

--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/e764226f-2932-4fed-b58c-c398da49d8ddn%40googlegroups.com.

jimsk...@gmail.com

unread,
Sep 4, 2020, 1:29:27 PM9/4/20
to Gremlin-users
I may well raise a ticket, although it doesn't feel urgent.

Thanks again.
Reply all
Reply to author
Forward
0 new messages