Patterns for match() step without starting labels

135 views
Skip to first unread message

Toshio Ito

unread,
Nov 23, 2020, 6:47:48 AM11/23/20
to Gremlin-users
(I'm testing with org.apache.tinkerpop:gremlin-core:3.3.0 (sorry, it's a little old))

In Gremlin's match() step, is it allowed to pass match patterns without the starting labels (i.e. the as() step at the head)?

I suppose that a regular (non-where) match pattern must have the starting label. That's because I got an exception from the following expression.

Example (1):

    __.__(1,2,3,4).match(
        __.map { it.get() * 3 }.as("b")
    ).iterate();

However, it looks like we can use patterns using "where" step without the starting label.

Example (2):

    __.__(1,2,3,4).match(
        __.as("a").map { it.get() * 3 }.as("b"),
        __.where( __.map { it.get()  }.as("a") )
    ).iterate();

Example (3):

    __.__(1,2,3,4).match(
        __.as("a").map { it.get() * 3 }.as("b"),
        __.where(P.eq("b"))
    ).iterate();

The examples (2) and (3) didn't throw any exception. Is it an expected behavior?

Now, if "where" patterns without starting labels are officially allowed, what kind of traverser is supposed to run into the "where" pattern? The current document is unclear about it, I think.

Stephen Mallette

unread,
Nov 25, 2020, 6:35:55 AM11/25/20
to gremli...@googlegroups.com
It is expected behavior - you must specify a starting label to match(). You can start where() without a step label though and you simply treat the traverser as the Map produced by the match():

gremlin> g.V().match(__.as('a').out().as('b')).where(select('a').values('age').is(29))
==>[a:v[1]]
==>[a:v[1]]
==>[a:v[1]]
gremlin> g.V().match(__.as('a').out().as('b')).where(__.as('a').values('age').is(29))
==>[a:v[1],b:v[3]]
==>[a:v[1],b:v[2]]
==>[a:v[1],b:v[4]]




--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/aa2edec7-c746-4e37-964c-26648ffba7bcn%40googlegroups.com.

Toshio Ito

unread,
Nov 25, 2020, 8:14:59 AM11/25/20
to Gremlin-users
> It is expected behavior - you must specify a starting label to match(). You can start where() without a step label though

Thanks for clarification.

> you simply treat the traverser as the Map produced by the match():

Well, my question was about the where() patterns INSIDE match(), not AFTER match(). What kind of traverser goes into where() patterns INSIDE match()?



By the way, your examples terribly surprised me.

> gremlin> g.V().match(__.as('a').out().as('b')).where(select('a').values('age').is(29))
> ==>[a:v[1]]
> ==>[a:v[1]]
> ==>[a:v[1]]

I have no idea about the reasoning behind this result. The result from match() step is a Map with two keys "a" and "b". However, the following where() step removes the key "b" from the input. Why? I have always thought that filtering steps don't transform the content of traversers. (maybe I should make another thread about it...)

Stephen Mallette

unread,
Nov 25, 2020, 9:05:35 AM11/25/20
to gremli...@googlegroups.com
On Wed, Nov 25, 2020 at 8:15 AM Toshio Ito <debu...@gmail.com> wrote:
> It is expected behavior - you must specify a starting label to match(). You can start where() without a step label though

Thanks for clarification.

> you simply treat the traverser as the Map produced by the match():

Well, my question was about the where() patterns INSIDE match(), not AFTER match(). What kind of traverser goes into where() patterns INSIDE match()?

sorry, i misunderstood. a where() inside of match() still needs a label. for instance you can't re-write my example as:

gremlin> g.V().match(__.as('a').out().as('b'),where(select('a').values('age').is(29)))
All match()-traversals must have a single start label (i.e. variable): [TraversalFilterStep([SelectOneStep(last,a), PropertiesStep([age],value), IsStep(eq(29))])]
Type ':help' or ':h' for help.
Display stack trace? [yN]

where() in this context does allow for a bit of shorthand though in specifying the label:

gremlin> g.V().match(__.as('a').out().as('b'),where(__.as('a').values('age').is(29)))

==>[a:v[1],b:v[3]]
==>[a:v[1],b:v[2]]
==>[a:v[1],b:v[4]]
gremlin> g.V().match(__.as('a').out('knows').as('b'),where('a',gt('b')).by('age'))

==>[a:v[1],b:v[2]]

You might consider checking the profile() of queries you are confused about - note that where() inside our out of match() look the same from that perspective:

gremlin> g.V().match(__.as('a').out('knows').as('b'),where(eq('b'))).profile()
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
TinkerGraphStep(vertex,[])                                             6           6           0.121    15.92
MatchStep(AND,[[MatchStartStep(a), ProfileStep,...                     2           2           0.641    84.08
  MatchStartStep(a)                                                    6           6           0.051
  VertexStep(OUT,[knows],vertex)                                       2           2           0.052
  MatchEndStep(b)                                                      2           2           0.098
  MatchStartStep                                                       2           2           0.018
  WherePredicateStep(eq(b))                                            2           2           0.037
  MatchEndStep                                                         2           2           0.047
                                            >TOTAL                     -           -           0.762        -
gremlin> g.V().match(__.as('a').out('knows').as('b')).where(eq('b')).profile()
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
TinkerGraphStep(vertex,[])                                             6           6           0.086    14.74
MatchStep(AND,[[MatchStartStep(a), ProfileStep,...                     2           2           0.503    85.26
  MatchStartStep(a)                                                    6           6           0.065
  VertexStep(OUT,[knows],vertex)                                       2           2           0.050
  MatchEndStep(b)                                                      2           2           0.090
  MatchStartStep                                                       2           2           0.017
  WherePredicateStep(eq(b))                                            2           2           0.037
  MatchEndStep                                                         2           2           0.048
                                            >TOTAL                     -           -           0.589        -


By the way, your examples terribly surprised me.

> gremlin> g.V().match(__.as('a').out().as('b')).where(select('a').values('age').is(29))
> ==>[a:v[1]]
> ==>[a:v[1]]
> ==>[a:v[1]]

I have no idea about the reasoning behind this result. The result from match() step is a Map with two keys "a" and "b". However, the following where() step removes the key "b" from the input. Why? I have always thought that filtering steps don't transform the content of traversers. (maybe I should make another thread about it...)

Good question. Gremlin has PathRetractionStrategy which removes path history from traversers if it can determine that they are no longer needed downstream. Since we explicitly reference "a" in where() Gremlin knows it needs that but since we don't reference "b" at all it figures it can remove that bit of history. Carrying around less history reduces the traversal resource requirements and may improve the likelihood of bulking. 

gremlin> g.V().match(__.as('a').out().as('b')).where(select('a').values('age').is(29))
==>[a:v[1]]
==>[a:v[1]]
==>[a:v[1]]
gremlin> g.V().match(__.as('a').out().as('b')).where(select('a').values('age').is(29)).select('a','b')

==>[a:v[1],b:v[3]]
==>[a:v[1],b:v[2]]
==>[a:v[1],b:v[4]]
gremlin> g.withoutStrategies(PathRetractionStrategy).V().match(__.as('a').out().as('b')).where(select('a').values('age').is(29))

==>[a:v[1],b:v[3]]
==>[a:v[1],b:v[2]]
==>[a:v[1],b:v[4]]

Whether PathRetractionStrategy is a bit overzealous in its pruning or not is perhaps debatable.

Stephen Mallette

unread,
Nov 25, 2020, 9:07:18 AM11/25/20
to gremli...@googlegroups.com
This discussion provides some more discussion on this topic:



Toshio Ito

unread,
Nov 26, 2020, 3:31:57 AM11/26/20
to Gremlin-users
About my second question: Thanks a lot about PathRetractionStrategy and the related topic (
 https://groups.google.com/g/gremlin-users/c/CD0EQNm7U3c/m/R_y4A2qkAgAJ ). Having read the thread, now my conclusion is: we should NEVER use the value carried by a traverser emitted from match() step, because it's so unpredictable. Instead, we should always refer to the path history of the traverser emitted from match(), for example, using select(), where() and dedup().

Now back to my first question. That is, where() pattern inside match() without starting labels.

I read the reference document ( https://tinkerpop.apache.org/docs/current/reference/#match-step ) again, and found out that there is no example that uses where() pattern inside match(). So, I suppose where() pattern inside match() is not officially supported, although the implementation doesn't complain about it.

That leaves the case in which where() step comes after match(). My conclusion above tells me to always use starting label in the where(). Without the starting label, the where() step might use the value of the traverser emitted from match(), which is unpredictable.

I ran the example you presented, but without profile().

gremlin> g.V().match(__.as("a").out("knows").as("b")).where(eq("b"))
==>[a:v[1],b:v[2]]
==>[a:v[1],b:v[4]]

Then, I just inserted identity() before where().

gremlin> g.V().match(__.as("a").out("knows").as("b")).identity().where(eq("b"))
(no result)

This weirdness comes from the fact that where(eq("b")) tries to compare the value with label "b" in path history with the value of the current traverser, which is unpredictable due to match() step. So, basically, we should give starting label to where() in this case. That way, the comparison is solely based on path history.

Stephen Mallette

unread,
Nov 26, 2020, 7:11:07 AM11/26/20
to gremli...@googlegroups.com
There are some pretty strong statements in your reply which I'm not sure I completely agree with. Replies inline:

On Thu, Nov 26, 2020 at 3:32 AM Toshio Ito <debu...@gmail.com> wrote:
About my second question: Thanks a lot about PathRetractionStrategy and the related topic (
 https://groups.google.com/g/gremlin-users/c/CD0EQNm7U3c/m/R_y4A2qkAgAJ ). Having read the thread, now my conclusion is: we should NEVER use the value carried by a traverser emitted from match() step, because it's so unpredictable. Instead, we should always refer to the path history of the traverser emitted from match(), for example, using select(), where() and dedup().

"never" and "unpredictable" aren't quite the right words to me. I think that once you know what the optimization for PathRetractionStrategy is, you can safely understand what you can use after match() and what you can't.
 
Now back to my first question. That is, where() pattern inside match() without starting labels.

I read the reference document ( https://tinkerpop.apache.org/docs/current/reference/#match-step ) again, and found out that there is no example that uses where() pattern inside match(). So, I suppose where() pattern inside match() is not officially supported, although the implementation doesn't complain about it.

I wouldn't say the lack of an example means that something in Gremlin isn't supported. We try to document the most frequent patterns of usage and even then I'm not sure we've captured all the different ways that Gremlin can go together. If there is a good pattern to document with where() inside of match() I suppose we should add it to the documentation, but I don't know what that is. 
 
That leaves the case in which where() step comes after match(). My conclusion above tells me to always use starting label in the where(). Without the starting label, the where() step might use the value of the traverser emitted from match(), which is unpredictable. 
 
I ran the example you presented, but without profile().

gremlin> g.V().match(__.as("a").out("knows").as("b")).where(eq("b"))
==>[a:v[1],b:v[2]]
==>[a:v[1],b:v[4]]

Then, I just inserted identity() before where().

gremlin> g.V().match(__.as("a").out("knows").as("b")).identity().where(eq("b"))
(no result)

This weirdness comes from the fact that where(eq("b")) tries to compare the value with label "b" in path history with the value of the current traverser, which is unpredictable due to match() step. So, basically, we should give starting label to where() in this case. That way, the comparison is solely based on path history.

So, you've properly demonstrated what's happening. If PathRetractionStrategy doesn't detect the where() and it therefore assumes "a" and "b" are no longer needed and therefore removed from the path history. That said, this particular situation looks like a bug to me or at least an over-optimization we might account for. I would have expected that last example to work actually because of IdentityRemovalStrategy, but then I checked and that strategy is not loaded globally. I have no idea why and that means that for the last 6 years I've thought that strategy executing when it hasn't been at all!

Anyway, you can see what I mean here:

gremlin> g.V().match(__.as("a").out("knows").as("b")).identity()
==>[]
==>[]
gremlin> g.withStrategies(IdentityRemovalStrategy.instance()).V().match(__.as("a").out("knows").as("b")).identity()

==>[a:v[1],b:v[2]]
==>[a:v[1],b:v[4]]

So, ultimately, you've found "something" but I'm not sure if I should call it a "bug" or an "inconsistency" or a "mystery" - perhaps in some respects you were right about the word "unpredictable" in a sense :) 


Ultimately, I think the final answer here is to do what I think you've come to understand. 

1. Prefer the pattern of where() directly following match() - it's one that is easily recognizable and readable
2. Be explicit with the path history you want extracted following match()

All these issues go away if you stick to that recipe. Thanks for the discussion.

 

Toshio Ito

unread,
Nov 26, 2020, 8:27:20 AM11/26/20
to Gremlin-users
Thanks for the good summary of discussion. And I'm sorry that I might have sounded a little offending.

> I think that once you know what the optimization for PathRetractionStrategy is, you can safely understand what you can use after match() and what you can't.

I agree. However, I still think studying all mechanics of optimization strategies is so difficult and time-consuming. In practice, I want to avoid as many pitfalls as possible, with the least effort.

> I wouldn't say the lack of an example means that something in Gremlin isn't supported. 

OK.

Personally, however, I don't want to use behavior of some software implementation that is not documented. That is because if that behavior is not documented, I'm not sure if it's a legitimate feature or a bug. If I make my own software depend on the behavior, and if the behavior is considered a bug and fixed later, then it will break my software. I don't want that to happen.


> Ultimately, I think the final answer here is to do what I think you've come to understand.

Thanks again for the summary.
Reply all
Reply to author
Forward
0 new messages