Filtering a nested group statement in OLAP Gremlin

Jen

May 19, 2016, 12:49:46 PM
to Gremlin-users
Hi,

I have a question about filtering a nested group statement in OLAP (Spark) Gremlin with TinkerPop 3.1.1. It is related to another recent question of mine.

I have created a nested group statement, like this example with the Grateful Dead data:
resMap = g.V().hasLabel('song').match(
    __.as('songV').values('name').as('songName'),
    __.as('songV').out('followedBy').as('followedV'),
    __.as('followedV').values('name').as('followedName'),
    __.as('followedV').out('sungBy').as('sungByV'),
    __.as('sungByV').values('name').as('sungBy'),
    __.as('followedV').out('writtenBy').as('writtenByV'),
    __.as('writtenByV').values('name').as('writtenBy')
  ).select('songName','followedName','sungBy','writtenBy').
    group().by(select('songName')).
            by(group().by(select('followedName')).
                       by(select('sungBy').fold())).next()
Which returns results like:
==>WE BID YOU GOODNIGHT={FOOLISH HEART=[Garcia], SHAKEDOWN STREET=[Garcia], BEAT IT ON DOWN THE LINE=[Weir], HELL IN A BUCKET=[Weir_Mydland], PROMISED LAND=[Weir], MEXICALI BLUES=[Weir], TOUCH OF GREY=[Garcia], JACK STRAW=[Weir], MORNING DEW=[Garcia], COLD RAIN AND SNOW=[Garcia], BIG RAILROAD BLUES=[Garcia], FEEL LIKE A STRANGER=[Weir]}
==>I WANT YOU={BALLAD OF A THIN MAN=[Weir]}
==>STIR IT UP={NEW MINGLEWOOD BLUES=[Weir], DRUMS=[Grateful_Dead]}
==>SALT LAKE CITY={FRIEND OF THE DEVIL=[Garcia_Dawson]}
==>WILLIE AND THE HAND JIVE={CANDYMAN=[Garcia], IKO IKO=[Garcia], THE WHEEL=[Garcia_Kreutzmann], ROW JIMMY=[Garcia], GOING DOWN THE ROAD FEELING BAD=[Garcia], DRUMS=[Grateful_Dead]}
==>I NEED A MIRACLE={FOOLISH HEART=[Garcia], CUMBERLAND BLUES=[Garcia_Lesh], DARK STAR=[Garcia], BLACK PETER=[Garcia], CHINA CAT SUNFLOWER=[Garcia], COMES A TIME=[Garcia], TOUCH OF GREY=[Garcia], TERRAPIN STATION=[Garcia], MORNING DEW=[Garcia], DRUMS=[Grateful_Dead], WHARF RAT=[Garcia], GOOD LOVING=[Pigpen_Weir_Mydland], EYES OF THE WORLD=[Garcia], WANG DANG DOODLE=[Weir], MAGGIES FARM=[All], HERE COMES SUNSHINE=[Garcia], BROWN EYED WOMEN=[Garcia], SHAKEDOWN STREET=[Garcia], MIGHT AS WELL=[Garcia], PLAYING IN THE BAND=[Weir_Hart], DEATH DONT HAVE NO MERCY=[Garcia], NEW MINGLEWOOD BLUES=[Weir], HEY JUDE=[Pigpen_Mydland], SCARLET BEGONIAS=[Garcia], AROUND AND AROUND=[Weir], STANDER ON THE MOUNTAIN=[Hornsby, Hornsby, Hornsby, Hornsby], STANDING ON THE MOON=[Garcia], THROWING STONES=[Weir], CASEY JONES=[Garcia], SUNSHINE DAYDREAM=[Weir], UNCLE JOHNS BAND=[Garcia], HES GONE=[Garcia], SHE BELONGS TO ME=[Weir_Garcia], ITS ALL OVER NOW=[Pigpen_Weir], GOING DOWN THE ROAD FEELING BAD=[Garcia], THAT WOULD BE SOMETHING=[Garcia], STAGGER LEE=[Garcia], STELLA BLUE=[Garcia], BERTHA=[Garcia], CRAZY FINGERS=[Garcia], ATTICS OF MY LIFE=[Garcia], THE WHEEL=[Garcia_Kreutzmann], SO MANY ROADS=[Garcia], CHINA DOLL=[Garcia], JAM=[Grateful_Dead], COLD RAIN AND SNOW=[Garcia], LAZY LIGHTNING=[Weir], SHIP OF FOOLS=[Garcia]}

Now I would like to add a filter to the group statement so that it returns only the result rows whose "followedBy" group contains more than 10 songs. For example, these results would be filtered out:
==>HARD TO HANDLE={TELL MAMA=[Etta_James]}
==>WHY DONT WE DO IT IN THE ROAD={CHINA CAT SUNFLOWER=[Garcia], AROUND AND AROUND=[Weir], DRUMS=[Grateful_Dead], STELLA BLUE=[Garcia], BERTHA=[Garcia]}
but these results would not:
==>SUNSHINE DAYDREAM={SAMSON AND DELILAH=[Weir], HELL IN A BUCKET=[Weir_Mydland], ONE MORE SATURDAY NIGHT=[Weir], TOUCH OF GREY=[Garcia], BIRDSONG=[Garcia], WHARF RAT=[Garcia], GOOD LOVING=[Pigpen_Weir_Mydland], BROKEDOWN PALACE=[Garcia], FEEL LIKE A STRANGER=[Weir], SHAKEDOWN STREET=[Garcia], IKO IKO=[Garcia], DONT EASE ME IN=[Garcia], AROUND AND AROUND=[Weir], FOREVER YOUNG=[Neil_Young], COLD RAIN AND SNOW=[Garcia], TURN ON YOUR LOVE LIGHT=[Pigpen_Weir], BOX OF RAIN=[Lesh], SHIP OF FOOLS=[Garcia]}
==>GOING DOWN THE ROAD FEELING BAD={BLACK PETER=[Garcia], HELL IN A BUCKET=[Weir_Mydland], JOHNNY B GOODE=[Weir], NOT FADE AWAY=[Weir], PROMISED LAND=[Weir], MORNING DEW=[Garcia], BIRDSONG=[Garcia], DRUMS=[Grateful_Dead], WHARF RAT=[Garcia], GOOD LOVING=[Pigpen_Weir_Mydland], EYES OF THE WORLD=[Garcia], PLAYING IN THE BAND=[Weir_Hart], NEW MINGLEWOOD BLUES=[Weir], JACK STRAW=[Weir], BABY WHAT YOU WANT ME TO DO=[Pigpen_Mydland], AROUND AND AROUND=[Weir], STANDING ON THE MOON=[Garcia], TURN ON YOUR LOVE LIGHT=[Pigpen_Weir], THROWING STONES=[Weir], CASEY JONES=[Garcia], SUNSHINE DAYDREAM=[Weir], UNCLE JOHNS BAND=[Garcia], SUGAR MAGNOLIA=[Weir], ONE MORE SATURDAY NIGHT=[Weir], ALABAMA GETAWAY=[Garcia], STELLA BLUE=[Garcia], ESTIMATED PROPHET=[Weir], I NEED A MIRACLE=[Weir], BLUES FOR ALLAH=[Garcia], ATTICS OF MY LIFE=[Garcia], BROKEDOWN PALACE=[Garcia], FEEL LIKE A STRANGER=[Weir], HARD TO HANDLE=[Pigpen], THE WHEEL=[Garcia_Kreutzmann], ALL ALONG THE WATCHTOWER=[Weir], HEY BO DIDDLEY=[Garcia]}

I've tried a few things but nothing has worked so far. Any help would be appreciated.

Jen

Daniel Kuppitz

May 19, 2016, 3:42:30 PM
to gremli...@googlegroups.com
Hi Jen,

You need to unfold() the outer map in order to filter its entries. This query worked for me in 3.2.1-SNAPSHOT:

g.V().hasLabel('song').match(
    __.as('songV').values('name').as('songName'),
    __.as('songV').out('followedBy').as('followedV'),
    __.as('followedV').values('name').as('followedName'),
    __.as('followedV').out('sungBy').as('sungByV'),
    __.as('sungByV').values('name').as('sungBy'),
    __.as('followedV').out('writtenBy').as('writtenByV'),
    __.as('writtenByV').values('name').as('writtenBy')
  ).select('songName','followedName','sungBy','writtenBy').
    group().by(select('songName')).
            by(group().by(select('followedName')).
                       by(select('sungBy').fold())).unfold().
    filter(select(values).count(local).is(gt(10)))

... only in OLTP though.

Cheers,
Daniel
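
The effect of the unfold().filter(select(values).count(local).is(gt(10))) tail can be sketched on a plain Groovy map, without a graph. This is only an illustration under assumptions: resMap below is a hypothetical stand-in for the nested map that group() returns, and 'SONG A' and the 'FOLLOWED n' names are invented sample data rather than Grateful Dead results.

```groovy
// Hypothetical stand-in for the nested map returned by group()...next():
// outer key = song name, inner map = followed song name -> list of singers.
def resMap = [
    'HARD TO HANDLE': ['TELL MAMA': ['Etta_James']],
    'I WANT YOU'    : ['BALLAD OF A THIN MAN': ['Weir']],
    // Invented entry with 12 followed songs, so that it passes the filter.
    'SONG A'        : (1..12).collectEntries { ["FOLLOWED $it".toString(), ['Garcia']] }
]

// Keep only the outer entries whose inner map has more than 10 keys.
// This mirrors what the unfold()/filter() steps do to each map entry.
def filtered = resMap.findAll { songName, followed -> followed.size() > 10 }

filtered.each { songName, followed ->
    println "${songName} -> ${followed.size()} followed songs"
}
```

The same findAll call could probably be applied in the Gremlin Console to a map that has already been returned by next(), which would move the filtering to the client. That is an assumption rather than something proposed in this thread, and it only makes sense when the full result fits in client memory.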

Marko Rodriguez

May 19, 2016, 4:22:47 PM
to gremli...@googlegroups.com
@Daniel: Why doesn't it work in OLAP?

Marko.

Daniel Kuppitz

May 19, 2016, 4:41:04 PM
to gremli...@googlegroups.com
The nested grouping throws exceptions in 3.2.1 (I believe they were NoSuchElementExceptions).

Jen

May 20, 2016, 4:48:31 AM
to Gremlin-users
When I try Daniel's query in 3.1.1 OLAP Gremlin (Spark) I get the classic error:
Global traversals on GraphComputer may not contain mid-traversal barriers: GroupStep([SelectOneStep(songName)],[GroupStep([SelectOneStep(followedName)],[SelectOneStep(sungBy), FoldStep])])

Any idea if there's a way to do this in OLAP Gremlin? I can't really use OLTP for my query.

Jen

Daniel Kuppitz

May 20, 2016, 7:09:33 AM
to gremli...@googlegroups.com
Hi Jen,

Would it be sufficient to filter right at the beginning based on the number of out-edges?

g.V().hasLabel('song').filter(outE('followedBy').count().is(gt(10))).match(
  ...

Cheers,
Daniel
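
Whether this pre-filter gives the same answer depends on the data: outE('followedBy').count() counts edges, whereas the nested group() counts distinct followedName keys among the rows that survive match(). A minimal Groovy sketch with hypothetical values (not taken from the Grateful Dead graph) shows how the two numbers could diverge:

```groovy
// Hypothetical names of the songs reached by the three followedBy edges
// leaving one song; two of the edges lead to songs named 'X'.
def followedByEdges = ['X', 'X', 'Y']

// Hypothetical set of followed songs that also have sungBy and writtenBy
// edges, which the match() patterns require; 'Y' is assumed to lack them.
def matchable = ['X'] as Set

// What the edge-based pre-filter would count.
def edgeCount = followedByEdges.size()

// What the inner group() would count: distinct names that survive match().
def groupKeyCount = followedByEdges.findAll { it in matchable }.unique().size()

println "edges=${edgeCount}, inner group keys=${groupKeyCount}"
```

If every followed song has a unique name and both a sungBy and a writtenBy edge, the two counts agree and the early filter should be an adequate substitute.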




Marko Rodriguez

May 20, 2016, 9:19:32 AM
to gremli...@googlegroups.com
Hi Jen,

Note that this exception no longer exists in 3.2.0, where mid-traversal barriers are allowed. Next, as Daniel pointed out earlier in the thread, your particular query gives an NPE in 3.2.0 :) ! However, I know why and will have it fixed for 3.2.1.

** I was dumb about a transient in a serialization.

Take care,
Marko.

Marko Rodriguez

May 24, 2016, 10:20:03 AM
to gremli...@googlegroups.com, d...@tinkerpop.incubator.apache.org
Hi,

Note that I have fixed the nested group() OLAP NPE that is raised in 3.2.0. The work is currently in the PR below awaiting VOTEs for merge.


The following two new test cases exposed your NPE problem as well as a serialization problem for "over-the-wire"-based OLAP engines.


My original solution to this problem yielded poor performance (5x worse than master/), but after some "ah ha!"-moments, I was able to arrive at the same performance of master/.


HTH,
Marko. 