Filter vertices based on multi properties


Lilly

Nov 14, 2019, 2:57:58 AM
to Gremlin-users
Hi everyone,

I have a vertex multi-property "records", and I would like to find pairs of vertices whose "records" sets have a non-empty intersection.
For example, if v1 has "records" values 1, 2 and 3 and v2 has "records" values 2, 5, 6 and 4, this pair should come out of the query (in fact I only want to connect them by an edge, so there is no need to store the pair for later use, just to filter for it).

I tried the following, but none of them worked:

g.V().as("v1").V().where(P.within("v1")).by(__.values("records"))

g.V().limit(1).local(__.values("records")).as("r").
V().local(__.values("records")).as("r2").filter(__.where(P.within("r")))

g.V().as("v1").V().has("records",P.within(__.select("v1").values("records")))
Any further suggestions would be very much appreciated! It would be great if the query could be formulated in such a way that it also works for OLAP.
Thanks a lot!

Lilly

Daniel Kuppitz

Nov 16, 2019, 2:51:17 AM
to gremli...@googlegroups.com
Let's start with your sample graph:

gremlin> g = TinkerGraph.open().traversal()
==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
gremlin> g.addV('v1').
......1>     property(list, 'records', 1).
......2>     property(list, 'records', 2).
......3>     property(list, 'records', 3).
......4>   addV('v2').
......5>     property(list, 'records', 2).
......6>     property(list, 'records', 5).
......7>     property(list, 'records', 6).
......8>     property(list, 'records', 4).
......9>   iterate()

Now, to determine the vertex pairs you can do this (works in OLTP and OLAP):

gremlin> g.withComputer().
......1>   V().aggregate('v').as('a').
......2>     map(values('records').fold()).as('av').
......3>   select('v').unfold().as('b').
......4>     where(gt('a')).by(id).
......5>     not(where(out().as('a'))).
......6>     not(where(__.in().as('a'))).
......7>     filter(values('records').where(within('av'))).
......8>   select('a','b')
==>[a:v[0],b:v[4]]
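
For readers who want to see the pairing rule outside of Gremlin, here is a minimal plain-Java sketch of what the query above selects. It is an illustration only, not TinkerPop code: the class name RecordPairs, the method intersectingPairs, and the map of vertex id to record set are hypothetical stand-ins for the graph, and the two not(...) steps that skip already-connected vertices are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: models each vertex as an id plus its set of "records" values.
final class RecordPairs {

    // Returns {a, b} id pairs with a < b whose record sets share at least one value.
    static List<long[]> intersectingPairs(Map<Long, Set<Integer>> recordsById) {
        List<long[]> pairs = new ArrayList<>();
        for (Map.Entry<Long, Set<Integer>> a : recordsById.entrySet()) {
            for (Map.Entry<Long, Set<Integer>> b : recordsById.entrySet()) {
                // Same role as where(gt('a')).by(id): report each unordered pair once
                // and never pair a vertex with itself.
                if (b.getKey() <= a.getKey()) {
                    continue;
                }
                // Same role as filter(values('records').where(within('av'))):
                // keep the pair if any record of b also occurs in a.
                if (!Collections.disjoint(a.getValue(), b.getValue())) {
                    pairs.add(new long[] {a.getKey(), b.getKey()});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Sample data from the thread; 0 and 4 are the ids TinkerGraph assigned above.
        Map<Long, Set<Integer>> records = new LinkedHashMap<>();
        records.put(0L, new HashSet<>(Arrays.asList(1, 2, 3)));
        records.put(4L, new HashSet<>(Arrays.asList(2, 5, 6, 4)));
        for (long[] p : intersectingPairs(records)) {
            System.out.println("a=" + p[0] + ", b=" + p[1]); // prints "a=0, b=4"
        }
    }
}
```

The nested loop makes the cost visible: every vertex is compared with every other vertex, which is also what aggregate('v') followed by unfold() amounts to in the traversal.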

If you want to create edges in the same query though, you have to use OLTP:

gremlin> g.V().aggregate('v').as('a').
......1>     map(values('records').fold()).as('av').
......2>   select('v').unfold().as('b').
......3>     where(gt('a')).by(id).
......4>     not(where(out().as('a'))).
......5>     not(where(__.in().as('a'))).
......6>     filter(values('records').where(within('av'))).
......7>   addE('link').
......8>     from('a').to('b')
==>e[17][0-link->4]

Cheers,
Daniel


--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/d5881d2a-1071-4379-ad2b-2e3c06f4b297%40googlegroups.com.

Lilly

Nov 16, 2019, 3:10:29 AM
to Gremlin-users
Hi Daniel,

Thanks for the elaborate example.

I kind of suspected that it is not possible to add edges via OLAP.
My concern with your first suggestion for OLAP (same problem in OLTP too) is that in my case the result will not fit into RAM, hence I could not store it in a list or the like. Is there maybe a way to write the results directly to a file or some other buffer-like target?
Thanks so much!

Lilly
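
One way to avoid holding the whole result list on the client is to consume the results one at a time and write each pair out immediately. The following is a minimal sketch under stated assumptions: a Java Gremlin traversal can be consumed as an Iterator, and each result is a map with the keys "a" and "b" as produced by select('a','b'). The class name PairFileSink, the method writePairs, and the file name pairs.tsv are hypothetical. It does not reduce the memory used inside the traversal itself, where aggregate('v') still collects every vertex.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical helper: streams (a, b) result rows to a Writer instead of a list.
final class PairFileSink {

    // Writes one "a<TAB>b" line per result and returns the number of lines written.
    static long writePairs(Iterator<? extends Map<String, ?>> results, Writer out) {
        long count = 0;
        try {
            while (results.hasNext()) {
                Map<String, ?> row = results.next();
                // Only the current row is held in memory at this point.
                out.write(row.get("a") + "\t" + row.get("b") + "\n");
                count++;
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the traversal results; a real traversal would be passed instead.
        Map<String, Object> row = new HashMap<>();
        row.put("a", "v[0]");
        row.put("b", "v[4]");
        List<Map<String, Object>> demo = Collections.singletonList(row);
        try (Writer w = Files.newBufferedWriter(Paths.get("pairs.tsv"), StandardCharsets.UTF_8)) {
            System.out.println(writePairs(demo.iterator(), w) + " pair(s) written");
        }
    }
}
```

Taking a Writer rather than a file path keeps the sketch easy to redirect to any other buffer-like target.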

Josh Perryman

Nov 16, 2019, 10:45:39 AM
to Gremlin-users
Some of these requirements seem specific to the environment.  Apache TinkerPop provides a reference implementation of Gremlin Server and Gremlin Console, but they are best for toy examples, and not designed for work at significant scale.

However, there are tools that can work with graph data at scale from various vendors. I'm most familiar with DataStax's DSE Graph, which offers DseGraphFrames for using Apache Spark as a distributed compute tool.  One of the advantages of using either Apache Spark's GraphFrames, or the DseGraphFrames implementation, is that you can add the edges and make other changes to the graph.  You also get better tools for caching or other persistence management approaches. 

I believe that JanusGraph offers similar capabilities (OLTP + Apache Spark for analytics) but I haven't worked much with that ecosystem. 

If you want to share more about the specifics of your stack and environment, then we can probably offer some more specific guidance on the tools which would be available to address these types of problems but at greater scale. 

-Josh

Lilly

Nov 18, 2019, 10:55:59 AM
to Gremlin-users
Hi Josh,

Thanks for your answer.
I am using JanusGraph with a Cassandra backend and SparkGraphComputer for OLAP interactions (I probably should have mentioned this, but I thought my request was independent of the environment).
However, I could not find any way to reach my goal of creating all these edges. In OLAP I believe it is not possible to add edges and I cannot store all these results, and in OLTP I could only think of iterating through all vertices one at a time, which is horribly slow.

Lilly

Josh Perryman

Nov 19, 2019, 9:24:30 AM
to Gremlin-users
Lilly, 

Yes, that is correct. "Gremlin OLAP" cannot be used to make changes to the graph. 

This can be a confusing part of the ecosystem.  Gremlin OLAP, which is commonly just referred to as OLAP, uses the GraphComputer for large-scale graph analytics.  

The Apache Spark ecosystem has a GraphFrames API which has the following advantages: 
  • simple abstraction of graph elements (vertices, edges)
  • support for motifs such as: g.find("(a)-[e]->(b); (b)-[e2]->(a)")
  • ability to process the entire graph
  • ability to change the graph
  • full access to the rest of the Spark APIs (including support for Scala, Python & Java) and functionality (caching and persistence controls)
But it doesn't look like a Spark-to-JanusGraph connector exists.

As I recall, the JanusGraph data model in Cassandra is pretty straightforward.  You could introduce Spark into the environment and connect it directly to Cassandra (bypassing JanusGraph). You would have to define your own GraphFrames. That would allow you to do extensive processing of the data and make changes to it.  

But it does add a whole new technology stack to your environment.  And this processing would bypass the JanusGraph engine, accessing the data directly.  There's a risk that changes made through Spark could leave the data in a state that JanusGraph can no longer use.  It seems like it is somewhat new territory and so there is a fair bit of risk involved. 

All that being said, if you did take that path I know that the JanusGraph community would appreciate any learnings, tools or connectors that could be shared back with the rest of the community.  

I hope this clarifies things a bit.  Good luck, 

Josh

Lilly

Nov 19, 2019, 9:47:15 AM
to Gremlin-users
Hi Josh,

First of all thanks a lot for your reply!
Now I am a little confused. I was under the impression that I am already using the Spark GraphComputer without any use of JanusGraph.
JanusGraph kindly provides a configuration script to do this (so I thought) with the following settings:

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cassandra.Cassandra3InputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
gremlin.spark.persistContext=true
janusgraphmr.ioformat.conf.storage.backend=cassandra
janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1
janusgraphmr.ioformat.conf.storage.port=9160
janusgraphmr.ioformat.conf.storage.cassandra.keyspace=janusgraph
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
spark.master=local[4]
spark.executor.memory=8g
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator
gremlin.spark.graphStorageLevel=MEMORY_AND_DISK

To use Spark, I then create a graph instance with these settings and use
graph.traversal().withComputer(SparkGraphComputer.class)
for the GraphTraversalSource. I also have to include the Spark imports and so on, so I thought this is what I am doing.

Nonetheless, if I try to add edges this way I get the error: addEdge is currently not supported for Sparkgraphcomputer. Wouldn't this conflict with being able to change the graph, if I understand you correctly?

Thanks in advance!
Lilly

Josh Perryman

Nov 19, 2019, 12:28:36 PM
to Gremlin-users
Correct, that is Gremlin OLAP using the Spark engine.  But it isn't GraphFrames.  

Gremlin OLAP cannot make changes to the graph, and doesn't support all of the TinkerPop steps. 

Gremlin OLAP gives a consistent interface (Gremlin w/ the GraphTraversalSource) to the analytics engine. In this case you're using Spark but I believe that other analytical processing engines (like map-reduce) could be used as well. But which compute engine is being used doesn't matter because everything is done with a subset of Gremlin steps. 

Spark's GraphFrames is a completely different API, and a distinct way of working with graph data using the Spark engine. I'm more familiar with the DataStax version, DseGraphFrames, and that API has some helpful updateVertices() and updateEdges() functions.  It doesn't look like those exist in the Apache Spark GraphFrames API.  Spark could still be used to make changes to the graph, but there's a lot of additional code that would need to be written to do so in a safe manner. 

I know that you're dealing with non-trivial data sizes, along with the complexity of working with highly connected data.  The truth is that few of the people working in this dark corner of data processing are talking about it, and the tooling is very weak. I think that we're going to see more and more work happening in this area in the coming years, but for right now it is going to take a fair bit of additional effort to solve this type of problem. 

Josh

Stephen Mallette

Nov 20, 2019, 1:43:24 PM
to gremli...@googlegroups.com


Lilly

Nov 21, 2019, 3:51:48 AM
to Gremlin-users
Hi Daniel,

One more thing regarding your suggestion.
I was trying your first option with the graph computer. Even when leaving out lines 5 and 6 it still won't run, complaining that the id step of line 4 requires edges.
However, it seems in your example this works fine.

I am using Java and adapted the whole query like this:

g.V().aggregate("v").as("a").
        map(__.values("records").fold()).as("av").
        select("v").unfold().as("b").
        where(P.gt("a")).by(__.id()).
        filter(__.values("records").where(P.within("av"))).
        select("a","b").next()

Since "id" is
On Saturday, November 16, 2019 at 08:51:17 UTC+1, Daniel Kuppitz wrote:

Lilly

Nov 21, 2019, 3:53:32 AM
to Gremlin-users
* Sorry, I still wanted to add:
I didn't know how else to query for the id in Java. Maybe this is the problem.

Thanks
Lilly

Lilly

Nov 21, 2019, 3:55:54 AM
to Gremlin-users
Hi Josh,

I see! Sorry it took me a while to understand what you mean.
OK, well, it is unfortunate that there is nothing more so far. But thanks anyway for all your help and explanations!
I will see what I can do to solve my problem.

Thanks again,
Lilly

Daniel Kuppitz

Nov 21, 2019, 8:22:02 PM
to gremli...@googlegroups.com
> Even on leaving out lines 5 and 6 it still won't run, complaining that the id step of line 4 requires edges.

Can you post the exact error message you are seeing?

> I didn't know how else to query for id in java

It's T.id, not __.id().

Cheers,
Daniel



Lilly

Nov 27, 2019, 4:27:06 AM
to Gremlin-users
Hi Daniel,

Sorry for the late reply, I could not test it again until now.
So it seems this works (by using T.id). However, I do not quite understand why.
In OLTP my version works fine. Do you know why this is?

Thanks so much,
Lilly

Daniel Kuppitz

Nov 27, 2019, 12:43:09 PM
to gremli...@googlegroups.com
> So it seems this works (by using T.id). However, I do not quite understand why.
> In OLTP my version works fine. Do you know why this is?

Ultimately, __.id() and T.id should yield the same result. It's different code that's executed under the hood (using the token should always be faster), but there should be no difference at all in the result. Perhaps it's a bug in JG.

Cheers
Daniel

