Spark Gremlin

Dovid Kopel

Nov 14, 2016, 8:19:47 PM
to Gremlin-users
Marko (and the Gremlin community),
 
First off, thanks for your great contributions to the OSS community. I’ve been building a framework that utilizes graphs, and we have been using Gremlin with TinkerGraph as the implementation. We want the capacity to compute data and generate graphs, as well as query them, in a distributed fashion. I naively saw that there was the spark-gremlin project and assumed I would have a feature set similar to the one I have with TinkerGraph. I didn’t realize that anything based on HadoopGraph is essentially read-only with respect to the graph.
 
I would like to make the graph mutable and return a new RDD upon each change, presumably when adding a vertex or an edge, or when changing properties. I would think that those calls would look slightly different, returning an RDD instead. It is also possible to simply hide the RDD aspect from the user and overwrite the RDD upon each change.
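
A minimal sketch of the two options described above, in plain Python rather than Spark. An immutable `Snapshot` stands in for an RDD of graph elements: every mutation returns a new snapshot. `MutableFacade` hides the swap from the caller. All names here are hypothetical and are not part of spark-gremlin or TinkerPop.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Tuple

Edge = Tuple[str, str, str]  # (out_vertex_id, label, in_vertex_id)


@dataclass(frozen=True)
class Snapshot:
    """Immutable graph state; a stand-in for an RDD of elements."""
    vertices: FrozenSet[str] = frozenset()
    edges: FrozenSet[Edge] = frozenset()

    def add_vertex(self, vertex_id: str) -> "Snapshot":
        # Returns a new snapshot and leaves this one untouched.
        return replace(self, vertices=self.vertices | {vertex_id})

    def add_edge(self, out_id: str, label: str, in_id: str) -> "Snapshot":
        if out_id not in self.vertices or in_id not in self.vertices:
            raise ValueError("both endpoints must exist")
        return replace(self, edges=self.edges | {(out_id, label, in_id)})


class MutableFacade:
    """Looks mutable to the caller; overwrites the snapshot on each change."""

    def __init__(self) -> None:
        self.current = Snapshot()

    def add_vertex(self, vertex_id: str) -> None:
        self.current = self.current.add_vertex(vertex_id)

    def add_edge(self, out_id: str, label: str, in_id: str) -> None:
        self.current = self.current.add_edge(out_id, label, in_id)
```

With the first style, older snapshots stay valid after a change. With the facade, only the latest one is reachable.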
 
I was going to use TinkerGraph as a minimal template and just make all the “elements” an RDD and parallelize them. I am assuming that your use case was processing large graphs and that you did not really intend to support large-scale graph creation. I wanted to ask whether this was your intention and whether there is anything else I should consider prior to attempting an implementation.
 
Thanks,
DBK

Marko Rodriguez

Nov 15, 2016, 8:35:00 AM
to gremli...@googlegroups.com
Hello,

Graph mutation in OLAP is tricky and something that we have slated, but it would need a solid 2+ weeks of focused effort to accomplish, along with a major release to merge it into.

Here are the complications of OLAP mutations:

1. If you delete an “out” edge from a vertex, the next iteration needs to delete the corresponding “in” edge at the adjacent vertex.
2. If you delete a vertex, the next iteration needs to delete all edges incident to that vertex.
3. If you delete an edge property from a vertex, the next iteration needs to delete the corresponding edge property at the adjacent vertex.
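
To illustrate why these deletions span two iterations, here is a minimal Python sketch. It assumes each vertex carries its own copy of its incident edges, as the list above implies. `StarVertex`, `add_edge`, and `dangling_in_edges` are hypothetical names, not TinkerPop API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class StarVertex:
    """A vertex together with its own copy of its incident edges."""
    vertex_id: str
    out_edges: Set[Tuple[str, str]] = field(default_factory=set)  # (label, in_vertex_id)
    in_edges: Set[Tuple[str, str]] = field(default_factory=set)   # (label, out_vertex_id)


def add_edge(graph: Dict[str, StarVertex], out_id: str, label: str, in_id: str) -> None:
    # One logical edge is recorded twice: once at each endpoint.
    graph[out_id].out_edges.add((label, in_id))
    graph[in_id].in_edges.add((label, out_id))


def dangling_in_edges(graph: Dict[str, StarVertex]) -> List[Tuple[str, str, str]]:
    """Return (out_id, label, in_id) for in-edges with no matching out-edge."""
    dangling = []
    for vertex in graph.values():
        for label, out_id in vertex.in_edges:
            source = graph.get(out_id)
            if source is None or (label, vertex.vertex_id) not in source.out_edges:
                dangling.append((out_id, label, vertex.vertex_id))
    return sorted(dangling)


graph = {"a": StarVertex("a"), "b": StarVertex("b")}
add_edge(graph, "a", "knows", "b")

# Vertex "a" deletes its "out" edge locally; vertex "b" has not heard about it yet.
graph["a"].out_edges.discard(("knows", "b"))
stale = dangling_in_edges(graph)
```

After the local delete, `stale` holds the one "in" edge at "b" that the next iteration still has to remove.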

What this means is that there need to be two “phases” to an OLAP iteration.

1. Collect all mutation messages and mutate.
2. Collect all processing messages and process.
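
The two phases could be sketched roughly as follows, in plain Python. This is only an illustration under simplifying assumptions: the graph is a dict of per-vertex "out" and "in" neighbour sets, and mutation messages travel in their own inbox. None of these names come from the GraphComputer or VertexProgram APIs.

```python
from collections import defaultdict

DROP_IN_EDGE = "drop_in_edge"    # receiver removes its "in" edge from the sender
DROP_OUT_EDGE = "drop_out_edge"  # receiver removes its "out" edge to the sender


def drop_vertex(graph, vertex_id):
    """Delete a vertex locally; return mutation messages keyed by neighbour id."""
    vertex = graph.pop(vertex_id)
    outbox = defaultdict(list)
    for target in vertex["out"]:
        outbox[target].append((DROP_IN_EDGE, vertex_id))
    for source in vertex["in"]:
        outbox[source].append((DROP_OUT_EDGE, vertex_id))
    return outbox


def apply_mutations(graph, mutation_inbox):
    """Phase 1: bring every vertex's local edge sets back in sync."""
    for vertex_id, messages in mutation_inbox.items():
        vertex = graph.get(vertex_id)
        if vertex is None:
            continue  # the receiver was deleted as well
        for kind, sender in messages:
            key = "in" if kind == DROP_IN_EDGE else "out"
            vertex[key].discard(sender)


def run_iteration(graph, mutation_inbox, processing_inbox, compute):
    """Run both phases once; return the inboxes for the next iteration."""
    apply_mutations(graph, mutation_inbox)                      # phase 1
    next_mutations, next_processing = defaultdict(list), defaultdict(list)
    for vertex_id in list(graph):                               # phase 2
        if vertex_id not in graph:
            continue  # dropped by an earlier compute call in this phase
        inbox = processing_inbox.get(vertex_id, [])
        mutations, messages = compute(graph, vertex_id, inbox)
        for target, items in mutations.items():
            next_mutations[target].extend(items)
        for target, items in messages.items():
            next_processing[target].extend(items)
    return next_mutations, next_processing


def drop_b(graph, vertex_id, inbox):
    """Toy vertex program: vertex "b" deletes itself, everyone else idles."""
    if vertex_id == "b":
        return drop_vertex(graph, "b"), {}
    return {}, {}


def idle(graph, vertex_id, inbox):
    return {}, {}


graph = {
    "a": {"out": {"b"}, "in": set()},
    "b": {"out": {"c"}, "in": {"a"}},
    "c": {"out": set(), "in": {"b"}},
}
pending, _ = run_iteration(graph, {}, {}, drop_b)
stale_out_of_a = set(graph["a"]["out"])   # still refers to "b" after iteration 1
run_iteration(graph, pending, {}, idle)   # phase 1 of iteration 2 cleans up
```

Keeping mutation messages in a separate inbox is one way to keep them “behind the scenes”, so the vertex program only ever sees its processing messages.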

Given that VertexPrograms work solely with M messages, the mutation messages would need to be something “behind the scenes.” Implementing this is not hard, but it takes a lot of care:

1. Lots more GraphComputerTest cases.
2. Additions and corresponding tests to the GraphComputer.Features. 
3. Adding mutation message semantics.
4. Probably will require changes to the VertexProgram API.
5. OLAP engine providers will have a lot of work to do to update their implementations.

So, generating graphs in OLAP is possible and something we have been kicking down the road. If you get an implementation going, I would love to look at it to see how you did it.

Keep in touch,
Marko.
--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/9e417c09-1a62-47a1-98d5-4ea0220ac211%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

HadoopMarc

Nov 16, 2016, 8:52:26 AM
to Gremlin-users
Hi all,

Good to hear that the HadoopGraph modification feature has become more concrete and is still on the roadmap. This will allow the following use of large datasets that do not change very frequently:

 - convert raw data to a large kryo input file for HadoopGraph (or possibly upgrade the BulkLoaderVertexProgram to write to HadoopGraph)
 - do OLAP gremlin inference of edges and store the modifications
 - have the resulting HadoopGraph available for gremlin OLAP queries
 - bulkload the resulting HadoopGraph into S2Graph for interactive serving

Cheers,     Marc

On Tuesday, 15 November 2016 02:19:47 UTC+1, Dovid Kopel wrote:

Florian Hockmann

Nov 16, 2016, 9:28:01 AM
to Gremlin-users
Hi,

I am also really looking forward to this feature getting implemented. Two important use cases from my point of view are:
  • Transforming the schema for older data, e.g., changing the data type of a certain property from string to Date or converting a property into an edge to another vertex
  • Doing some machine learning in OLAP and storing the results directly in the graph, e.g., clustering and then connecting every vertex with an edge to its cluster vertex
The first use case was also mentioned in the respective ticket: https://issues.apache.org/jira/browse/TINKERPOP-942

For me, OLAP is currently primarily a way to perform queries that require traversing many vertices, but with the possibility to modify the graph in an OLAP job it would become much more useful.

Regards,
Florian

Dovid Kopel

Nov 29, 2016, 5:27:26 AM
to Gremlin-users

I spent around two days on this and have basic functionality working. I have not made a pull request yet, and I haven't written ample tests or coverage.

I intentionally abstracted the RDD storage from the graph to allow for future changes.

I had fully functional traversals using the built-in strategies.

Please let me know what you think so far. If you think this is a reasonable approach I can clean things up and make an official pull request.

-DBK

Robert Dale

Nov 29, 2016, 5:34:53 AM
to gremli...@googlegroups.com
Dovid,

If you create a JIRA ticket and make all your changes in a branch, it will be much easier to track your work. See also https://tinkerpop.apache.org/docs/current/dev/developer/#_contributing_code_changes


Robert Dale


Dovid Kopel

Nov 30, 2016, 3:30:56 PM
to Gremlin-users

Robert Dale
