Change management for Gremlin-based graph DB


Nikolay Valchev

Aug 24, 2019, 10:51:55 AM
to Gremlin-users
Hello folks,

I am looking into Gremlin and graph DBs as a foundation for a Knowledge Graph focused project. I've started by defining an initial schema for the graph, i.e. some static vertices and edges. I would then ingest concrete data entries (vertices) linked to the static ones. I decided not to rely on any graph DB provider's schema definition specifics, but rather to implement the schema directly in the graph as vertices and edges. The approach is similar to the one here

Now comes the topic of graph schema evolution: at some point I'd like to add new schema vertices and refactor the old ones. For relational databases there are multiple refactoring tools for that purpose, like Liquibase or Flyway, which I have used before. However, there are not so many in the graph DB domain. A prominent one is Liquigraph for Neo4j, but I found nothing based on Gremlin except for this one, which seems promising but is not actively maintained or used.

So, could you point me to some change management tools, based on Gremlin? Or share some strategies for evolving the graph schema...
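To make the question concrete, here is a minimal sketch of what a Liquibase-style changelog could look like when the schema lives in the graph as vertices and edges. Everything in it is an assumption for illustration: `Graph` is a tiny in-memory stand-in rather than a TinkerPop API, and `ChangeSet`, `migrate` and the `changeSet:` marker vertices are hypothetical names. With a real provider, each change set body would be a Gremlin traversal.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Graph:
    """In-memory stand-in for a graph database (hypothetical, not TinkerPop)."""
    vertices: Dict[str, dict] = field(default_factory=dict)         # id -> properties
    edges: Set[Tuple[str, str, str]] = field(default_factory=set)   # (out, label, in)


@dataclass(frozen=True)
class ChangeSet:
    """One uniquely identified schema change, applied in changelog order."""
    change_id: str
    apply: Callable[[Graph], None]


def migrate(graph: Graph, changelog: List[ChangeSet]) -> List[str]:
    """Apply every change set that has not been applied yet.

    Applied ids are stored as marker vertices in the graph itself, so the
    migration history travels with the data. Returns the ids applied now.
    """
    applied_now = []
    for change in changelog:
        marker = f"changeSet:{change.change_id}"
        if marker in graph.vertices:
            continue  # already applied on an earlier run
        change.apply(graph)
        graph.vertices[marker] = {"label": "changeSet"}
        applied_now.append(change.change_id)
    return applied_now


def add_person_type(g: Graph) -> None:
    g.vertices["type:Person"] = {"label": "schemaType", "name": "Person"}


def add_company_and_works_for(g: Graph) -> None:
    g.vertices["type:Company"] = {"label": "schemaType", "name": "Company"}
    g.edges.add(("type:Person", "worksFor", "type:Company"))


CHANGELOG = [
    ChangeSet("001-person", add_person_type),
    ChangeSet("002-company", add_company_and_works_for),
]

graph = Graph()
first_run = migrate(graph, CHANGELOG)   # applies both change sets
second_run = migrate(graph, CHANGELOG)  # nothing left to apply
```

Keeping the applied-change markers inside the graph is only one option; they could equally live in an external store, at the cost of keeping the two in sync.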

Thank you,
Nikolay

Ryan Wisnesky

Aug 24, 2019, 1:38:38 PM
to gremli...@googlegroups.com
Hi Nikolay,

If you are willing to think of directed multi-graphs as free categories, then based on your description here and the links provided I think functorial data migration [http://categoricaldata.net] may apply to your scenario. There are people on this list more qualified to speak to that than me, probably.

Some of that theory is being 'ported' to algebraic property graphs in a paper we can provide off-list; in that framework, you can represent graph evolution as a sequence of graph morphisms (of various kinds) G1 -> ... -> GN. Once you can do that, you can define schemas, evolutions of them, and other constructions. The tooling support here is primitive compared to the more general setting of categories mentioned above, however.

Ryan

Stephen Mallette

Aug 26, 2019, 6:45:34 AM
to gremli...@googlegroups.com
There definitely aren't many schema management tools in the graph world that I'm aware of, at least not on the level of what exists for RDBMS. I'd imagine that for TinkerPop one mostly relies on scripts executed in some form of administrative fashion. It's not especially nice or well automated in a structural way, but I'm pretty sure that's what folks have always done. TinkerPop itself doesn't focus on "tools" all too much, and given that TinkerPop 3.x and earlier versions have also not concerned themselves with trying to create an agnostic approach to graph schemas, we have typically relied on graph providers and their tools, as well as third-party development, to help build out this space.






Joshua Shinavier

Aug 26, 2019, 11:51:16 AM
to gremli...@googlegroups.com
Hi Nikolay,

As Stephen says, there hasn't been a lot of emphasis on a formal, vendor-agnostic property graph data model and schema language in TinkerPop. IMO, this is something we should, and can, fix. Developers love the simplicity of property graphs (vis-a-vis RDF), but when you start getting serious about "knowledge graphs", the lack of a schema language really becomes a hindrance, as you have found.

In general, I don't think it is necessary to embed a schema representation in the graph database (so long as you have some external representation, like a set of files, which is in sync with it). However, back when we were using JanusGraph (which does have an embedded schema) at Uber, we used the approach of computing a semantic diff between consecutive versions of a schema, which was then used to patch the database as we moved from one version to the next.
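A rough sketch of the diff-and-patch idea, under loud assumptions: this is not the Uber tooling and not the JanusGraph management API. The schema is modelled as a plain dictionary from vertex label to property types, and `diff_schemas`, `apply_patch` and the operation names are hypothetical. In practice each operation would be translated into provider-specific schema calls.

```python
from typing import Dict, List, Tuple

Schema = Dict[str, Dict[str, str]]  # vertex label -> {property name -> data type}
Op = Tuple[str, ...]                # e.g. ("add_property", label, name, type)


def diff_schemas(old: Schema, new: Schema) -> List[Op]:
    """Compute the operations that turn schema `old` into schema `new`."""
    ops: List[Op] = []
    for label in sorted(new.keys() - old.keys()):
        ops.append(("add_label", label))
        for prop, dtype in sorted(new[label].items()):
            ops.append(("add_property", label, prop, dtype))
    for label in sorted(old.keys() - new.keys()):
        ops.append(("drop_label", label))
    for label in sorted(old.keys() & new.keys()):
        old_props, new_props = old[label], new[label]
        for prop in sorted(new_props.keys() - old_props.keys()):
            ops.append(("add_property", label, prop, new_props[prop]))
        for prop in sorted(old_props.keys() - new_props.keys()):
            ops.append(("drop_property", label, prop))
        for prop in sorted(old_props.keys() & new_props.keys()):
            if old_props[prop] != new_props[prop]:
                ops.append(("change_type", label, prop, new_props[prop]))
    return ops


def apply_patch(schema: Schema, ops: List[Op]) -> Schema:
    """Apply diff operations to a copy of the deployed schema."""
    patched = {label: dict(props) for label, props in schema.items()}
    for op in ops:
        kind = op[0]
        if kind == "add_label":
            patched[op[1]] = {}
        elif kind == "drop_label":
            del patched[op[1]]
        elif kind in ("add_property", "change_type"):
            patched[op[1]][op[2]] = op[3]
        elif kind == "drop_property":
            del patched[op[1]][op[2]]
    return patched


v1 = {"Person": {"name": "string", "age": "string"}, "Trip": {"id": "long"}}
v2 = {"Person": {"name": "string", "age": "int", "email": "string"},
      "Company": {"name": "string"}}

patch = diff_schemas(v1, v2)
deployed = apply_patch(v1, patch)  # equals v2 after patching
```

Operations such as `drop_label` or `change_type` usually imply a data migration as well, which this sketch deliberately leaves out.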

Stay tuned for more information on the schema language we developed at Uber, and for the paper ("Algebraic Property Graphs") Ryan mentioned, which formalizes the graph data model. It has been a goal of mine for a while now to provide a common schema API in TinkerPop4, and I think we are pretty close.

Josh


Nikolay Valchev

Aug 28, 2019, 6:31:41 AM
to Gremlin-users
Hi Josh, hi Ryan,

Thanks for the resources! You are doing really amazing work! Looking forward to seeing certain aspects embodied in TinkerPop 4. Btw, could you share what the plans for that are? The latest info I see here may be incomplete or outdated. Are there plans to add further semantic restrictions (OWL style) and an inference API?

I agree that embedding the schema in the graph is not necessary. However, it might be beneficial. Why? In our particular case (which is standard for a knowledge graph system) we use the graph schema in two ways:
  1. Semantic validation - validate new incoming changes to the graph (on a transactional level), or run scheduled integrity checks that might scan the whole graph in an OLAP manner.
  2. Inferencing - derive new relations, based on the existing ones and some semantic rules.
Our first naive approach is to define the schema as embedded in the graph and implement 1. and 2. via Gremlin queries on the graph. By contrast, if the schema is defined outside the graph, then the validation and inferencing would theoretically be more complex. I say theoretically, because we might have problems expressing the validations and inferencing purely as queries that read the schema and then apply it to the data at hand. Time will tell whether the approach falls short there.
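The two uses above can be sketched roughly as follows. This is plain Python over edge sets, not Gremlin, and all names (`SCHEMA_EDGES`, `INSTANCE_OF`, `violations`, `infer`, the `basedIn` rule) are made up for illustration. The point is only that both validation and inference read the schema part of the graph and apply it to the data part.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (out vertex, edge label, in vertex)

# Schema part of the graph: which edge labels may connect which schema vertices.
SCHEMA_EDGES: Set[Edge] = {
    ("Person", "worksFor", "Company"),
    ("Company", "locatedIn", "City"),
    ("Person", "basedIn", "City"),
}

# Data part of the graph: every data vertex is linked to its schema vertex.
INSTANCE_OF: Dict[str, str] = {"alice": "Person", "acme": "Company", "sofia": "City"}
DATA_EDGES: Set[Edge] = {("alice", "worksFor", "acme"), ("acme", "locatedIn", "sofia")}


def violations(data_edges: Set[Edge], instance_of: Dict[str, str],
               schema_edges: Set[Edge]) -> List[Edge]:
    """1. Validation: data edges that have no matching schema edge."""
    return sorted(
        e for e in data_edges
        if (instance_of.get(e[0]), e[1], instance_of.get(e[2])) not in schema_edges
    )


def infer(data_edges: Set[Edge], rules: List[Tuple[str, str, str]]) -> Set[Edge]:
    """2. Inference: apply chain rules (first, second, derived) to a fixpoint.

    A rule says: a -first-> b and b -second-> c implies a -derived-> c.
    Returns only the newly derived edges.
    """
    edges = set(data_edges)
    changed = True
    while changed:
        changed = False
        for first, second, derived in rules:
            new = {
                (a, derived, c)
                for (a, l1, b) in edges if l1 == first
                for (b2, l2, c) in edges if l2 == second and b2 == b
            }
            if not new <= edges:
                edges |= new
                changed = True
    return edges - set(data_edges)


RULES = [("worksFor", "locatedIn", "basedIn")]

ok = violations(DATA_EDGES, INSTANCE_OF, SCHEMA_EDGES)
bad = violations(DATA_EDGES | {("sofia", "worksFor", "alice")}, INSTANCE_OF, SCHEMA_EDGES)
derived = infer(DATA_EDGES, RULES)
```

Inferred edges can themselves be run through `violations`, which is one way to catch rules that contradict the schema.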


Josh, as I understand it, your current approach at Uber is to have a proprietary (YAML-based) schema definition language for your models, with dedicated generators for the other schema representations. Are you then generating a graph-DB-specific schema, or some custom schema validation logic for the Property Graph format? (I'm referring to the slide here)

--
Nikolay 



Joshua Shinavier

Aug 28, 2019, 2:29:26 PM
to gremli...@googlegroups.com
Hi Nikolay,

Responses inline.


On Wed, Aug 28, 2019 at 3:31 AM Nikolay Valchev <nvva...@gmail.com> wrote:
> Hi Josh, hi Ryan,
>
> Thanks for the resources! You are doing really amazing work! Looking forward to seeing certain aspects embodied in TinkerPop 4. Btw, could you share what the plans for that are? The latest info I see here may be incomplete or outdated.

Too early to say for sure, but I expect that APG will become the foundational data model for TP4. Based on my conversations with Marko, I believe mm-ADT is to remain a separate entity for now, and a little later on, we will use the commonalities of the two frameworks to bridge the gap (APG being closer to the current, informal property graph data model of TP3, and mm-ADT stretching the boundaries a little more). Don't hold me to that.

 
> Are there plans to add further semantic restrictions (OWL style) and an inference API?

TinkerPop has always avoided being prescriptive about semantics, which is left entirely to the application. That approach has served TP 0 through 3 well, and I would be surprised to hear anyone suggest that we change things now. However, a proper schema language offers a lot more in the way of structural constraints. Well-defined data types for edges and properties allow you to define more precisely which types of elements can be connected with which other types of elements and primitive values, and to perform type inference similar to what you find in statically typed programming languages. tl;dr restrictions yes, inference yes, semantics no.

That being said, there is an interesting conversation going on right now between Ryan (co-author of the APG paper) and Henry Story (SemWeb developer currently doing a PhD involving category theory) about CT and RDF semantics. Again, I think core TinkerPop should concern itself only with graph structure, but there are certainly opportunities for bringing the data model together with formal knowledge representation and inference.

 
> I agree that embedding the schema in the graph is not necessary. However, it might be beneficial. Why? In our particular case (which is standard for a knowledge graph system) we use the graph schema in two ways:
>   1. Semantic validation - validate new incoming changes to the graph (on a transactional level), or run scheduled integrity checks that might scan the whole graph in an OLAP manner.
>   2. Inferencing - derive new relations, based on the existing ones and some semantic rules.
> Our first naive approach is to define the schema as embedded in the graph and implement 1. and 2. via Gremlin queries on the graph. By contrast, if the schema is defined outside the graph, then the validation and inferencing would theoretically be more complex. I say theoretically, because we might have problems expressing the validations and inferencing purely as queries that read the schema and then apply it to the data at hand. Time will tell whether the approach falls short there.


OK. You are drawing on both ABox and TBox data in your Gremlin queries, to borrow some ontology terms. That is not how I anticipate APG schemas typically being used in TinkerPop, but I am not going to say it is not a good approach. Often, if you control the mapping of source data into the graph, you can simply assume that it is valid according to the schema. The schema allows you to construct efficient traversals at compile time, and does not need to be queried at run time.

 
> Josh, as I understand it, your current approach at Uber is to have a proprietary (YAML-based) schema definition language for your models, with dedicated generators for the other schema representations. Are you then generating a graph-DB-specific schema, or some custom schema validation logic for the Property Graph format? (I'm referring to the slide here)

Not proprietary for too much longer, but yes that is accurate. At present in the knowledge graph project at Uber, we actually have an additional YAML format which is used in connection with property graph data models. The tooling carries schemas between the model-agnostic ("logical") model (which is APG with some additional bells and whistles like collection and enum types) and the property graph format, along with all of the others. I would like to implement a similar, open-source and JVM-based framework for TP4.

Josh


 

Ryan Wisnesky

Aug 28, 2019, 3:24:16 PM
to gremli...@googlegroups.com
I think that what is being discussed is not whether schemas should exist inside or outside of graphs, but whether every graph must have a schema, and if so, is there a unique most general one, and if so, what is the complexity of constructing it, and then, what can you do with it once you have it - typical questions of type inference, lifted to the level of graphs. As schemas become more complex, (easy) type inference gradually turns into (hard) proof search.

As for APG specifically as defined in the paper, its (completely standard) product-sum data has

- a sound and complete principal type inference system based on Haskell-style qualified types: https://pdfs.semanticscholar.org/e529/13a157b06a95951e19cde54a776e16adee11.pdf

and it can be viewed as a deductive system / logic via the Curry-Howard correspondence:

- a decidable entailment relation: http://www.tac.mta.ca/tac/volumes/8/n5/n5.pdf

Other graph data models and schema languages would presumably have different type inference and entailment properties, and these properties provide an axis along which to compare data models.

For what most would call data integrity constraints proper, such as functional dependencies or primary/foreign keys, one must add a constraint language on top of paper-APG. The APG model itself says nothing about which further constraint languages to use but, being completely formal, it does allow you to prove properties about how any particular constraint language behaves when used with APG. For example, you can show in APG that if, in two APGs G1 and G2, every element has a primary key that does not mention a label, then there is at most one morphism, a monomorphism (inclusion), from G1 to G2. Results from relational database theory also generalize to paper-APG, so you can, e.g., check that TGDs and EGDs hold on APGs by running conjunctive queries and comparing the results.
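As a loose illustration of checking dependencies by running queries and comparing results, here is a sketch over plain labelled edges. It is an assumption-laden simplification and not the APG formalism from the paper: `egd_holds` checks one equality-generating dependency (an edge label is functional), and `tgd_holds` checks one tuple-generating dependency (the target of a `premise` edge must have an outgoing `conclusion` edge). Both function names are hypothetical.

```python
from typing import Set, Tuple

Edge = Tuple[str, str, str]  # (out vertex, edge label, in vertex)


def egd_holds(edges: Set[Edge], label: str) -> bool:
    """EGD: x -label-> y and x -label-> z implies y = z.

    Runs the conjunctive query for the premise, then checks the equality
    on every result row.
    """
    rows = {
        (y, z)
        for (x1, l1, y) in edges if l1 == label
        for (x2, l2, z) in edges if l2 == label and x2 == x1
    }
    return all(y == z for (y, z) in rows)


def tgd_holds(edges: Set[Edge], premise: str, conclusion: str) -> bool:
    """TGD: x -premise-> y implies there is some z with y -conclusion-> z.

    Runs the premise query and the premise-plus-conclusion query,
    then compares the two result sets.
    """
    front = {y for (_x, l1, y) in edges if l1 == premise}
    back = {
        y
        for (_x, l1, y) in edges if l1 == premise
        for (y2, l2, _z) in edges if l2 == conclusion and y2 == y
    }
    return front == back


GOOD = {
    ("alice", "worksFor", "acme"),
    ("bob", "worksFor", "acme"),
    ("acme", "locatedIn", "sofia"),
}
BROKEN = GOOD | {("alice", "worksFor", "initech")}

good_egd = egd_holds(GOOD, "worksFor")                 # each person has one employer
good_tgd = tgd_holds(GOOD, "worksFor", "locatedIn")    # every employer has a location
broken_egd = egd_holds(BROKEN, "worksFor")             # alice now has two employers
broken_tgd = tgd_holds(BROKEN, "worksFor", "locatedIn")  # initech has no location
```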

Nikolay Valchev

Aug 29, 2019, 11:35:54 AM
to Gremlin-users
Thanks for the input folks!

For now we'll continue with our current approach of a schema built into the graph, to see how feasible it is.

Additionally, we would be ingesting the data from a set of Parquet files/topic, where the data is already cleansed and compliant with a schema. Therefore, the other idea we would explore is to rely on that Parquet schema as the single source of truth, and then introduce a configuration mapping between the Parquet schema and the expected vertices and edges of the graph model. This config would be used for generalized ingestion into the graph, thereby enforcing the schema on write. The inferencing and reasoning would be implemented in code, but could again be generalized via configuration with a proper axiom and mapping syntax.
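A minimal sketch of such a mapping-driven ingestion, with several assumptions: rows are plain dictionaries standing in for Parquet records (no Parquet library is used), `SOURCE_SCHEMA` is a simplified stand-in for the Parquet schema, and the `MAPPING` layout and the `ingest` function are hypothetical. The output is a vertex map and an edge set that would then be written to the graph.

```python
from typing import Dict, List, Set, Tuple

VertexId = Tuple[str, object]            # (vertex label, source key)
Edge = Tuple[VertexId, str, VertexId]    # (out vertex, edge label, in vertex)

# Stand-in for the Parquet schema: column name -> expected Python type.
SOURCE_SCHEMA = {"employee_id": int, "employee_name": str, "company_id": int}

# Configuration mapping from source columns to graph elements.
MAPPING = {
    "vertices": [
        {"label": "Person", "id_column": "employee_id",
         "properties": {"name": "employee_name"}},
        {"label": "Company", "id_column": "company_id", "properties": {}},
    ],
    "edges": [
        {"label": "worksFor", "out": "Person", "in": "Company"},
    ],
}


def ingest(rows: List[dict], schema: dict, mapping: dict):
    """Turn source rows into vertices and edges, rejecting non-compliant rows."""
    vertices: Dict[VertexId, dict] = {}
    edges: Set[Edge] = set()
    for row in rows:
        # Schema on write: refuse rows that do not match the source schema.
        for column, expected in schema.items():
            if not isinstance(row.get(column), expected):
                raise ValueError(f"column {column!r} violates the source schema: {row!r}")
        ids = {}
        for vm in mapping["vertices"]:
            vid = (vm["label"], row[vm["id_column"]])
            props = {prop: row[col] for prop, col in vm["properties"].items()}
            vertices.setdefault(vid, {}).update(props)  # merge repeated vertices
            ids[vm["label"]] = vid
        for em in mapping["edges"]:
            edges.add((ids[em["out"]], em["label"], ids[em["in"]]))
    return vertices, edges


ROWS = [
    {"employee_id": 1, "employee_name": "Alice", "company_id": 10},
    {"employee_id": 2, "employee_name": "Bob", "company_id": 10},
]
vertices, edges = ingest(ROWS, SOURCE_SCHEMA, MAPPING)
```

With this shape, a schema change becomes a change to `SOURCE_SCHEMA` or `MAPPING`, which is where an application-level migration step would hook in.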

In both approaches, schema change management (evolution) should be solved at the application level. It should involve an explicit schema change or a reconfiguration of the translation mapping. In both cases there might be a need to migrate existing data as well.

Really looking forward to TinkerPop 4 and the new stuff there!

Thanks,
Nikolay

Petko Boqdjiev

Oct 23, 2019, 11:03:50 AM
to Gremlin-users

> Not proprietary for too much longer, but yes that is accurate. At present in the knowledge graph project at Uber, we actually have an additional YAML format which is used in connection with property graph data models. The tooling carries schemas between the model-agnostic ("logical") model (which is APG with some additional bells and whistles like collection and enum types) and the property graph format, along with all of the others. I would like to implement a similar, open-source and JVM-based framework for TP4.

Hi,

I am pretty interested in trying your tooling for the model-agnostic model.
In our project we need to have a single source of truth about the data model because many entities are involved in our process.
Is there any update on the open source status of this?

Best regards,
Petko   

Joshua Shinavier

Oct 30, 2019, 3:01:59 PM
to gremli...@googlegroups.com
Hey Petko,

Just a quick follow-up. Your question prompted me to get the necessary approval and finish writing the proposal for making Uber.Schema open source. I have created a branch in which I have removed all Uber-specific code, and all dependencies with incompatible licenses. This also led me to realize that the Pandoc mapping (which we use for generating schema docs) has to be replaced, as Pandoc is GPLv2.

There is now a code review process which may take several weeks. Meanwhile, the new Scala API is looking for a home. I am going to be able to start committing to a branch of the TinkerPop repo if appropriate.

Josh






Petko Boqdjiev

Oct 30, 2019, 6:50:47 PM
to gremli...@googlegroups.com
Hi,

Is there a place where I can take a look at the solution, or should I wait for the code review to finish?
Also thank you for the quick response!

Best regards,
Petko Boyadzhiev 

Stephen Mallette

Oct 31, 2019, 6:37:39 AM
to gremli...@googlegroups.com
Hi Josh, regarding:

> I am going to be able to start committing to a branch of the TinkerPop repo if appropriate.

There's been a fair bit of discussion here on the user list about this work, but if you're getting more serious about committing something to TinkerPop, please begin a DISCUSS thread on the dev list that summarizes/explains your plans. All project-related discussion and decisions need to happen on that Apache list to be official. It's cool to get a wider audience view here on users, but we always have to bring that conversation back to dev later.

Thanks

Joshua Shinavier

Oct 31, 2019, 1:55:58 PM
to gremli...@googlegroups.com
Hi Stephen,

Agreed. I have just gotten an informal (but probably definitive) go-ahead to start committing this code to TinkerPop. No code will actually be committed until that becomes an official go-ahead from both Uber and the TinkerPop PMC.

Petko, I can't share anything publicly at this time <wink>.

Josh

