[PROPOSAL] Strict Schema


Florian Hockmann

Dec 14, 2017, 8:35:33 AM
to JanusGraph developers

Currently, the schema for JanusGraph is basically only a list of allowed labels (for vertices and edges) and available properties. What's missing, in my opinion, is the option to specify which vertex and edge labels can have which property keys, and which edge labels can connect which vertex labels.


Just to give an idea of what I mean, here are two examples for the Graph of Gods:

  • Gods can have the property keys name and age, whereas locations only have a name (no age allowed).
  • The edge label brother can connect gods, but not a god with a location.

This is of course only a toy graph, but I suspect that most real-world data models contain similar constraints.

When we allow users to enforce those constraints inside of JanusGraph then they can be sure that no user of their database can insert data that doesn't comply with these constraints (e.g., a brother edge that connects a god with a location). So, a strict schema ensures that the graph is in a consistent state with respect to those constraints.[1]
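
To make the two example constraints concrete, here is a minimal, self-contained sketch of what such a check could look like. It is only an illustration: StrictSchema, SchemaViolation, check_vertex and check_edge are hypothetical names, not JanusGraph APIs, and the "lives" edge is added just to show a god-to-location connection that would be allowed.

```python
from dataclasses import dataclass, field


class SchemaViolation(Exception):
    """Raised when an element does not comply with the strict schema."""


@dataclass
class StrictSchema:
    # vertex label -> property keys allowed on vertices with that label
    vertex_properties: dict = field(default_factory=dict)
    # edge label -> set of allowed (out-vertex label, in-vertex label) pairs
    connections: dict = field(default_factory=dict)

    def check_vertex(self, label, properties):
        if label not in self.vertex_properties:
            raise SchemaViolation(f"unknown vertex label: {label}")
        extra = set(properties) - self.vertex_properties[label]
        if extra:
            raise SchemaViolation(f"{label} may not have properties {sorted(extra)}")

    def check_edge(self, label, out_label, in_label):
        if (out_label, in_label) not in self.connections.get(label, set()):
            raise SchemaViolation(f"{label} may not connect {out_label} -> {in_label}")


# Toy schema for the Graph of Gods (assumed, for illustration only)
GODS = StrictSchema(
    vertex_properties={"god": {"name", "age"}, "location": {"name"}},
    connections={"brother": {("god", "god")}, "lives": {("god", "location")}},
)

GODS.check_vertex("god", {"name": "jupiter", "age": 5000})  # passes
GODS.check_edge("brother", "god", "god")                    # passes
```

With this schema, a location with an age, or a brother edge from a god to a location, would raise SchemaViolation instead of being written.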

In schema-less databases this schema is often included implicitly in the client applications as those applications need to know how they can access the data. So even if the database is schema-less, there is still an implicit schema. This means that updating the (implicit) schema isn't really easier without having it explicitly defined in the database as it needs to be changed in the client applications.

Having this schema explicitly defined in JanusGraph also makes it easy to tell new users what kind of data they can expect, e.g., they know that a location can't have an age, but a god can. This would also allow tools to fetch the schema from a JanusGraph instance to visualize it. Such a visualization makes it much easier to reason about the schema as it provides an easy to understand representation of it.

Finally, an explicit schema would also allow OGM (object graph mapper) tools to fetch the schema from JanusGraph and translate it into entity classes which makes it possible to only have the schema defined in just one place (DRY principle).
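
As a rough sketch of the OGM idea, assuming the schema has already been fetched into a plain dictionary (the SCHEMA layout and the entity_classes helper are hypothetical, not part of any existing tool), entity classes could be derived like this:

```python
from dataclasses import fields, make_dataclass

# Hypothetical schema as a tool might receive it: label -> {property key: type}
SCHEMA = {
    "god": {"name": str, "age": int},
    "location": {"name": str},
}


def entity_classes(schema):
    """Build one dataclass per vertex label from the schema definition."""
    return {
        label: make_dataclass(label.capitalize(), list(props.items()))
        for label, props in schema.items()
    }


classes = entity_classes(SCHEMA)
God = classes["god"]
Location = classes["location"]

jupiter = God(name="jupiter", age=5000)
print([f.name for f in fields(Location)])
```

Because the classes are generated, the schema stays defined in one place; Location(name="sky", age=1) fails with a TypeError since the schema gives locations no age.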

So, in short, I propose that JanusGraph gets a strict schema, either as the only option or as an additional option for backwards-compatibility with existing deployments and their data models.

Regards,

Florian


[1] We actually had the problem with our JanusGraph database that it contained data which shouldn’t be possible. Our schema models the network traffic of malware samples, so we have edge labels like SampleToDomain or SampleToIp that connect samples with domains or IP addresses they contacted. At some point we found edges in our graph that connected samples with domains and had an edge label of SampleToIp which is problematic as our applications of course expect an IP address when they follow a SampleToIp edge.

Ted Wilmes

Dec 14, 2017, 11:49:25 AM
to JanusGraph developers
Hi Florian,
I think this would be a very worthwhile addition. Provided folks are in agreement, I think a good next step would be to spec out the additions to JanusGraphManagement, a format for the schema definition that could be ingested by callers to infer the current schema (object graph, json, etc.), and also to define the interaction of this new feature and user queries, or in other words, what schema enforcement will look like.
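
Purely as an illustration of what an ingestible schema definition might look like, here is a sketch that serializes a schema to JSON. The document layout (vertexLabels, edgeLabels, connections) and the export_schema helper are assumptions made for this example, not an existing or proposed JanusGraph format.

```python
import json


def export_schema(vertex_properties, connections):
    """Serialize a schema to JSON so that callers can inspect it.

    vertex_properties: vertex label -> set of property keys
    connections: edge label -> set of (out label, in label) pairs
    """
    doc = {
        "vertexLabels": [
            {"name": label, "properties": sorted(keys)}
            for label, keys in sorted(vertex_properties.items())
        ],
        "edgeLabels": [
            {
                "name": label,
                "connections": [{"out": o, "in": i} for o, i in sorted(pairs)],
            }
            for label, pairs in sorted(connections.items())
        ],
    }
    return json.dumps(doc, indent=2)


schema_json = export_schema(
    {"god": {"name", "age"}, "location": {"name"}},
    {"brother": {("god", "god")}},
)
print(schema_json)
```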

Thanks,
Ted

David Pitera

Dec 14, 2017, 12:07:38 PM
to Ted Wilmes, JanusGraph developers
I haven't fully thought about this yet, but my initial reaction is that (1) the schema enforcement should be opt-in and (2) we want to avoid performance degradation in the case that schema enforcement is not enabled. Of course schema enforcement will require some overhead.

It seems that the notion of a schema in JanusGraph is currently moot; it does enforce some data typing, but the vertex labels, edge labels, and property keys seem to be created mostly for their use in index definitions.

--
You received this message because you are subscribed to the Google Groups "JanusGraph developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to janusgraph-dev+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/janusgraph-dev/69428dda-baa3-489c-99a1-c316e0728e09%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Austin Sharp

Dec 14, 2017, 12:52:08 PM
to JanusGraph developers
In our use of Titan and JanusGraph for several years now we have had a need for this. In fact, over two years ago we built our own ORM-like, schema-enforcement layer and have been using it ever since. So I think this would be an exceptionally useful addition for any serious users of JanusGraph, and would have saved us a lot of work and pain.

ankur...@gmail.com

Dec 18, 2017, 12:07:01 AM
to JanusGraph developers
My thought: schema enforcement should apply only during graph CRUD operations, not in traversals, and even that should be optional depending on the use case.

~

Rainer Pichler

Dec 19, 2017, 11:55:01 AM
to JanusGraph developers
We at CELUM also put a custom model on top of JanusGraph that supports a type system and multi-inheritance for vertex/edge types.

The global scope of property key definitions forces us to define the data type of all properties as Object, because same-named properties on elements of different types might have different data types (this also revealed the issue https://groups.google.com/forum/#!topic/janusgraph-dev/3KIDmHuTcwo). Overcoming this limitation should then reduce storage overhead, as we could work with concrete property value types.

We solved the traversal-time schema enforcement by having a (compile-time) type-safe query language on top of Gremlin that also implements the type inheritance logic (Intro: https://www.celum.com/en/blog/technology/a-querys-quest). Type inheritance is modelled via additional properties. Soon, I will release a blog article that elaborates on one of our use cases and highlights the benefits of a strict schema and type-safety.

-Rainer Pichler
https://twitter.com/rainerpichler

Ted Wilmes

Jan 7, 2018, 10:11:55 AM
to JanusGraph developers
Hello,
That's helpful input, Rainer, and brings up a good question as to how far we want to
go with this. I think one option would be to keep the PropertyKey type definitions as
they are now (global), but allow them to be mapped to specific vertex and edge
labels. The second would be more in line with what you're suggesting, if I'm understanding
correctly, which is that properties are only created in the context of a specific vertex
or edge label. This would be much closer to the way folks are used to using
an RDBMS, e.g. the "name" property on a "Person" vertex could be of a different type than
the "name" property on a "Building" vertex. I think this could be particularly helpful
if we add other constraints later. For example, say we have an "age" property
on a Person vertex and allow a user to specify a min and a max, or a not-null.
Ideally, they'd be able to specify a different constraint in the context of another
vertex/edge label. This could still be done with a global property key definition, but the
constraints would then be tied to the (element label, property key) tuple rather than just the
unique property key.
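
A small sketch of that last point, with constraints keyed by the (element label, property key) tuple rather than by the property key alone. Range, CONSTRAINTS, validate and the concrete bounds are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Inclusive min/max constraint on a numeric property value."""
    minimum: int
    maximum: int

    def allows(self, value):
        return self.minimum <= value <= self.maximum


# The property key "age" is global, but each label carries its own constraint.
CONSTRAINTS = {
    ("Person", "age"): Range(0, 150),
    ("Building", "age"): Range(0, 5000),
}


def validate(label, key, value):
    """Return True if the value satisfies the constraint for this label/key."""
    rule = CONSTRAINTS.get((label, key))
    return rule is None or rule.allows(value)


print(validate("Person", "age", 900), validate("Building", "age", 900))
```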

I had put together some examples of the first simpler approach, but now that I 
think about it, I'd like us to determine how far down this rabbit hole we should 
go on the first pass of this schema support work with the high level options being:

1) Define property keys globally as they are now, but allow the user to map
them to vertex and edge labels. The implication is that there is only one of each
property key (e.g. name is always a String).

2) Define property keys in the context of a specific vertex or edge label. There 
can be more than one property key with the same name. Think column definitions in an RDBMS.

Historically, the first would be adequate for me in the majority of cases, but the 
flexibility of the second would be quite powerful.
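
To contrast the two options, here is a minimal sketch of both as in-memory registries. GlobalKeySchema and LabelScopedSchema and their methods are hypothetical names for this example only; they do not mirror the JanusGraphManagement API.

```python
class GlobalKeySchema:
    """Option 1: one global definition per property key, mapped to labels."""

    def __init__(self):
        self.key_types = {}   # property key name -> data type
        self.label_keys = {}  # element label -> set of property key names

    def make_property_key(self, name, data_type):
        existing = self.key_types.setdefault(name, data_type)
        if existing is not data_type:
            raise ValueError(f"{name} is already defined as {existing.__name__}")

    def add_properties(self, label, *names):
        unknown = [n for n in names if n not in self.key_types]
        if unknown:
            raise KeyError(f"undefined property keys: {unknown}")
        self.label_keys.setdefault(label, set()).update(names)

    def type_of(self, label, name):
        if name not in self.label_keys.get(label, set()):
            raise KeyError(f"{label} has no property {name}")
        return self.key_types[name]


class LabelScopedSchema:
    """Option 2: property keys exist only in the context of a label."""

    def __init__(self):
        self.key_types = {}  # (element label, property key name) -> data type

    def make_property_key(self, label, name, data_type):
        self.key_types[(label, name)] = data_type

    def type_of(self, label, name):
        return self.key_types[(label, name)]
```

Under option 1, defining "name" a second time with a different type is an error; under option 2, Person.name and Building.name are independent definitions, much like columns in two tables.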

What do you all think would be most helpful based upon your day-to-day modeling work?

Thanks,
Ted

Florian Hockmann

Jan 8, 2018, 7:58:39 AM
to JanusGraph developers
Hi Ted,

I think the second option would be the better one in the long term, as it allows property keys to be defined again for different vertex labels for which they have different meanings. We currently often include the vertex label in property keys as a workaround, to avoid problems when adding indexes on already existing property keys for new vertex labels. So we have property keys like CityName, CountryName, and so on. That shouldn't be necessary anymore with your second option.

However, it's probably much easier to implement the first option as it's closer to the way property keys currently work in JanusGraph. Since even the first option would bring most of the benefits of a strict schema I would suggest that that should be implemented first and the second one in a later version.

Regards,
Florian

Ted Wilmes

Jan 10, 2018, 6:33:42 PM
to JanusGraph developers
I agree with your logic. I'm inclined to work through the first to see if we run into any other pitfalls and then I think that will ultimately help guide us if we 
decide to make PropertyKeys local to specific vertex and edge types. I just added an issue[1] with a first cut at the idea from an API standpoint for us to throw
darts at and update as needed.

Thanks,
Ted

raine...@gmail.com

Jan 11, 2018, 1:16:59 AM
to JanusGraph developers
I also see it the same way as Florian.

As long as there are no property-specific constraints like string length, we can work with global property definitions. After all, the property name itself often implies its data type across different element types, and type conflicts are unlikely (e.g. Name -> String, Size -> Long).

But once there are such constraints (we implement them within our abstraction), element-specific property definitions could prove useful. For example, this information could be communicated to storage backends so that they can take advantage of reduced storage size and faster querying rather than seeing a serialized Java object (e.g. map String(255) to VARCHAR(255)).
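
A sketch of how such a mapping might look, assuming a relational-style backend. PropertyDef and column_type are hypothetical, and the column types besides VARCHAR(255) are chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PropertyDef:
    """Element-specific property definition with an optional length constraint."""
    data_type: type
    max_length: Optional[int] = None


def column_type(prop):
    """Translate a property definition into a backend column type."""
    if prop.data_type is str:
        # A known maximum length lets the backend pick a bounded type.
        return f"VARCHAR({prop.max_length})" if prop.max_length else "TEXT"
    if prop.data_type is int:
        return "BIGINT"
    raise ValueError(f"no column mapping for {prop.data_type.__name__}")


print(column_type(PropertyDef(str, max_length=255)))
```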

-Rainer Pichler
https://twitter.com/rainerpichler

rahul.n.m...@verizon.com

Mar 2, 2018, 1:25:02 PM
to JanusGraph developers
We are currently evaluating JanusGraph for our new project, and it so happens that just this morning we had a discussion about the exact use case described by Ted here:


the "name" property on Person, could be of a different type than the "name" Property on a "Building" vertex.


We found this important feature missing from JanusGraph, and that can impact our final decision. Obviously, we thought about the same workaround as mentioned by Florian: prefixing the property key name with the vertex label.

This will be a very useful addition to JanusGraph.

raine...@gmail.com

Jul 23, 2018, 2:36:19 AM
to JanusGraph developers
Finally, this is the article I promised: https://www.celum.com/de/blog/type-safe-graph-queries

It first discusses a file-system-like part of CELUM's data model, which could serve as practical input to this discussion. Then it shows in depth how a query can make use of this type-safe data model.

-Rainer Pichler