Hey Marko,
Thanks for the feedback and enthusiasm. IMO, the starting point for a standard has been staring us in the face: Graph.Features, or rather Graph.Features++. This has already been vetted to a certain extent by the community. Although many of the features deal with data reading and writing, transactions, etc. there is a subset that essentially defines a graph data model. What are the primitive data types of the model? Which basic structures / element kinds are supported? What sort of set-theoretical constraints exist among the types?
If you peek into other property graph schema frameworks -- like
Neo4j's or
JanusGraph's -- you find additional vocabulary for existence constraints, more fine-grained notions of cardinality and multiplicity, enumerations of labels and keys, and slightly different sets of primitive data types. All of this vocabulary is potential raw material for a standard -- i.e. a language for describing property graph data models. However, I also think it is important to choose a solid theoretical foundation rather than just picking and choosing among commonly used terms. I was excited to learn that you had been thinking about algebras, as I see the potential foundation in algebraic data types, and of course category theory.
For a start, data models will be much more portable if we can standardize on a set of core data types, and it should not be hard to come up with a reasonable consensus on these types. At Uber, we have just called the basic types Boolean, String, Integer, Float, etc. and provided parameters for numeric precision and signedness. Define a reasonable set of primitive types, then think about bindings for XML Schema, interface description languages, and individual programming languages.
As I suggested in my previous email, we can also think of vertex labels, edge labels, and property keys as types. Vertex labels are atomic, and IMO every vertex should have exactly one, just as every primitive value has exactly one data type. You can think of unlabeled vertices as having a designated "null" label. Some graph back-ends may not support the null label, and this should be explicit. On the other hand, edge labels and property keys are not atomic; each one has projections to two other types. In terms of algebra, an edge label or property key contains a product, or ordered pair, of two types. Btw. it is a short step from there to hyper-relationships, which are tuples, i.e. the product of any number of other types. Likewise for relational schemas.
It is also tempting to include sum types in the schema language. For example, the sum of the unit type with any other type is an "optional". The sum of any two types is an "either/or", which enables pattern matching on types. Pattern matching is currently not typical in property graph databases, but there are some really interesting possibilities here, and it enables crossover between property graphs and languages like Protobuf, Thrift, and Avro (which are the bread and butter of data exchange in many places). Among other things, this makes it possible to expose bigger chunks of enterprise data as a graph.
With respect to query languages, there are some concepts from functional programming which are a great fit for data models based on algebraic data types, and the good news here is that Gremlin already embodies some of them. For example, parametric polymorphism makes queries "category agnostic" -- a mapping, or in category theoretical terms, a functor, literally cannot depend on the identities of the objects it deals with, and must preserve the compositional structure of the arrows between them.
While the
gremlin-scala wiki points out that Gremlin traversals are not actually monads, in many cases you can think of them that way. A monad is a mapping that consume a thing, like a vertex of a particular label, and produce a container full of other things -- like an iterator of vertices of another label -- together with a rule for binding steps together so as to feed the output of one step into the input of the next. Some steps are functionally pure, whereas others carry state. It would be really interesting to define a purely functional subset of Gremlin, although in my opinion the case for standardization is not as strong for graph query languages as for data models.
Josh