how to model data in graph databases (specifically cayley)


neonst...@gmail.com

Sep 12, 2014, 11:08:09 AM
to cayley...@googlegroups.com
hi,

my question is not completely cayley-specific although i think that the solution for one graph database is not always the same for another so in that sense my question is related to cayley specifically.

how should i model my data using a graph database (specifically cayley)?

my initial thought is that i use the graph database to model relationships and maintain another database for the details of an entity - e.g. A views B is something modeled in a graph but A and B in the graph are simply identifiers that can be used to retrieve the full representations from another database. i've seen that some graph databases allow you to store arbitrary data on nodes and edges but if i'm not mistaken, cayley doesn't allow that. is that right?

this naive approach of mine leaves me struggling with how i can associate data with edges - e.g. A views B and i want to store the time that the viewing happened.

so, i'm wondering if my approach is fundamentally flawed and maybe i could get some help to shift my thinking to be more graph-oriented. i realize this is such a broad, general question and i'd be happy to be referred to sites, blogs, books, etc to read more about this.

thanks,

ben...

Barak Michener

Sep 12, 2014, 5:19:34 PM
to neonst...@gmail.com, cayley...@googlegroups.com
Hi Ben,

So, great question, with a long and ultimately philosophical answer (when it gets to schema modelling). But today we'll avoid philosophy and talk about how these things are largely done. Let's talk a little about the Freebase (and roughly RDF) way of doing things, which Cayley follows.

A and B are identifiers, and that's fine. In fact, more than fine, it's the right way -- they're fairly unique IDs for a specific concept, and their human-readable names are stored on an edge (eg, from the Freebase world, [</m/02mjmr> </type/object/name> "Barack Obama" .]). This way, if there are multiple "Barack Obama"s (or "John Smith"s, more likely), they have different identifiers but share a name -- which makes sense in reality as well (different people who share a name).

So you could use this ID to store more data in another database, but you can also store data in the graph this way. Say, [</m/02mjmr> </people/person/height> 1.85 .] -- knowing that the schema means "height in meters", this is how we associate Obama with his height. You may normally think of this as a row in a table, eg,
  id         |  height  |  name
-------------+----------+----------------
  /m/02mjmr  |  1.85    |  Barack Obama

So each triple* is roughly one cell of that table: the subject is the row's id, the predicate is the column, and the object is the value. There's more to the meanings behind your choice of predicate, but I'm going to skip that for now.
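
To make the row-to-triple correspondence concrete, here is a rough sketch in plain JavaScript. The `rowToTriples` helper and the bare column names used as predicates are made up for illustration; this is not Cayley's API or a real Freebase schema.

```javascript
// Hypothetical helper: turn one table row into subject/predicate/object triples.
// The id column becomes the subject; every other column becomes a predicate.
function rowToTriples(row, idColumn) {
  const subject = row[idColumn];
  return Object.keys(row)
    .filter((column) => column !== idColumn)
    .map((column) => ({ subject, predicate: column, object: row[column] }));
}

const row = { id: "/m/02mjmr", height: 1.85, name: "Barack Obama" };
const rowTriples = rowToTriples(row, "id");

// Print each triple in a loose "subject predicate object ." notation.
for (const t of rowTriples) {
  console.log(`${t.subject} ${t.predicate} ${JSON.stringify(t.object)} .`);
}
```

One row with N non-id columns yields N triples, which is why adding a new "column" to a graph needs no schema change: it is just one more predicate.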

Some graph stores allow arbitrary data on nodes, as Key/Value pairs. I personally disagree with that decision a lot, as any key you wish to put on a node can be "promoted" to an actual predicate and stored as a full triple. This way, you also get join-through-value for free (eg, people with the same name) without custom indexing. That, and there's no temptation to do something now which you'll actually promote later (people like to do this with "has genre" and "node type" quite often, which again, are useful as triples).
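
A small sketch of what join-through-value means, using a plain in-memory array of triples rather than Cayley itself. The identifiers, the `name` predicate, and the `subjectsWith` and `sameName` helpers are all hypothetical.

```javascript
// Toy data: two people share a name but have different identifiers.
const nameTriples = [
  ["/person/1", "name", "John Smith"],
  ["/person/2", "name", "John Smith"],
  ["/person/3", "name", "Barack Obama"],
  ["/person/1", "height", 1.8],
];

// Every subject that has an outgoing `predicate` edge to `object`.
function subjectsWith(predicate, object) {
  return nameTriples
    .filter(([, p, o]) => p === predicate && o === object)
    .map(([s]) => s);
}

// Everyone sharing a name with `person`: go out along "name",
// then come back in along "name", dropping the starting node.
function sameName(person) {
  const names = nameTriples
    .filter(([s, p]) => s === person && p === "name")
    .map(([, , o]) => o);
  const others = new Set();
  for (const name of names) {
    for (const s of subjectsWith("name", name)) {
      if (s !== person) others.add(s);
    }
  }
  return [...others];
}

console.log(sameName("/person/1"));
```

Because the name is an ordinary node reached by an ordinary edge, the join is just two traversal steps. With the name stored as a key/value property on the node, the same lookup would need a separate index on that property.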

But don't just take my word for it: here are the slides and notes for a quite accurate talk I saw on Neo4j by one of the guys at FiftyThree here in NYC, with lessons learned using graphs in the wild: http://aseemk.com/talks/neo4j-with-nodejs
Pay special attention to slides 34 and 35, where they add "Connected data ⇒ nodes, not props" to What We Learned, which is the same argument from a different angle (but the whole talk is great).

So then you ask, well, yeah, that's great for nodes, but what about adding data to links, like "time that the viewing happened"?

For ephemeral data (eg, large logs, or click statistics where you don't care if a few drop) I might hesitate to store it all in a graph -- not that you couldn't, but that's less of what graphs are good at. The problem still stands, though, so let me reframe the question as follows:

"Okay, so I have a relationship of "A adopted_child B" -- how do I add information about the adoption to the "adopted child" link?"

And the sentence betrays the Freebase answer -- "the adoption" is a noun. It's a concept unto itself. Freebase called this a CVT (compound value type); sometimes they're called anonymous nodes, but the net result is the same. Promote the link to its own concept. The following triples might make sense:

A adoptions CVT1 .
CVT1 type adoption .
CVT1 child B .
CVT1 on_date 9/12/2014 .
CVT1 adoption_organizer ....

and so on. Another good rule of thumb is that links should always point out of a CVT, so flip that first triple to be

CVT1 adoptive_parent A .

...but then you say, well, that sucks, what was a simple relationship is now two relationships. In order to get adopted children, I need to run a query like:

g.V("A").In("adoptive_parent").Out("child").All()

Which does work. This is where Freebase actually just kind of threw up its hands and expected you to inspect the schema (and had reverse properties and other such things to ease the pain). So what does Cayley do about this? Today, nothing special. But there's a huge opportunity for learning from the past here; rather than implement reverse properties, long term, the right approach is to invent some Cayley schema to store morphisms. Because already today you can do this query:

var adoptive_children = g.M().In("adoptive_parent").Out("child")
g.V("A").Follow(adoptive_children).All()

But if you could write some more data in your graph (graphs should be self-describing, another topic for another day) to declare this path as one you wanted to shorthand, the resulting query would be as pretty as the original. (Note for old FB fans: Reverse property is just a special case of this for g.M().In("forward_property"), baked into MQL.)
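
The CVT-plus-path idea can be tried outside Cayley with a toy in-memory model. This is only a sketch: `stepIn`, `stepOut`, and `follow` are hypothetical stand-ins that imitate the spirit of In(), Out(), and Follow(), and CVT2 and C are extra made-up data added so the query returns more than one child.

```javascript
// Triples from the adoption example, with every link pointing out of the CVT.
const adoptionTriples = [
  ["CVT1", "type", "adoption"],
  ["CVT1", "adoptive_parent", "A"],
  ["CVT1", "child", "B"],
  ["CVT1", "on_date", "9/12/2014"],
  ["CVT2", "type", "adoption"],
  ["CVT2", "adoptive_parent", "A"],
  ["CVT2", "child", "C"],
];

// One traversal step each: follow `predicate` edges backward (object to subject)
// or forward (subject to object) from the current set of nodes.
const stepIn = (predicate) => (nodes) =>
  adoptionTriples
    .filter(([, p, o]) => p === predicate && nodes.includes(o))
    .map(([s]) => s);
const stepOut = (predicate) => (nodes) =>
  adoptionTriples
    .filter(([s, p]) => p === predicate && nodes.includes(s))
    .map(([, , o]) => o);

// A reusable path is just a list of steps, applied in order from a start node.
const follow = (path, start) => path.reduce((nodes, step) => step(nodes), [start]);

// The two-hop path is named once and reused like a single relationship.
const adoptiveChildren = [stepIn("adoptive_parent"), stepOut("child")];
console.log(follow(adoptiveChildren, "A"));

// Data "on the link" lives on the CVT, so it is one more hop away.
const adoptionDateOfB = follow([stepIn("child"), stepOut("on_date")], "B");
console.log(adoptionDateOfB);
```

Naming the path once keeps the call site as short as a direct edge would be, while the date and any other details about the adoption stay attached to the CVT node.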

Hopefully this discussion helps a bit. Feel free to follow up with more questions!

(Incidentally, someday I'd love to write a book on the subject of graphs, schema, and the like, but there's more code to write.)

--Barak

* Triples vs Quads is another discussion -- the fourth element is the graph label. Officially, Cayley is a quad store, but everything I just said holds if you limit yourself to triples (which Cayley handles naturally; it's a proper subset).


neonst...@gmail.com

Oct 13, 2014, 5:29:32 PM
to cayley...@googlegroups.com, neonst...@gmail.com
thanks for your help barak.

i'm having some success pursuing the idea of using CVTs to provide a way to store more descriptive data about a relationship.

i now want to follow up with a few more questions about manipulating data in a graph db.

* how should i approach updating data via the HTTP API? for example, let's say that i've got a person with predicates like height and weight and over time those values change so i want to update them. i'm assuming that i need to explicitly delete the quads that represent the current values and then write quads that represent the new values. if that's so, i have 2 pain points here
1. i need to know the previous values in order to delete them
2. updating is not atomic
do you already have something in mind for the future of cayley that may address these points?
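
the delete-then-write update described above can be sketched against a toy in-memory quad store. everything here is an assumption made for illustration: the store is a plain array, `updateValue` is a made-up helper, and nothing in it is cayley's HTTP API.

```javascript
// Toy quad store: an array of {subject, predicate, object, label} records.
const personQuads = [
  { subject: "/person/123", predicate: "height", object: "1.80", label: "" },
  { subject: "/person/123", predicate: "weight", object: "75", label: "" },
];

// Replace every current value of (subject, predicate) with newObject.
// Returns a new array and leaves the input store untouched.
function updateValue(store, subject, predicate, newObject) {
  // Step 1: look up the current values, because a delete needs the full quad.
  const current = store.filter(
    (q) => q.subject === subject && q.predicate === predicate
  );
  // Step 2: delete exactly those quads.
  const remaining = store.filter((q) => !current.includes(q));
  // Step 3: write the quad holding the new value.
  remaining.push({ subject, predicate, object: newObject, label: "" });
  return remaining;
}

const updatedQuads = updateValue(personQuads, "/person/123", "height", "1.82");
console.log(updatedQuads);
```

in this toy version the three steps happen inside one function call. against a remote store they would be separate requests, which is where both pain points come from: step 1 is the extra read, and a reader arriving between steps 2 and 3 would see no height at all.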

* how do i easily delete an "entity"? continuing with the person example, if a person has predicates for height, weight, name, dob, and follows then it seems like currently i would need to query for all of those predicates to get the objects in order to be able to generate the necessary quads to delete those values. in addition, if the follows predicate pointed to a CVT with predicates like follower, following, and since_date then i would also need to query for those in order to delete them. i've seen https://github.com/google/cayley/issues/123 and the corresponding PR to try and move that forward and the thought just occurred to me that for the HTTP API maybe an MQL-based delete could be useful for describing which predicates to delete.

[{
  "id": "/person/123",
  "height": null,
  "weight": null,
  "name": null,
  "dob": null,
  "follows": [{ "follower": null, "following": null, "since_date": null }]
}]

in case i've got my head tied in knots and that MQL seems like nonsense, the idea is that any quads matching the MQL query would be deleted.
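
a rough sketch of how such a template-driven delete could behave, run against a toy in-memory array of triples. this is purely hypothetical: `quadsToDelete` and `deleteMatching` are made-up helpers, the matching rules are a simplification rather than real MQL semantics, and no such feature is implied to exist in cayley.

```javascript
// Toy store of [subject, predicate, object] triples; /follow/1 plays the CVT.
const entityStore = [
  ["/person/123", "height", "1.85"],
  ["/person/123", "name", "Ben"],
  ["/person/123", "follows", "/follow/1"],
  ["/follow/1", "follower", "/person/123"],
  ["/follow/1", "following", "/person/456"],
  ["/follow/1", "since_date", "2014-10-13"],
  ["/person/456", "name", "Barak"],
];

// Collect every triple the template describes. A null value means "any object";
// a nested [{...}] means the objects are nodes whose own predicates also match.
function quadsToDelete(store, template) {
  const doomed = [];
  const visit = (subject, shape) => {
    for (const [predicate, value] of Object.entries(shape)) {
      if (predicate === "id") continue; // "id" names the subject, not a predicate
      const matches = store.filter(([s, p]) => s === subject && p === predicate);
      doomed.push(...matches);
      if (Array.isArray(value) && value.length > 0) {
        for (const [, , object] of matches) visit(object, value[0]);
      }
    }
  };
  visit(template.id, template);
  return doomed;
}

// Return a new store without the matched triples.
function deleteMatching(store, template) {
  const doomed = quadsToDelete(store, template);
  return store.filter((triple) => !doomed.includes(triple));
}

const deleteTemplate = {
  id: "/person/123",
  height: null,
  weight: null,
  name: null,
  dob: null,
  follows: [{ follower: null, following: null, since_date: null }],
};

console.log(deleteMatching(entityStore, deleteTemplate));
```

predicates in the template that match nothing (weight and dob here) are simply ignored, and triples belonging to other entities are left alone, so the caller never has to know the current values up front.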

thanks for your help.

ben...

haskel...@gmail.com

Oct 28, 2016, 3:57:24 AM
to cayley-users, neonst...@gmail.com
On Tuesday, October 14, 2014 at 5:29:32 AM UTC+8, Ben Hockey wrote:
Hi, have you found an approach? I have exactly the same problem: I need to update a property of a relationship.