Automatically Creating/Parsing inverse relationships

381 views
Skip to first unread message

Michael Fair

unread,
Sep 3, 2014, 2:57:37 AM9/3/14
to ne...@googlegroups.com
I did some searching but couldn't find any discussion on the topic yet.

Has anyone considered enabling multiple type labels for a relationship type where the label name used in the query implies the directionality of the relationship?  What are the caveats/pitfalls/complications of allowing multiple labels to refer to the same relationship type?

For example, assume we've created a relationship called "CHILD_OF".
(c)->[:CHILD_OF]->(p)

This has the implied inverse relationship of "PARENT_TO"
(p)<-[:PARENT_TO]<-(c)

Currently we'd have to define two separate relationships to be able to use either form of those queries to get back the same information.

I'm thinking it'd be really great if we could give relationships multiple type names; I'm thinking 2 to start; one for each direction when defining the link. 
In other words, in addition to having a forward direction name, we could also define the reverse direction name as part of the relationship type.


I don't have an exact syntax for representing this yet, but I'm thinking of using an optional '/' on the relationship name when assigning the relationship to the node.

For instance, similar pairings exist in practically all hierarchical relationships:
[:OWNER_OF/OWNED_BY {}]
[:CONTAINS/INSIDE_OF {}]
[:COVERED_BY/COVERING {}] 

as well as the roles of the parties in a two-way relationship/transaction:
[:FOLLOWING/FOLLOWED_BY {}]
[:EMPLOYS/EMPLOYED_BY {}]
[:ABOVE/BELOW {}]


I'd like to hear people's thoughts on syntax, implementation difficulty, and just comments in general on how useful or not this kind of naming could be.
Assuming it's useful, is it better to go for a full featured system that's actually useful and makes it super easy rather than a half featured system that's cumbersome and therefore doesn't actually get used by anyone?


I have some ideas for the implementation to get started that avoids any on-disk format changes (the basic concept is to store the labels exactly as typed above (complete with the slashes) as the actual, official, on-disk type name; then when loading the name into memory, add a "direction" bit to in-memory relationship types structure and parse out the on-disk name to create two entries that both refer back to the same relationship type identifier; just that one now implies using reversed terms).


Would people still be able to use the feature if the returned results were always expressed in terms of the forward relationship?
It's obviously ideal if the results preserve the name and direction used in the original query; so if we asked for "CHILD_OF" or "PARENT_TO" in the query, then that's what we'd get back.  However if the inverse labels are implemented as just a syntactic shortcut for reversing the terms on the forward query like I'm thinking, preserving the original context could get problematic. 
Would it still be useful if someone expressed "PARENT_TO" in the query, but then got back "CHILD_OF" relationship results?
I don't think it would because whatever was receiving the results would be expecting what it asked for, but I'd like to hear other people's comments on that.
(Assuming it's not useful, then I'd welcome ideas for how to keep the original context from the original query.)



Lastly, a slightly more complex syntax/implementation would be to allow multiple alternates and embed the direction directly in the alternate labels.
I was considering using multiple slashes with <- and -> on the front of the label to define the direction of the alternate links.

For example, these would all become different relationship types:
[:SIBLING//<-BROTHER/->SISTER {}]
[:SIBLING//<-BROTHER/->BROTHER {}]
[:SIBLING//<-SISTER/->SISTER {}]

[:SPOUSE/<-SPOUSE/<-HUSBAND/->WIFE {}]
[:SPOUSE/<-SPOUSE/<-WIFE/->WIFE {}]
[:SPOUSE/<-SPOUSE/<-HUSBAND/->HUSBAND {}]

This says that SIBLING and SPOUSE should treat the from/to nodes as is, as should all the labels using the "->" prefix.
All labels using the "<-" prefix should reverse the context of the from/to nodes when being used in a query.


Also, each of the labels is used multiple times.
E.g. when there's a query for a "SIBLING" or "SPOUSE" relationship that would become an alternation for all three of the respective relationship types.
Querying for "Brother", "Sister", "Husband", or "Wife" would likewise become an alternation for all of the appropriate relationship types.

The idea being we can express multiple names and the relationship direction those relationships imply.


Thanks,
Mike

Mark Findlater

unread,
Sep 3, 2014, 10:29:34 AM9/3/14
to ne...@googlegroups.com
Hey Mike I think that this is interesting and it reminds me of using the OWL inverseOf property relationship and other features of the inferencing engine (overviewspec). Or stepping further back (in my learning) defining rules in Prolog.

I guess that my question as a developer is why I would want the inverse relationship label defined? If I can write a CYPHER query I can represent the nodes and relationships in the query as I like, is it just the representation in the console that you find unintuitive, or is that when opening the console up to users it can be confusing to say if you want to find the parents of child X then you need to follow the HAS_CHILD relationship 'backwards'? I confess I would hide most of this behind an abstraction layer of some sort.

With the triple stores I have used there is usually an inference engine which then expose to queries the union of both the asserted triples (the data that you explicitly added) and the inferred triples (the ones that rules were used to create). The important thing about the inferred triples is that they are ephemeral meaning that they do not impact the underlying datastore (although you could choose the materialize them) which stops unnecessary bloat (at the cost of processing).

As these new relationships are sugar which is ultimately bloat would you ever want to persist them? I also suspect that the directional trick you suggest could have serious implications to the performance of the query engine (but that's just a hunch).

The Sibling/Spouse example has limited mileage for me without some clever chaining (e.g. how would I represent cousins? Harden SIBLING and PARENT/SPOUSE relationships and then chain SIBLING->PARENT->SIBLING->CHILD nodes, hardening along the way?) and again I think I would see more value in adding this a level above the underlying store.

Sporadic rambling done, as Neo is super fast at graph traversals (really??) I personally do not see much value add unless abstracted away from Core Neo, and until such time I would be happy to use Cypher to create the new relationships when required.

M

Michael Fair

unread,
Sep 3, 2014, 8:52:57 PM9/3/14
to ne...@googlegroups.com

 
On Wednesday, September 3, 2014 3:29:34 AM UTC-7, Mark Findlater wrote:
Hey Mike I think that this is interesting and it reminds me of using the OWL inverseOf property relationship and other features of the inferencing engine (overviewspec). Or stepping further back (in my learning) defining rules in Prolog.

This is definitely about making the schema/engine smarter about at "semantics".

 
I guess that my question as a developer is why I would want the inverse relationship label defined?  

I can see where a single app written by a single developer or two who are all the combined schema master, data integrity maintainer, and code author likely wouldn't need "help" to translate into their schema to query it; they wrote it and likewise if they want to extend the app, they can just write more code to extend the app; if they need a new relationship, they just invent it.  I'd argue that even at that small scale, while it's not "needed" it's still helpful.

 
With the triple stores I have used there is usually an inference engine which then expose to queries the union of both the asserted triples (the data that you explicitly added) and the inferred triples (the ones that rules were used to create). The important thing about the inferred triples is that they are ephemeral meaning that they do not impact the underlying datastore (although you could choose the materialize them) which stops unnecessary bloat (at the cost of processing).

Well storage-wise, the db is storing a numeric relationship type identifier, and then separately storing the relationship type label information. 

So storage-wise, adding multiple relationship type entries for the same type id isn't a problem.
The on-disk structure is totally capable of handling this, it's just changing the type of block pointed to in the RelatioshipTypeStore from String to String[].
It's the code that loads this from the disk that would need to be updated to understand it's getting a String[] back and not just a String.

I'm now however convinced that updating the code to detect a String versus a String[] in the RelationshipTypeStore is going to be way easier than all the splitting and parsing gymnastics I was thinking of earlier! :)
Though I still might encode the "direction" as part of the string.


As these new relationships are sugar which is ultimately bloat would you ever want to persist them? I also suspect that the directional trick you suggest could have serious implications to the performance of the query engine (but that's just a hunch).

We can query (a)<-[:REL]-(b) for the same performance as (a)-[:REL]->(b) so the only trick is getting the planner to know what's being expressed, and that's what I think the parser's job is.

All relationships in neo4j have a from/to direction by design.  The label for that direction is rather arbitrary created at design time. Being able to define the inverse labels for a relationship type eliminates that required design time arbitrary selection and I can't see how it has any performance impact.  The planner still knows it's looking for a numeric relationship id (e.g. 62), and what it needs to know about from and to; I suspect most of the heavy lifting on that was created in the parser.  From what I've gathered a "directionless" relationship match actually becomes two matches under the hood, one in each direction.
This would just make a similar kind of translation.
 
 

The Sibling/Spouse example has limited mileage for me without some clever chaining (e.g. how would I represent cousins? Harden SIBLING and PARENT/SPOUSE relationships and then chain SIBLING->PARENT->SIBLING->CHILD nodes, hardening along the way?) and again I think I would see more value in adding this a level above the underlying store.

You jumped the gun on me! :)
I'm actually working on a "relationships as sub-queries" plan and [:COUSIN], [:NIECE], [:NEPHEW] are perfect examples of that.
I guess I just ought to have simply proposed these named sub-queries as part of the initial discussion.  :)

It's cool that you used the PARENT/CHILD pairing in the example (n)-[:PARENT]->(p)-[:SIBLING]->(s)-[:CHILD]->(c). :)
I see using the PARENT/CHILD pair as way more straightforward to read/follow than (n)-[:PARENT]->(p)-[:SIBLING]->(s)<-[:PARENT]-(c). :)


So let's talk about how this can be used to create the (n)-[:COUSIN]->(c) query;
This would be a new kind of relationship type.
If the Cousin type captured a MATCH query like from above, then it would be a kindred spirit to a SQL VIEW or FUNCTION as part of the DB schema.  Again it's primarily just creating a syntactic shortcut for a larger query; but it's one that makes the database more comprehensible to those interfacing with it.

This new relationship type could even be used to create a really powerful result set reduction replacement function.
Imagine taking a graph result from the prior query and running the [:COUSIN] relationship replacement function on it.

So if we take a row from the results from the above query,
e.g. (r1) = "ME", [:PARENT], "MYDAD", [:SIBLING], "DAD'S BROTHER", [:CHILD], "DAD'S BROTHER'S DAUGHTER"

and run GraphMapReduce(r1, [:COUSIN]), then anywhere the intermediate nodes match the query provided by [:COUSIN] they get replaced with the relationship [:COUSIN]. 
Giving us:
(r2) = "ME"-[:COUSIN]->"DAD'S BROTHER'S DAUGHTER"

If the [:COUSIN] type also provided some kind of MAP/REDUCE type functions as part it's description/definition, then it could build even its own properties dynamically from the underlying properties on the intervening nodes.

This would be especially useful if we were to create a "correct" implementation of Cousin which really means everyone you are directly related to through marriage or parentage.  Your third cousin twice removed for example is still technically a "Cousin", the map/reduce function on "Cousin" could take the intervening Path P, count the number of Parent hops involved to get the cousin's "order" and then count the balance of the parent links up/down to get the cousin's "removal" and make those both properties part of the "COUSIN" relationship connecting us.

So the idea is that you pass a sub-graph or a result set to a named relationship type (which is actually a query).
It operates on each row in the result set, when the row matches the query, the links between the matching endpoint nodes are replaced by the relationship and the results of the relationship query become properties on the relationship.

So the final results of this would be that we started with this:
(r1) = "ME", [:PARENT], "MYDAD", [:SIBLING], "DAD'S BROTHER", [:CHILD], "DAD'S BROTHER'S DAUGHTER"

and reduced it to this:
(r3) = "ME"-[:COUSIN {order: 1, removed: 0}]->"DAD'S BROTHER'S DAUGHTER"
This, at least in my mind's eye, would involve adding a new field to the relationship type to persist across reboots (though technically it is still just a "String").

Mark Findlater

unread,
Sep 4, 2014, 9:26:39 AM9/4/14
to ne...@googlegroups.com
Hey Michael, I think I have entered a discussion that I am under qualified for, but I'll keep running! I think I'll jump the top section regarding the under the hood stuff as I do not know enough about how Neo maintains it's indexes, how the query/relationship/object caches work and whether there is any performance impact of creating twice as many relationships (I expect that if I do not get smarter with my data I would reach maximum capacity of Neo in 3 years, if I doubled the relationships I suppose I would half that, but that is very naive!). 

Running with a team of developers I would not expect all or even many of them to be operating on raw queries and similar to not producing inverse methods all over the place (although you could cite Commons StringUtils isEmpty/isNotEmpty type design) I expect people to be comfortable with negation, or applying more complex logical analysis. Which is what I meant about exposing the raw query interface to the world at which point I might expect less of what people are comfortable with!

I do think that it is a very interesting idea though and expressing (either through or above Core Neo) the semantics but not persisting them presenting a richer interface to all and sundry is a worthwhile discussion. If you could define the inferred relationships such that the planner could use them (he said :SIBLING but he means :SISTER or :BROTHER) or, as you touched on below defining the rules and then explicitly running a MapReduce type function to expose the data sound very interesting. Now I'll comment in line..

On Wednesday, 3 September 2014 21:52:57 UTC+1, Michael Fair wrote:

 
On Wednesday, September 3, 2014 3:29:34 AM UTC-7, Mark Findlater wrote:
Hey Mike I think that this is interesting and it reminds me of using the OWL inverseOf property relationship and other features of the inferencing engine (overviewspec). Or stepping further back (in my learning) defining rules in Prolog.

This is definitely about making the schema/engine smarter about at "semantics".

 
I guess that my question as a developer is why I would want the inverse relationship label defined?  

I can see where a single app written by a single developer or two who are all the combined schema master, data integrity maintainer, and code author likely wouldn't need "help" to translate into their schema to query it; they wrote it and likewise if they want to extend the app, they can just write more code to extend the app; if they need a new relationship, they just invent it.  I'd argue that even at that small scale, while it's not "needed" it's still helpful.

 
With the triple stores I have used there is usually an inference engine which then expose to queries the union of both the asserted triples (the data that you explicitly added) and the inferred triples (the ones that rules were used to create). The important thing about the inferred triples is that they are ephemeral meaning that they do not impact the underlying datastore (although you could choose the materialize them) which stops unnecessary bloat (at the cost of processing).

Well storage-wise, the db is storing a numeric relationship type identifier, and then separately storing the relationship type label information. 

So storage-wise, adding multiple relationship type entries for the same type id isn't a problem.
The on-disk structure is totally capable of handling this, it's just changing the type of block pointed to in the RelatioshipTypeStore from String to String[].
It's the code that loads this from the disk that would need to be updated to understand it's getting a String[] back and not just a String.

I'm now however convinced that updating the code to detect a String versus a String[] in the RelationshipTypeStore is going to be way easier than all the splitting and parsing gymnastics I was thinking of earlier! :)
Though I still might encode the "direction" as part of the string.


As these new relationships are sugar which is ultimately bloat would you ever want to persist them? I also suspect that the directional trick you suggest could have serious implications to the performance of the query engine (but that's just a hunch).

We can query (a)<-[:REL]-(b) for the same performance as (a)-[:REL]->(b) so the only trick is getting the planner to know what's being expressed, and that's what I think the parser's job is.

All relationships in neo4j have a from/to direction by design.  The label for that direction is rather arbitrary created at design time. Being able to define the inverse labels for a relationship type eliminates that required design time arbitrary selection and I can't see how it has any performance impact.  The planner still knows it's looking for a numeric relationship id (e.g. 62), and what it needs to know about from and to; I suspect most of the heavy lifting on that was created in the parser.  From what I've gathered a "directionless" relationship match actually becomes two matches under the hood, one in each direction.
This would just make a similar kind of translation.
 
 

The Sibling/Spouse example has limited mileage for me without some clever chaining (e.g. how would I represent cousins? Harden SIBLING and PARENT/SPOUSE relationships and then chain SIBLING->PARENT->SIBLING->CHILD nodes, hardening along the way?) and again I think I would see more value in adding this a level above the underlying store.

You jumped the gun on me! :)
I'm actually working on a "relationships as sub-queries" plan and [:COUSIN], [:NIECE], [:NEPHEW] are perfect examples of that.
I guess I just ought to have simply proposed these named sub-queries as part of the initial discussion.  :)

It's cool that you used the PARENT/CHILD pairing in the example (n)-[:PARENT]->(p)-[:SIBLING]->(s)-[:CHILD]->(c). :)
I see using the PARENT/CHILD pair as way more straightforward to read/follow than (n)-[:PARENT]->(p)-[:SIBLING]->(s)<-[:PARENT]-(c). :)
 
It is easier on the eye to read a query that draws you from left to right (in your second example above it does appear as if (s) should bee the subject of or attention), but we are discussing orders of beauty on a data interface that is already extremely intuitive (not that i am advocating procrastination) .


So let's talk about how this can be used to create the (n)-[:COUSIN]->(c) query;
This would be a new kind of relationship type.
If the Cousin type captured a MATCH query like from above, then it would be a kindred spirit to a SQL VIEW or FUNCTION as part of the DB schema.  Again it's primarily just creating a syntactic shortcut for a larger query; but it's one that makes the database more comprehensible to those interfacing with it.

I've banged on about not worrying about comprehensibility, because I think that we would be masking some of the super powers of graphs by concerning ourselves with directional semantics at the query interface level. However distilling larger traversals into a single pseudo relationship *could* make representing complex queries more manageable (whilst potentially masking some things you might have wanted, i.e cousin on which side?).
 

This new relationship type could even be used to create a really powerful result set reduction replacement function.
Imagine taking a graph result from the prior query and running the [:COUSIN] relationship replacement function on it.

So if we take a row from the results from the above query,
e.g. (r1) = "ME", [:PARENT], "MYDAD", [:SIBLING], "DAD'S BROTHER", [:CHILD], "DAD'S BROTHER'S DAUGHTER"

and run GraphMapReduce(r1, [:COUSIN]), then anywhere the intermediate nodes match the query provided by [:COUSIN] they get replaced with the relationship [:COUSIN]. 
Giving us:
(r2) = "ME"-[:COUSIN]->"DAD'S BROTHER'S DAUGHTER"

If the [:COUSIN] type also provided some kind of MAP/REDUCE type functions as part it's description/definition, then it could build even its own properties dynamically from the underlying properties on the intervening nodes.

This would be especially useful if we were to create a "correct" implementation of Cousin which really means everyone you are directly related to through marriage or parentage.  Your third cousin twice removed for example is still technically a "Cousin", the map/reduce function on "Cousin" could take the intervening Path P, count the number of Parent hops involved to get the cousin's "order" and then count the balance of the parent links up/down to get the cousin's "removal" and make those both properties part of the "COUSIN" relationship connecting us.

So the idea is that you pass a sub-graph or a result set to a named relationship type (which is actually a query).
It operates on each row in the result set, when the row matches the query, the links between the matching endpoint nodes are replaced by the relationship and the results of the relationship query become properties on the relationship.

So the final results of this would be that we started with this:
(r1) = "ME", [:PARENT], "MYDAD", [:SIBLING], "DAD'S BROTHER", [:CHILD], "DAD'S BROTHER'S DAUGHTER"

and reduced it to this:
(r3) = "ME"-[:COUSIN {order: 1, removed: 0}]->"DAD'S BROTHER'S DAUGHTER"
This, at least in my mind's eye, would involve adding a new field to the relationship type to persist across reboots (though technically it is still just a "String").

 This is the interesting bit for me, something that you already implied it by calling your Class above the GraphMadReducer would be standardising on how analytic or inferencing models could be applied to your graph and then using this standardised model to build relationships with Hadoop or Spark or your new Apache incubating distributed processing for Graph stores project. Having a local MapReduce function is nice, but it's essentially a fancy name for a CREATE query, MapReduce implies a chunked & distributed in nature. If a job can be described (be it in Cypher, Gremlin, XML, JSON, javascript, R, whatever) thrown at a cluster, chunked up and processed, the results being either returned, stored back to the same graph or stored into a new graph then that's powerful. Sure, you'd use it locally first!

What I think would be extra smart about breaking this into a new project is the evolution of an ecosystem where not everything is bundled into the core project taking it beyond what I see as it's job, but that interested parties start building function around it that leverage it's power.

I've not used it, but doesn't TinkerPop have some sort of Map Reduce type functionality?

M

Michael Fair

unread,
Sep 10, 2014, 3:14:33 AM9/10/14
to ne...@googlegroups.com

On Thursday, September 4, 2014 2:26:39 AM UTC-7, Mark Findlater wrote:
hood stuff as I do not know enough about how Neo maintains it's indexes, how the query/relationship/object caches work and whether there is any performance impact of creating twice as many relationships (I expect that if I do not get smarter with my data I would reach maximum capacity of Neo in 3 years, if I doubled the relationships I suppose I would half that, but that is very naive!). 

One possible confusion point; the process wouldn't create twice as many relationships between nodes, or even twice as many relationship types; it would only create twice as many relationship type labels (and that's assuming every type got an inverse defined on it (which not everything would)). 

So maybe you go from 50 type labels or 100 or even 500 type labels to twice that but the actual data in the graph itself doesn't change at all, there's just a few more string metadata entries about the graph.  I know you get this as what's meant by "not-persisting" the information but I wanted to be clear.  We have to persist _something_; so this process is only persisting additional labels for the relationship types.

That said; this whole conversation, especially your comments about formalizing this into a distributed processing system, has got me thinking that what we really ought to do here is take on thinking of every relationship type as a named sub-query and existing types are a specific trivial case.  See below.

 
Running with a team of developers I would not expect all or even many of them to be operating on raw queries and similar to not producing inverse methods all over the place (although you could cite Commons StringUtils isEmpty/isNotEmpty type design) I expect people to be comfortable with negation, or applying more complex logical analysis. Which is what I meant about exposing the raw query interface to the world at which point I might expect less of what people are comfortable with!

That's interesting, because I actually expect less complex and domain specific applications that expose more of the framework that enable repeating common usage patterns easy, making it easier for users to explore, navigate, add and composite the data and create new forms within their own graphs. :)
The application developers are going to know less and less about the specific schemas their users are using over time and focus more and more on providing good analysis and patterning tools that enable users to take more advantage of their graphs.  :)

 
If you could define the inferred relationships such that the planner could use them (he said :SIBLING but he means :SISTER or :BROTHER) or, as you touched on below defining the rules and then explicitly running a MapReduce type function to expose the data sound very interesting.

So I think this where you've struck upon the BIG IDEA!
I'm so glad you mentioned this as you're right; I was thinking way too small with this idea! 
You're spot on in thinking this should help out the distributed applications of the "Query Graph Search/Reducer" application.

If :SIBLING was one of these new "types as sub-query" things; and it was defined as something like
[s:SIBLING]
    MATCH path = (n)<-[:CHILD_OF]-(parent)-[:CHILD_OF]->(m)
    WHERE n != m
    SET s.gender = m.gender, s.parents = collect(DISTINCT parent)
    RETURN n, s, m

then it could absolutely work as you described. 

And as :SIBLING is now a named query type; defined directly in the database; if the db architect (or stats tracker process in the DB engine) decided this relationship was something worth optimizing for, then there's probably a way the engine could treat it like both an INDEX and a RELATIONSHIP, where the [:SIBLING] relationship can be dynamically rebuilt and maintained as needed.
Anytime a [:CHILD_OF] relationship is created/deleted, the engine could know/detect that it needs to update the [:SIBLING] index on the updated nodes.

Building on this idea, the [:SIBLING] relationship could then be used to then define other relationships like [:SISTER] and [:BROTHER].
[:SISTER] <= MATCH (n)<-[s:SIBLING {gender: "female}]-(m) RETURN s.parents
[:BROTHER] <= MATCH (n)<-[s:SIBLING {gender: "male}]-(m) RETURN s.parents
[:STEPSIBLING] <= MATCH (n)<-[s:SIBLING]-(m)-[:PARENT]->(p) WHERE p not in s.parents RETURN s.gender, s.parents

===============================
Now that we've got queries; let's talk about reducing the results.
Query:
MATCH path = (n)<--(p)-->(m)
RETURN graphReduce(path, [:BROTHER|:SISTER|:SIBLING])

So this would take a result set that starts like this [from (n)<--(p)-->(m)]:
(Johnny {gender: male})<-[:STUDENT_OF]-(Sandra)-[:STUDENT_OF]->(Sally {gender: female})
(Johnny {gender: male})<-[:CHILD_OF]-(Bob)-[:CHILD_OF]->(Sally {gender: female})
(Johnny {gender: male})<-[:CHILD_OF]-(Alice)-[:CHILD_OF]->(Sally {gender: female})
(Sally {gender: female})<-[:CHILD_OF]-(Bob)-[:CHILD_OF]->(Johnny {gender: male})
(Sally {gender: female})<-[:CHILD_OF]-(Alice)-[:CHILD_OF]->(Johnny {gender: male})

and return:

(Johnny {gender: male})-[:SISTER {parents: (Bob, Alice)}]->(Sally {gender: female})
(Johnny {gender: male})-[:SIBLING {gender: female, parents: (Bob, Alice)}]->(Sally {gender: female})
(Sally {gender: female})-[:BROTHER {parents: (Bob, Alice)}]->(Johnny {gender: male})
(Sally {gender: female})-[:SIBLING {gender: male, parents: (Bob, Alice)}]->(Johnny {gender: male})

I don't have the exact semantics nailed down for what happens if multiple relationship matches are found, I think it should just collect them and return them all in no particular order as that is likely the most parallelizable and distributed operation, as well as the most to definable.

So using this model, [:PARENT_TO] (the original inverse relationship goal) could be something like:
[p:PARENT_TO] <=
    MATCH (m)<-[c:CHILD_OF]-(n)
    SET p = c
    RETURN n, p, m

What do you think?

Mike

Peter Hunsberger

unread,
Sep 11, 2014, 2:09:32 AM9/11/14
to ne...@googlegroups.com
This caught my eye because it touches on an area that I have a deep interest in, namely the higher levels of classification of data sets and the generalization of metadata as data within graph databases.  The addition of higher level node types rapidly gets into the exploration of hypergraphs  and you may want to Google a bit on some of those implementations and the resultant implications. In particular, super nodes in any large graph.  IE; in your example, gender would divide any normally distributed population in half and not be much use for  partitioning. However, I digress... 

The whole concept of overlaying metadata on top of another graph has several standard use cases: type systems, security / authorization systems, schema, ontologies , etc.   The backwards derivation of the latter of these is particularly interesting.  Consider a small graph that builds on your example: SIBLING, PARENT, MOTHER, FATHER, BROTHER, SISTER,  etc.  It's easy to see how a separate graph of these relationships could be manually defined and the edges from these to some population might result in useful indexes to optimize searches.  More interesting would be watching queries go by and deriving these concepts: an optimizer / classifier might not know the name of the resultant concepts, but persisting the repeatedly used sub-query results as an index would yield metadata that can be used for further analytics if one so desired.  EG; we may not have a simple name for Friend of a Friend once removed (like 2nd cousin) but the concept may still be useful within some domain...

I think the key here would be identification of common or dense sub-query results.  Typically, that might start as some form of cache where we can spot high reuse of some set of items  The grouping of similar versions of these cached items would then start to point at patterns (eg. all sub query results using the SIBLING relationship)  Manual inspection would be the last step: these two subgroups statistically divide this population in half, further inspection shows that is because they diverge on the GENDER.  We now can label these sub-query results as BROTHER and SISTER.. Under the covers "naming" these results might actually be the naming of nodes of a metadata graph or hyper graph that is associated with the original graph and this is in turn initially a way of identifying the subquery results the system found to have a high incidence in the cache.

I'm mostly guessing at possible implementations, but maybe food for thought for someone...

Peter Hunsberger

--
You received this message because you are subscribed to the Google Groups "Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email to neo4j+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Michael Fair

unread,
Sep 11, 2014, 3:27:19 AM9/11/14
to ne...@googlegroups.com

Definitely food for thought and the ideas definitely have merit!  I was once developing an algorithm to identify groups based on clusters of in-common friends links; and as you touched in your post; naming/identifying these fledging anonymous groups was one of the first things I had to deal with. I hadn't even considered adding an optimizer/classifier to detect often matched relationships.  I mean it makes sense "these two nodes with this intervening path comes up often in queries; perhaps we can ask to name it?"

The other cool thing you touched on was a practical mechanism for how to track and manage these derived graphs; basically "create a new graph for them!".  That's way better (and not too mention easier) than what I was thinking of.  I could even see using the hooks mechanism to create and maintain these derived graph links.

As far as applications go; might brain breaks if I think about it too much; the implications are way too staggering.  Especially if the thing actually performs even half-way decently!  What's also great is family trees are complicated enough; yet has answers that are thouroughly comprehended enough to model the "correct" behaviour for a lot of this derived link stuff!

In the meantime, what's fantastic is I think neo4j already has the necessary query infrastructure for expanding these sub-queries at runtime through CYPHER's support of multiple starting points using aggregates from earlier clauses.  I'm thinking that's going to help.

Mike

You received this message because you are subscribed to a topic in the Google Groups "Neo4j" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/neo4j/swNSvsKtqf8/unsubscribe.
To unsubscribe from this group and all its topics, send an email to neo4j+un...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages