I want a query which returns a subgraph...

3,251 views
Skip to first unread message

Alan Robertson

unread,
Apr 25, 2016, 10:29:01 AM4/25/16
to Neo4J
Hi,

Is there a way to specify a query which returns an intact subgraph?

For example:

starting from node "X" (by some specification), return the subgraph reachable through a collection of relationship types {T}, through some maximum distance (d), return a JSON representation of this subgraph - possibly with some WHERE specifications, but possibly not. In our nodes we include a nodetype in every node, we might want to say "WHERE node.nodetype IN {N}" as our where clause.

Result should be in some JSON graph format, perhaps like JSON Graph Format (JGF) http://jsongraphformat.info/ or https://github.com/bruth/json-graph-spec.

This is exactly what's needed for visualization of a subgraph from a Javascript program. Since Neo4j graphs are often huge, even monstrous, having a query that does this would be incredibly handy...

Otherwise you have to go through and put this graph together a piece at a time (or so it seems to me).

I think I can write a query which returns  all the paths through this subgraph, but then I have to do the duplication elimination and so on myself. A single node might be returned many times in the possible paths result. Not an ideal situation...

This seems like a very common thing to want to do...


--

Alan Robertson / CTO
Al...@AssimilationSystems.com / +1 303.947.7999

Assimilation Systems Limited
http://AssimilationSystems.com

Twitter Linkedin skype

Alan Robertson

unread,
Apr 28, 2016, 10:00:02 AM4/28/16
to ne...@googlegroups.com, Michael Hunger, Nigel Small
Hi,

Michael Hunger separately provided me this query - which is a prototype query which looks like it will do what I want. It includes features of Cypher I was not familiar with (particularly unwind):

....
match path = shortestPath( (n)-[*]->(m) )
unwind nodes(path) as n
with collect(distinct n) as nodes
unwind rels(path) as r
return collect(distinct r), nodes
If I understand it correctly, this returns a single tuple with two elements:
  1. List of all individual relationships (no duplicates)
  2. List of all individual nodes (no duplicates)

I'll have to try this out!

I kind of wonder how this gets mapped into py2neo - what type are the tuple elements -- list(Relationship) and list(Node) or something like that...

Nigel?

I know my code won't currently handle this - but I can fix that ;-)

    -- Alan



On 04/25/2016 08:28 AM, Alan Robertson wrote:
Hi,

Is there a way to specify a query which returns an intact subgraph?

For example:

starting from node "X" (by some specification), return the subgraph reachable through a collection of relationship types {T}, through some maximum distance (d), return a JSON representation of this subgraph - possibly with some WHERE specifications, but possibly not. In our nodes we include a nodetype in every node, we might want to say "WHERE node.nodetype IN {N}" as our where clause.

Result should be in some JSON graph format, perhaps like JSON Graph Format (JGF) http://jsongraphformat.info/ or https://github.com/bruth/json-graph-spec.

This is exactly what's needed for visualization of a subgraph from a Javascript program. Since Neo4j graphs are often huge, even monstrous, having a query that does this would be incredibly handy...

Otherwise you have to go through and put this graph together a piece at a time (or so it seems to me).

I think I can write a query which returns  all the paths through this subgraph, but then I have to do the duplication elimination and so on myself. A single node might be returned many times in the possible paths result. Not an ideal situation...

This seems like a very common thing to want to do...

--
You received this message because you are subscribed to the Google Groups "Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email to neo4j+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Alan Robertson

unread,
Apr 28, 2016, 5:33:06 PM4/28/16
to ne...@googlegroups.com, Michael Hunger, Nigel Small
The query isn't quite right. It has an error in it (variable n already used) and something about the order of operations causes it to return more than one row - with duplicates ;-)

But this one seems to do the trick:
match path = shortestPath( (n)-[*]->(m) )
return collect(distinct unwind(nodes(path))), collect(distinct unwind(rels(path)))

What do you think?




On 04/28/2016 07:59 AM, Alan Robertson wrote:
Hi,

Michael Hunger separately provided me this query - which is a prototype query which looks like it will do what I want. It includes features of Cypher I was not familiar with (particularly unwind):

....
match path = shortestPath( (n)-[*]->(m) )
unwind nodes(path) as n
with collect(distinct n) as nodes
unwind rels(path) as r
return collect(distinct r), nodes
If I understand it correctly, this returns a single tuple with two elements:
  1. List of all individual relationships (no duplicates)
  2. List of all individual nodes (no duplicates)

I'll have to try this out!

I kind of wonder how this gets mapped into py2neo - what type are the tuple elements -- list(Relationship) and list(Node) or something like that...

Nigel?

I know my code won't currently handle this - but I can fix that ;-)

    -- Alan

Alan Robertson

unread,
Apr 29, 2016, 8:53:48 AM4/29/16
to ne...@googlegroups.com, Michael Hunger, Nigel Small
Hi,

I started looking at the results from this, and changing my code to deal with Relationships returned from a query - and discovered that this query didn't return a list of Relationships, it returned a list of Paths.

Not sure where the bug really is, but I filed it against Nigel as a py2neo issue.

https://github.com/nigelsmall/py2neo/issues/502

The best thing about this bug is that the query should work unmodified on any database with at least one relationship in it. The results will be different, but the types returned should not be.  I've quoted the bug report below for your reading pleasure ;-).


The query below should work on any database with at least one relationship in it. It should return one row with these fields:
1. nodes: list-of-Nodes
2. relationships: list-of-Relationships

MATCH start
MATCH p = shortestPath( (start)-[*]-(m) )
UNWIND nodes(p) as n
UNWIND rels(p) as r
RETURN collect(distinct n) as nodes, collect(distinct r) as relationships

But when I look at what py2neo is giving me, it returns
1. nodes: list-of-Nodes
2. relationships: list-of-Paths

This is with Neo4j 3.0.0, and Py2neo 2.0.8. It possible it's not your problem, but a Neo4j problem, or even a bug/deficiency in the REST API. But from reading your code, it looked like you weren't expecting this kind of result.

"neo4j::all":"3.0.0",
"py2neo": "2.0.8",

--
You received this message because you are subscribed to the Google Groups "Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email to neo4j+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Tim Colson

unread,
May 6, 2016, 12:05:21 AM5/6/16
to Neo4j, michael...@neotechnology.com, ni...@nigelsmall.com
match path = shortestPath( (n)-[*]->(m) )
return collect(distinct unwind(nodes(path))), collect(distinct unwind(rels(path)))
What do you think?
 
Thanks for the info, Alan! I too am interested in extracting subgraphs from a large hierarchy of wiki pages. :-)

What I'm doing seems less complex b/c that data is in a single inheritance hierarchy, but I'm still figuring out the best way to extract related nodes. 

I want to find "active" sub-trees in a hierarchy of ~100,000 wiki pages. I'm looking for pages recently updated (i.e. connected to recent "Day" nodes), then I want to select the page, it's peers, children, and lineage of parent pages up the tree to nearly the top of the "wiki space".    

I imported a Confluence wiki space into Neo, along with a Year/Month/Day time tree ala Mark Needham, and linked the Day nodes to the Page nodes based on modification Y/M/D. 

// Active match (but not complete) sub-tree  
MATCH
(y:Year)-->(m:Month)-->(d:Day)<--(p:Page)<-[r:PARENT_OF*]-(parent:Page)<-[:CONTAINS]-(s:Space)<--(server:Server)
where y.year=2015 // and m.month=1 // and s.key='DIRX'
return p,r,parent,s,server,y,m,d

That query yields all recently updated pages and parents, but not the peers, children, and related sub-trees (niece/nephews as it were). 




Anyway, I'm learning as I go and just wanted to share. :)
-Tim



Alan Robertson

unread,
May 8, 2016, 10:46:48 PM5/8/16
to Tim Colson, Michael Hunger, Neo4J
I think my original query needed "AS" clauses in the return statement.

I have a more complicated query that I am actually using. The key is that I want to only include nodes of a certain type, and certain types of relationships.

        MATCH start WHERE start.nodetype = 'Drone' AND start.designation = {name}                         
        MATCH p = shortestPath( (start)-[%s*]-(m) )  
        WHERE m.nodetype IN {nodetypes}
        UNWIND nodes(p) AS n
        UNWIND rels(p) AS r
        RETURN [x in COLLECT(DISTINCT n) WHERE x.nodetype in {nodetypes}] AS nodes,                        
        COLLECT(DISTINCT r) AS relationships

This is done with a %s format in it - that I have to fill in via a printf-like capability with the right :reltype|:reltype string. I don't know of a way to have a list of relationship types as a parameter (but maybe there is?).

There are two parameters {name} and {nodetypes} which give me the name of the starting node, and the types of nodes I'm interested in following through.

The [x in COLLECT(DISTINCT n) WHERE x.nodetype in {nodetypes}] list comprehension is there because shortestPath might follow these types of relationships through nodes I don't want in the output but ending up in nodes of a type that I do care about. Basically, there were unwanted nodes that came into the output that I wanted to filter out. The relationships that were associated with those nodes still come through, but they're pretty easily recognized.

If I know how to filter the shortestPath function to only go through nodes which meet some predicate, then this would all go away...

And if I could pass a list of relationship types, then I wouldn't have to do string substitution to create the query.

Anyone know of a way to fix these?

Alan Robertson

unread,
Jul 21, 2016, 4:09:09 PM7/21/16
to ne...@googlegroups.com, Nigel Small
On 04/28/2016 03:32 PM, Alan Robertson wrote:
The query isn't quite right. It has an error in it (variable n already used) and something about the order of operations causes it to return more than one row - with duplicates ;-)

But this one seems to do the trick:
match path = shortestPath( (n)-[*]->(m) )
return collect(distinct unwind(nodes(path))), collect(distinct unwind(rels(path)))
That query is quite fast -- but...

I don't actually want only the shortest paths, I want all the paths that involve those relationships.

But if I delete the shortestPath() function call it runs so long I get an HTTP timeout on many of these queries. I restrict the types of relationships with [:rel1type|:rel2type|:etctype*]. And even though the result only includes one or two additional paths, it runs like 100 times slower - and often just times out via REST (via Py2neo).

That's really disappointing.

Does anyone have a suggestion on how to make this faster - or at least not time out?

Alan Robertson

unread,
Aug 8, 2016, 6:41:11 AM8/8/16
to ne...@googlegroups.com, Michael Hunger, Eve Freeman, Alan Robertson
I had written up my problem here too:
     http://assimilationsystems.com/2016/07/03/assimilation-subgraph-queries/

Here's a thought which occurred to me while sleeping last night...
The two ways I've done it are:
  1. shortest paths from initial node set
  2. all paths

As noted in the original email thread below, the problems I ran into are:

  1. Shortest paths produces all nodes, but misses some relationships I care about.
  2. All paths produces all nodes and relationships but takes a really long time and often times out.
A way which occurred to me while sleeping tonight was to create a 2-step query like this:

Compute the set of shortest paths to get the set of nodes (as in (1) above). Ignore the relationships produced.
From that set of nodes (n) compute (n)-[:a|b|c|d]-(n) and return 'n', relationships.

If Cypher won't let me compute from 'n' to 'n', then compute the from n-[:a|b|c|d]-m WHERE m.nodetype is in the same set of node types as was in the original constraint. Then return n, relationships. [Although it's obvious, I'll mention that the difference is that the second query only follows a single level of relationship].

It still might need a little experimentation...

Any thoughts?

I'm going back to bed now. Will look into it some more in the morning... ;-)

Michael Hunger

unread,
Aug 20, 2016, 6:33:28 AM8/20/16
to Alan Robertson, ne...@googlegroups.com, Eve Freeman
Hey,


sorry for the delay. You could use the allShortestPaths option too.

And shortest paths actually take predicates into account that follow it in the where clause.

I don't know how you got this query to run, because UNWIND is not a function but a clause:
match path = shortestPath( (n)-[*]->(m) )
return collect(distinct unwind(nodes(path))), collect(distinct unwind(rels(path)))

For your other query, the initial lookup of those nodes will be slow (it has to scan the full graph) b/c you don't have a label (using an index) on them.
You probably also want to add a direction (if it makes sense)

Sorry I wrote the original statement just out of the top of my head, here is the fixed version

// bind n, m, limit rel-type(s)
match path = shortestPath( (n)-[*]->(m) )
// optionally WHERE with predicates for the shortest paths
unwind nodes(path) as node
with collect(distinct node) as nodes, collect(path) as paths
unwind paths as path
unwind rels(path) as r
with collect(distinct r) as relationships, nodes
return count(*)

// enable use of indexes, even if "Node" is just a generic label (I still think that using label-sets instad of nodetype is better
        MATCH (start:Node) WHERE start.nodetype = 'Drone' AND start.designation = {name}                          
        MATCH (m:Node) WHERE m.nodetype IN {nodetypes}
        MATCH p = shortestPath( (start)-[:`%s`*]-(m) )   
// this spans up a cross product between ALL nodes and ALL rels of the paths which can potentially become huge
// you can check it with returning count(*), count(distinct n), count(distinct r) after the two unwinds
        UNWIND nodes(p) AS n
        UNWIND rels(p) AS r 
        RETURN [x in COLLECT(DISTINCT n) WHERE x.nodetype in {nodetypes}] AS nodes,                         
        COLLECT(DISTINCT r) AS relationships

Alan Robertson

unread,
Aug 21, 2016, 2:13:20 PM8/21/16
to Michael Hunger, Alan Robertson, ne...@googlegroups.com, Eve Freeman
Thanks Michael!

I looked for a function like that, but I sure missed it!

Thanks again!
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Reply all
Reply to author
Forward
0 new messages