Re: [Neo4j] Need help with designing deep tree of type relationships - can Neo4J cope?

288 views
Skip to first unread message

Peter Neubauer

unread,
Nov 15, 2012, 10:28:40 AM11/15/12
to Neo4j User
Ian,
I think we can figure this out. Could you do a samll-ish sample over at http://console.neo4j.org/usage.html, so we have a dataset to play with? Then, we could reason about this. To start with, what do you mean by "the first one down each path"? In your example (in the console) what would that path be?

Thanks Ian!

/peter


Cheers,

/peter neubauer

G:  neubauer.peter
S:  peter.neubauer
P:  +46 704 106975
L:   http://www.linkedin.com/in/neubauer
T:   @peterneubauer

Neo4j 1.8 GA - http://www.dzone.com/links/neo4j_18_release_fluent_graph_literacy.html


On Thu, Nov 15, 2012 at 2:51 PM, Ian Tompsett <iantom...@gmail.com> wrote:
I'm after some database design help, which will need some explanation to set the context.

I'm using Neo4J (1.8 community edition, server mode - not embedded) to store a map of the type relationships within the DLLs deployed within my organisation. I don't think it matters, but they're .NET DLLs, generated from C# code. 

I have about 2500 DLLs or so to parse. In the process of parsing I build up a list of all C# types detected, and for each type the list of other types that it uses (by variable reference, method reference or base type reference). Hope that makes sense - I want to end up with a complete tree of type dependencies to allow impact analysis. 

Once I have all of the types, certain known types are identified - we call them 'Entity' types. Example entity types are web pages, web service proxies, database commands. 

There is then a GUI that sits on top of this information that allows a user to select a known entity and see it's tree of entity dependencies: up the tree to see the entities which use the entity in question (and all the way up the tree, not just one layer); and down the tree to see the entities used by the entity in question (and again further down to see downstream entity dependencies). In this view, the types that are not known entities are not shown, although their existence is still needed to link known entities.

So basically I'm looking at a very deep tree, with certain types of nodes to be shown, but the intervening nodes not shown

I can successfully get the data into the Neo4J database - I manage to get all of the DLL, type and type relationship information into a QuickGraph bidirectional graph which I then save as a GraphML file. Using Neo4jClient's Gremlin support it's then a one step operation to get the data imported into Neo4J via loadGraphML.

Once the data is there, I end up with 120702 different type nodes and 471144 relationships between these types - these relationships I have given the type 'TypeUsedByType'. 

From these 120702 type nodes I then manage to identify 6500 types that are my special 'Entity' types.

And then the uncertainty about what approach begins as I'm having trouble getting Neo4J to return the tree (up and down) for a single type. 

So far I have given each Type node that refers to an 'Entity'  an additional property - the numberic EntityTypeID of the particular entity type (eg WebPage, DBCommand, WebService etc). I have also created a set of nodes for the different entity types and added a relationship between each identified entity and the node for that entity type.

I would then like a set of relationships between the known entities - say 'EntityUsedByEntity'. This would cut out the middle men - the types that come between entities that aren't entities themselves. This is the set of relationships I would like to show in the GUI.

So to identify these EntityUsedByEntity relationships, I need to take each of my 6500 identified entities and go down its 'TypeUsedByType' relationships until a known entity is found. At that stage, I would like to stop (I don't need to go any further) and create an EntityUsedByEntity relationship between the source entity and the found entity.

Which comes to my question: how best to do this in a way that Neo4J can cope with. I should say up front, I'm trying this in cypher, but if gremlin seems a better choice I'll try that. Using the java traversal API doesn't seem an option as I'm accessing Neo4J as a server - and I'm writing in C#.

Ideally I'd like a single cypher query to create all the EntityUsedByEntity relationships

start entity=node:node_auto_index(NodeType = 'TypeVertex')   // currently I only have an index on the node type, but could create one for the entity nodes
match entity<-[:TypeUsedByType*1..]-entity2       // but how to match only the first one down each path?
where has(entity.EntityTypeId) and has(entity2.EntityTypeId)
create unique entity<-[:EntityUsedByEntity]-entity2

I had no luck with this - it went away and never came back, so I just took a single entity node and changed the start expression to just refer to it. ie start entity=node(99731). Even then it had troubles...

I've tried making the length of the relationships unbounded, but in the particular case I'm looking at, that doesn't seem to come back - I can bound it at 11 and it comes back in 17 seconds on my machine (i7 laptop running Windows XP). If I leave it unbound, it doesn't seem to come back.

As mentioned, when I identified my entities I also created a relationship back to the underlying entity node. I tried including that in the match clause, but that didn't seem to make any difference.

The problem seems to me that I don't know how many relationships to go down, so I can't use an upper bound. But that then means that cypher will find all the paths before applying the where clause. From my cursory look at the traversal API it seems possible to apply a filter and each step down, but I can't see how to do this in cypher - is it possible?

I must say that I'm a bit surprised that Neo4J has trouble with a deep tree like this. Is there something else I should be doing - smarter cypher, different relationships etc.

Thanks heaps if you've managed to read this far, I hope this makes enough sense. Any advice/suggestions would be greatly appreciated

BTW I used to have this data in Oracle and tried to find the relationships with connect by queries. It was having trouble (aka not coming back) when I was trying to populate all of the EntityUsedByEntity type relationships (by inserting into an EntityRelationship table) in one operation. But I'm pretty sure it would cope with my single type case that's causing me issues in Neo4J. But I was expecting Neo4J to be better at coping with these heavy relationships. Hopefully there's something reasonably obvious I'm missing here.

Ian

--
 
 

Ian Tompsett

unread,
Nov 15, 2012, 3:53:37 PM11/15/12
to ne...@googlegroups.com
Thanks Peter

I'll have a look at the best way to get a sample for you - I should be able to choose a subset DLLs that still have the appropriate depth. I haven't played with the console, and my initial data load was done via a bit of gremlin from within neo4jclient so that might take me a little while. There'll also be a timezone lag too - I'm in Australia.

In the meantime, re "the first down each path": my cypher queries return a (large) number of paths. Each of those paths follows a tree of type dependencies. Within each of those paths some of the nodes (ie types) will be my known entities. So far each path, I only want the path down as far as the first matching entity - even though the path could extend much further.

For example given a single path of type usage:  A->uses->B->uses->C->uses->D->uses->E->uses->F->uses->G->uses->H

And say A, D and F are known entity types, when starting from A I would like my query to return the path to D, without the need to get to F. I would then create a direct relationship between A and D, but no direct relationship between A and F. Separately I would create a relationship between D and F.

So far I've only managed to get my queries to return all of the paths in their entirety and then perform the filtering afterwards (using a where clause). This means that all of the paths are hugely deep and have a lot of repeated information. 

Hope that makes it a bit clearer.

Thanks
Ian

Michael Hunger

unread,
Nov 15, 2012, 4:14:07 PM11/15/12
to Neo4j, Tatham Oddie, Romiko Derbynew
Ian,

if you're in Australia and using neo4jclient you should try to chat to Tatham and Romiko which are also in downunder and the authors of that library. They have quite a lot of modeling experience.

Michael

--
 
 

Ian Tompsett

unread,
Nov 15, 2012, 4:34:25 PM11/15/12
to ne...@googlegroups.com, Tatham Oddie, Romiko Derbynew
Thanks Michael

I had wondered the same thing, but for now my issue has morphed into more of a cypher question rather than a Neo4jClient issue. But drawing on Tatham and Romiko's modeling experience would be easier given the timezone compatibility - any idea how to get their attention here? Do you know if they cruise these forums much?

Ian

Jens Dietrich

unread,
Nov 15, 2012, 7:42:10 PM11/15/12
to ne...@googlegroups.com, Tatham Oddie, Romiko Derbynew
Hi Ian,

I had to deal with a similar scenario - finding patterns in software dependency graphs that need reasoning about paths (= computation of transitive closure). If you are interested, this has resulted in this tools: http://xplrarc.massey.ac.nz/ , the queries are in the analyse menu. The largest graphs (extracted from the JRE) have a similar size to your graphs.

In my experience the current version of cypher does not cope very well with this (there are some older discussions on this issue in the forum, and a benchmarking project to demonstrate some of the problems: http://code.google.com/p/graph-query-benchmarks/). You might want to try guery as an alternative to query your neo4j database - http://code.google.com/p/gueryframework/ - the latest version has a blueprints adapter. Have a look at the test cases, they contain some examples similar to what you need. 

Cheers, Jens (from even more down under .. NZ) 

Ian Tompsett

unread,
Nov 16, 2012, 12:09:05 AM11/16/12
to ne...@googlegroups.com
OK, I think I've created a suitable (albeit very small) test case at http://tinyurl.com/chyu863

Not surprisingly this returns queries pretty quickly, but should be enough to get the structure in place. I've added some noise in there to include some paths that shouldn't be followed.

Hopefully it's clear enough - TypeA is the start point (node 4) and it's an entity. It has 3 direct type relationships - to TypeB, Type AC and Type AD. And then there's downstream type relationships, some of which are known entity types (reflected by both an EntityTypeId property, and a relationship back to the EntityType nodes). TypeA has entity relationships to 2 and 3, but not 4 - as that's via TypeAD

So initially what I'd like is the most efficient cypher query for detecting the direct entity relationships for TypeA - EntityType2 and EntityType3 in my example. That is, I'd like to see the following relationships returned

TypeA-TypeD
TypeA-TypeAC
TypeA-TypeAD

Thanks again for your help

Niels Hoogeveen

unread,
Nov 16, 2012, 9:25:56 AM11/16/12
to ne...@googlegroups.com
I don't have a direct answer to the situation described here, though I would like to add my two cents anyway.
 
Thinking about this problem yesterday, it occurred  to me that a traversal description over a graph can be viewed as a formal language expression, where the alphabet of the language consists of RelationshipTypes.
 
The following operations can easily be achieved:
 
concatenation: creation of paths
union: choice  between two traveral descriptions
kleene star: repeated traversal description
 
These three operations give a regular language, and with that have the same expressive power as a regular expression.
 
It is possible to add intersection and complement, resulting in a visibly pushdown language. I am not an expert on Cypher or Gremlin, but as far as I am concerned, all these operations can be supported in a query language. For the content management system multispective.com, I have implemented these operations, giving a reasonably powerful language for querying a Neo4j database.
 
Niels

okkyt...@gmail.com

unread,
Nov 16, 2012, 1:32:34 AM11/16/12
to ne...@googlegroups.com
T
Sent from my BlackBerry®
powered by Sinyal Kuat INDOSAT

From: Ian Tompsett <iantom...@gmail.com>
Date: Thu, 15 Nov 2012 21:09:05 -0800 (PST)
Subject: Re: [Neo4j] Need help with designing deep tree of type relationships - can Neo4J cope?
--
 
 

Peter Neubauer

unread,
Nov 17, 2012, 10:20:50 AM11/17/12
to Neo4j User
Ian,
I am not sure I am totally with you, but this query gives you the results you want:

start t1=node(5) 
match p=t1-[:TypeUsedByType*1..]-t2-[:Entity]->e1 
return t1.name, e1.name,t2.name, length(p) as l 
order by l asc 
limit 3


Does that make sense?

/peter


Cheers,

/peter neubauer

G:  neubauer.peter
S:  peter.neubauer
P:  +46 704 106975
L:   http://www.linkedin.com/in/neubauer
T:   @peterneubauer

Neo4j 1.8 GA - http://www.dzone.com/links/neo4j_18_release_fluent_graph_literacy.html


--
 
 

Ian Tompsett

unread,
Nov 18, 2012, 9:22:54 AM11/18/12
to ne...@googlegroups.com
Thanks Peter

What I'm after is a reusable way of only picking up the types that are entities that are directly related to TypeA. By directly related, I mean there are no other entities between the type in question and TypeA.

If you try the following

start t1=node(5) 
match p=t1-[:TypeUsedByType*1..]->t2-[:Entity]->e1 
return t1.name, e1.name,t2.name, length(p) as l, extract (n in nodes(p) : n.name)

You'll see the TypeAF appears in the list. You've managed to filter out by basically knowing the results up front (ie the Limit 3). What I'd like to do is filter it out by detecting that there is another entity (TypeAD) between TypeAF and TypeA.

In the cypher query I've played around with trying a second match clause to exclude t2 values that satisfy the first match, but do not have an entity between them and t1. So that would exclude TypeAF in our example.

Hope that makes sense? Any ideas how to cypher that? Something like

Something like:
start t1=node(5) 
match p=t1-[:TypeUsedByType*1..]->t2-[:Entity]->e1 ,
      p2=t2<-[:TypeUsedByType*1..]-t3<-[:TypeUsedByType]-t1, p3=t3-[:Entity]-e2
where p1 exists but p2 and p3 don't???
return t1.name, e1.name,t2.name, length(p) as l, extract (n in nodes(p) : n.name)

I think I'm looking for an outer join that fails sort of concept here, to use SQL parlance.

Thanks

Ian Tompsett

unread,
Nov 20, 2012, 5:03:17 PM11/20/12
to ne...@googlegroups.com
HI Peter (or anyone else..)

Have you had a chance to look at this? I'm getting concerned that the answer to my initial question is "no" - especially given Jens' comment.

Whilst I do have a deep hierarchy to contend with, I don't need to look at it all, I'd just like to find a way of getting my cypher query to understand how early it can stop.

Unless my only option is the traversal API?

Thanks
Ian

Michael Hunger

unread,
Nov 20, 2012, 6:22:21 PM11/20/12
to ne...@googlegroups.com
Ian,

I think you want to filter paths, depending on the nodes that are in there:

something like this:

start t1=node(5) 
match p=t1-[:TypeUsedByType*1..]->t2,t2-[:Entity]->e1 

where SINGLE(n in tail(nodes(p)) 
             where has(n.EntityTypeId)) 

return t1.name, e1.name,t2.name, length(p) as l, extract (n in nodes(p) : n.name)

I check if there is only a single node (aka t2) that has an EntityType property
tail(nodes(p)) because otherwise t1 would also be in there.
and e1 also has a property EntityType so I split the path :)


Cheers

Michael

--
 
 

Marko Rodriguez

unread,
Nov 20, 2012, 5:05:44 PM11/20/12
to ne...@googlegroups.com

Hi,

Have you looked at Gremlin? It is lazy and has path and tree operators.

Good luck,
Marko.

http://markorodriguez.com

--
 
 

Ian Tompsett

unread,
Nov 21, 2012, 7:13:52 AM11/21/12
to ne...@googlegroups.com
Thanks Michael

That looks promising, although I haven't had a chance to try it with my real set of data (I'm not working today) to guage performance. I'll let you know.

Ian

Ian Tompsett

unread,
Nov 21, 2012, 7:18:30 AM11/21/12
to ne...@googlegroups.com
Hi Marko

Firstly, apologies if my reply gets duplicated - I posted a reply (or at least thought I did) but it doesn't seem to have appeared.

I had a little look at gremlin - enough to think it has potential. But it's a bit foreign to me so I haven't quite got my head around how to try it. Don't suppose you've any examples doing something similar that you could point me at? (Gremlin isn't the only lazy one).

Thanks
Ian

Ian Tompsett

unread,
Nov 25, 2012, 11:54:53 PM11/25/12
to ne...@googlegroups.com
Michael

Well I've finally managed to try it with my real data. It worked out for a simple case - a type that had one Entity related to it directly. That type wasn't using many others. Then I went a little up the type usage hierarchy, but by no means to the top, and it collapsed in a heap. I tried it once and went to lunch (after it had been going for about half an hour). When I came back it had stopped with an "Invalid Query" error (why it couldn't tell me that straight away I don't quite understand). I tried again and several hours later still no result.

The type I chose used 5 entities directly (detected by leaving out the '*1..' from your query), but I don't know how deep the hierarchy would be underneath - looking at the hierarchy in an unbounded manner like that seems a recipe for failure.

This I'm doing via the Data browser web interface.

I used your suggested query verbatim (apart from changing the start node).

I think you now understand my data requirements, do you think there is another way I should structure things (eg create different relationships) that would help?

Or perhaps I'm faced with the choice of trying to work out how to do this using Gremlin or just accepting that this isn't Neo4J's thing and looking elsewhere.

Thanks
Ian
Reply all
Reply to author
Forward
0 new messages