referential integrity issue

Jens Dietrich

unread,

Sep 26, 2012, 12:31:47 AM9/26/12

to ne...@googlegroups.com

Hi,

I am experimenting with neo4j/cypher at the moment and I am puzzled about the following behaviour. I obtain references to two nodes if1 and cl1 using a node index. I do know that there is a relationship rel from if1 to cl1. I obtain this relationship and get its endNode. I expect this to be cl1, however, it is not! The endPoint is equal, but not identical to cl1!

This is a big problem if I navigate through graphs with cycles – I keep creating more and more objects all referencing the same nodes in the db!

What is the reasoning behind this feature? Is this perhaps just a bug? I had expected that neo4j behaves more like an ORM (hibernate etc) here: use a cache (usually implemented as a soft hash map) that maps ids to objects, and whenever a new object is requested from a datastore, first check whether it is already in memory and if so use the existing one.

Any advise / hint how to avoid this is appreciated!

Code below.

Full code here (can be checked out and executed as test case):

http://code.google.com/p/graph-query-benchmarks/source/browse/trunk/graph-query-benchmarks/src/test/nz/ac/massey/cs/graphbenchmarks/spikes/neo4j/NeoIssues.java

a drawing of the graph used in this example is here: http://goo.gl/VGxKR

Kind regards, Jens

@Test

public void testReferentialIntegrity() throws Exception {

Index<Node> nodeIndex = db.index().forNodes( "nodes" );

Node cl1 = nodeIndex.get("qname", "com.example.Class1").getSingle();

Node if1 = nodeIndex.get("qname", "com.example.Interface1").getSingle();

// single relationship

Relationship rel = if1.getRelationships(Direction.OUTGOING).iterator().next();

Node endNode = rel.getEndNode();

assertEquals(cl1,endNode); // succeeds

assertTrue(cl1==endNode); // fails !

}

Wes Freeman

unread,

Sep 26, 2012, 12:44:28 AM9/26/12

to ne...@googlegroups.com

You should not use == to compare objects like that in Java, in general, unless you know that two references point to exactly the same instantiation.

The thing is, you're instantiating two objects (one via index.get(...).getSingle, and one via getEndNode) that represent the same node, so equals returns true, but == returns false. That makes perfect sense to me.

In order for it to act differently, it (the driver?) would need to know which nodes you have retrieved and instantiated in your code, and when you ask for one of those nodes again, give you a new reference to the same node instance (rather than simply populating a new instance and giving you a new reference). Seems a bit much to expect, and would probably even be confusing to some who expect to receive new instances.

In short, you should just use .equals() to compare nodes.

Disclaimer: I'm mostly speculating here--I haven't actually verified anything in code.

Wes

Jens Dietrich

unread,

Sep 26, 2012, 12:53:19 AM9/26/12

to ne...@googlegroups.com

Hi Wes,

Sorry but I am not convinced by your argument - objects and their relationship form a graph, and this should be consistent with the database. This is the same issue as having objects pointing to rows in tables - you dont want two objects representing the same row. This wastes memory, and leads to inconsistencies if inconsistent copies occur on different objects representing the same persistent entity.

I am well aware of the difference between equals and ==, but I really expected identity (not equality) here!

Cheers, Jens

Wes Freeman

unread,

Sep 26, 2012, 12:53:46 AM9/26/12

to ne...@googlegroups.com

Sorry, I somehow missed your paragraph about caching and ORMs--clearly you know what needs to be done on the backend and are just asking why it wasn't done that way. (So you didn't need my 3rd paragraph.)

Anyway, I generally don't assume that == will return true for objects, even using an ORM where I'm comparing something theoretically to itself AND I know it's cached behind the scenes. Regardless of the implementation... you're making lots of assumptions about the cache implementation.

Wes

Wes Freeman

unread,

Sep 26, 2012, 1:15:46 AM9/26/12

to ne...@googlegroups.com

Hah. I should have gone to bed rather than making a half-baked attempt at a reply to the first email!

Good luck convincing the Neo4j team to implement caching for the embedded DB. :)

So far I'm only using the REST service, through various client libraries which definitely don't cache as described (they build a new object out of the REST responses--I actually have seen that code).

Goodnight. :)

Wes

Lasse Westh-Nielsen

unread,

Sep 26, 2012, 3:51:25 AM9/26/12

to ne...@googlegroups.com

Jens,

Your analysis is correct, obviously. But I think for your use case (traversing the graph) you would be better off using the Traversal Framework: http://docs.neo4j.org/chunked/stable/tutorial-traversal-java-api.html

I am guessing here, but I assume it is optimised wrt memory etc.

Lasse

Mattias Persson

unread,

Sep 26, 2012, 4:35:11 AM9/26/12

to ne...@googlegroups.com

Node and Relationship objects returned from the API are proxying the cached nodes and relationships so that cache eviction isn't depending on whether or not the domain/application code holds on to these node/relationship references or not. So getting back the same node at two different occasions will yield two different proxy instances.

2012/9/26 Lasse Westh-Nielsen <lasse.wes...@neopersistence.com>

--
Mattias Persson, [mat...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com

Jens Dietrich

unread,

Sep 26, 2012, 5:29:41 AM9/26/12

to ne...@googlegroups.com

Interesting - is there an easy way to control this, e.g. could I plug in my own factory to create those proxies?

Cheers, Jens

Mattias Persson

unread,

Sep 26, 2012, 7:27:03 AM9/26/12

to ne...@googlegroups.com

No, would that be merely for performance/memory issues then?

2012/9/26 Jens Dietrich <jens.d...@gmail.com>

Jens Dietrich

unread,

Sep 26, 2012, 6:28:12 PM9/26/12

to ne...@googlegroups.com

Hi Mattias,

I want to run already implemented graph algorithms (in particular a motif/pattern and cycle detection) on a neo4j db. The algorithms use a rather generic api (very much like jung, with arbitrary vertex and edge types), but match nodes using ==.

Technically, I try to implement an adapter to run guery (http://code.google.com/p/gueryframework/ ) on top of a neo4j db. I did some benchmarking, and for graphs of non-trivial size guery scales much better when evaluating complex queries than cypher (which often just runs out of memory). So, indirectly, performance is the reason why I am interested in this.

I can understand why you are proxying nodes, but I am still puzzled why you don't reuse proxies by using a soft map (keys would be ids, values proxies). This would result in less proxies being created, and should not have any negative impact with memory usage (it would actually improve it - less object creation and gc).

BTW, I will be at SPLASH/OOPSLA and I am interested to attend the Neo4j workshop there. Perhaps there is chance to talk to you or somebody else from neo4j about what I am trying to do and why.

Cheers, Jens

Marko Rodriguez

unread,

Sep 26, 2012, 6:31:51 PM9/26/12

to Jens Dietrich, ne...@googlegroups.com

Hi Jens,

Guery looks interesting. I like that it connects via JUNG. Do you use Blueprints' GraphJung interface to connect this to Neo4j?

Thanks,

Marko.

http://thinkaurelius.com

Jens Dietrich

unread,

Sep 27, 2012, 12:44:00 AM9/27/12

to ne...@googlegroups.com, Jens Dietrich

Hi,

No, but I will try this - thanks for pointing this out. Guery has its own abstraction layer graph adapter, and I tried to implement an adapter for neo4j but starting running into problems, including the referential integrity issue.

Cheers, Jens

Peter Neubauer

unread,

Sep 27, 2012, 2:41:52 AM9/27/12

to ne...@googlegroups.com, Mattias Persson

Jens,
this looks interesting, and is probably a result of Nodes being proxy
objects, that are not == but Equal. Mattias, do you know more about
that?

Cheers,

/peter neubauer

G: neubauer.peter
S: peter.neubauer
P: +46 704 106975
L: http://www.linkedin.com/in/neubauer
T: @peterneubauer

Wanna learn something new? Come to http://graphconnect.com

> --
>
>

Jens Dietrich

unread,

Sep 27, 2012, 7:03:48 PM9/27/12

to ne...@googlegroups.com, Jens Dietrich

Hi Marko,

I had a look - this is a great idea but the problem for me is that this works for undirected graphs, whereas the algorithms I use work on directed graphs, i.e. I need the interface edu.uci.ics.jung.graph.DirectedGraph. I am new to blueprints, but it seems to support the idea of direction (with the Direction enum) - would you have any idea how to get a directed jung graph out of blueprints?

guery uses its own abstraction api at the moment (graph adapter, see http://code.google.com/p/gueryframework/source/browse/src/java/nz/ac/massey/cs/guery/GraphAdapter.java ) and it shouldn't be to hard to map this to blueprints / neo4j.

I have started to implement some performance tests to compare the different frameworks (http://code.google.com/p/graph-query-benchmarks/ ) and cypher on neo4j does not do very well there (it runs out of memory very quickly even for small graphs - tests used are here: http://graph-query-benchmarks.googlecode.com/svn/trunk/graph-query-benchmarks/src/nz/ac/massey/cs/graphbenchmarks/neo4j/ , the equivalent tests for guery are here: http://graph-query-benchmarks.googlecode.com/svn/trunk/graph-query-benchmarks/src/nz/ac/massey/cs/graphbenchmarks/guery/).

This is probably not a neo4j but a cypher issue, and I am keen to write an adapter to find out. guery is good at fast + low memory querying, but does not have any db functionality like transactions and persistency. But there might be some merit in combining it with neo4j to get the best of both worlds.

Cheers, Jens

On Thursday, 27 September 2012 10:33:02 UTC+12, Marko Rodriguez wrote:

Marko Rodriguez

unread,

Sep 27, 2012, 7:24:18 PM9/27/12

to ne...@googlegroups.com, Jens Dietrich

Hi,

Here is some documentation:

https://github.com/tinkerpop/blueprints/wiki/JUNG-Ouplementation

http://tinkerpop.com/docs/javadocs/blueprints/2.1.0/com/tinkerpop/blueprints/oupls/jung/GraphJung.html

Blueprints makes use of a directed graph model, not undirected. I don't know why I didn't implement DirectedGraph (did it so long ago). Perhaps we can simply add it with an "extends ..." ? I made a ticket: https://github.com/tinkerpop/blueprints/issues/306

If you map it to Blueprints then you get the benefit of Guery running over:

0. TinkerGraph

1. Neo4j

2. OrientDB

3. InfiniteGraph

4. Titan

5. DEX

6. OpenRDF quad/triple stores

For many of these systems (e.g. Neo4j), Blueprints serves as a very thin layer.

As you say, I also believe that Neo4j is not the cause of the memory issues.

Not to "steal you away," but if you are interested in TinkerPop, you can join the Gremlin-Users mailing list.

https://groups.google.com/forum/?fromgroups#!forum/gremlin-users

Likewise, we are doing some work over at Aurelius that is focused on distributed graph databases and distributed graph batch-processing that is natively Blueprints-oriented. If there is a mapping from Guery to Blueprints then your approach would be very welcome.

http://thinkaurelius.github.com/titan/

http://thinkaurelius.github.com/faunus/

Mailing list: https://groups.google.com/forum/#!forum/aureliusgraphs

Nice to meet you and hopefully we can share more thoughts into the future.

Take care,

Marko

http://thinkaurelius.com

Jens Dietrich

unread,

Sep 27, 2012, 7:48:03 PM9/27/12

to ne...@googlegroups.com, Jens Dietrich

Hi,

Thanks for those hints, this sounds all good. How does the blueprints adapter for neo4j deal with the referential integrity issue that sparked this discussion? Or does it just map interfaces?

Cheers, Jens

Marko Rodriguez

unread,

Sep 27, 2012, 8:08:56 PM9/27/12

to ne...@googlegroups.com, Jens Dietrich

Hi,

It just maps interfaces. There is very little logic between Blueprints and Neo4j. Other databases more so, but Blueprints was primarily inspired by the Neo4j API (and JUNG) so it has always had a strong alignment with Neo4j's "look and feel."

See ya Jens,

Marko.

http://markorodriguez.com

Peter Neubauer

unread,

Sep 27, 2012, 8:52:13 PM9/27/12

to ne...@googlegroups.com, Jens Dietrich

Jens,
yes, Cypher is not (yet) performance optimised. This will change in
the future to a point where it will not use the Neo4j Java API but
more performant ways of accessing the underlying layers that is not
constrained by the way the Neo4j Java API (and Blueprints) is using
stepwise moves from nodes->relationships->nodes which all operations
that are using that abstraction have to be piped through in order to
be usable. We are trying to get a declarative way of doing things into
place with Cypher in order to open up underlying changes without users
having to worry about it, and also to open for remote access to
non-JVM systems in a more portable way than an embedded JavaAPI (or
shipping a Groovy engine on the remote side). Also, Cypher allows you
to express higher order intents like Patterns rather than imperative
moves through the graph.

Right now, Cypher is slower (and not optimized yet) then the Neo4j
Java API since it relies on it for implementation. Over time, that
will change.

So, I would love to have you do an adapter for both an embedded Java
API of your choice, and Cypher so we could actually measure the
development over time there?

Cheers,

/peter neubauer

G: neubauer.peter
S: peter.neubauer
P: +46 704 106975
L: http://www.linkedin.com/in/neubauer
T: @peterneubauer

Wanna learn something new? Come to http://graphconnect.com

> --
>
>

Jens Dietrich

unread,

Sep 27, 2012, 9:56:53 PM9/27/12

to Peter Neubauer, ne...@googlegroups.com

Hi Peter,

I will implement an adapter for blueprints as suggested for better benchmarking.

I would appreciate if you could have a quick look at the benchmark project to see whether I use neo4j queries correctly.

The project is here: http://code.google.com/p/graph-query-benchmarks
neo4j tests: nz.ac.massey.cs.graphbenchmarks.neo4j
equivalent guery tests: nz.ac.massey.cs.graphbenchmarks.guery
the queries are static vars in the Queries classes in the respective packages

Something interesting happens when I run Neo4JTestSmallGraph_JMoney#testSTK3 (in this query, the length of paths in the patterns is restricted to <=3, if I remove this restriction, the queries will fail with an out of memory error!). One result is this (key -> value):

subtype -> net.sf.jmoney.model.DoubleEntry
supertype -> net.sf.jmoney.model.Entry
inherits -> net.sf.jmoney.model.DoubleEntry > net.sf.jmoney.model.Entry
uses -> net.sf.jmoney.model.Entry > net.sf.jmoney.model.DoubleEntry > net.sf.jmoney.model.Account > net.sf.jmoney.model.DoubleEntry

Note that the binding of uses contains the same node twice, i.e. the path has a loop. Cypher does not seem to catch this! There might be an explanation, but it is more likely that this is a serious bug. This would also explain that the other queries run out of memory. There are also some tests to check whether one can safely iterate over some solutions (i.e., whether the computation is "on demand") - testSTKExtendsFind1() and testSTKExtendsFind10(). They run out of memory as well - so it looks to me that cypher tries to precompute all results, even though there is no aggregation being used in this query.

This could indicate that you might be better off by investing into pattern matching than in replacing high level by low level apis.

BTW, will you run the tutorial at SPLASH/OOPSLA ? I am interested to attend this (I actually started to play around with neo4j to prepare for this ..). Would be good to have a chat.

Cheers, Jens

Peter Neubauer

unread,

Sep 28, 2012, 4:33:55 AM9/28/12

to ne...@googlegroups.com

Jens,
checking out the sources right now (but takes some time from American
Hotel WIFI) and will try to look at it tomorrow.

Jim (not James) Webber is running the tutorial, I think you both will
enjoy it, http://splashcon.org/2012/program/435 for reference.

Cheers,

/peter neubauer

G: neubauer.peter
S: peter.neubauer
P: +46 704 106975
L: http://www.linkedin.com/in/neubauer
T: @peterneubauer

Wanna learn something new? Come to http://graphconnect.com

> --
>
>

Peter Neubauer

unread,

Sep 28, 2012, 5:28:10 AM9/28/12

to ne...@googlegroups.com

Jens,
the Identity issues were already discussed by Mattias (btw, feel free
to test another proxy strategy and benchmark it - contribution is very
welcome). I just ran the Cypher query - looks like there might a bug
around uniqueness, need to investigate more. Will report back.

Cheers,

/peter neubauer

G: neubauer.peter
S: peter.neubauer
P: +46 704 106975
L: http://www.linkedin.com/in/neubauer
T: @peterneubauer

Wanna learn something new? Come to http://graphconnect.com

> --
>
>

Jens Dietrich

unread,

Sep 28, 2012, 5:32:02 AM9/28/12

to ne...@googlegroups.com

Thanks Peter, appreciated !

Jim Webber

unread,

Sep 28, 2012, 11:21:42 AM9/28/12

to ne...@googlegroups.com

Hi Jens,

> BTW, I will be at SPLASH/OOPSLA and I am interested to attend the Neo4j workshop there. Perhaps there is chance to talk to you or somebody else from neo4j about what I am trying to do and why.

That'll be me. See you then!

Jim

Reply all

Reply to author

Forward