neo4j-import non-deterministically corrupts a few node ids

Zongheng Yang

Jun 4, 2015, 11:12:43 PM
to ne...@googlegroups.com
Hi all,

I'm using neo4j-import to import nodes and relationships from csv files. Let's say node id 538398 has about 100 edges and

538398 -> 370047
538398 -> 379981

are just two of them.  After the import, the neo4j database actually 

- *loses* these two edges
- instead *corrupts* the destination ids, as follows

    538398 -> 380047
    538398 -> 389981

- *keeps* all other outgoing edges of 538398 correct

The problem seems to be non-deterministic: doing a `rm -rf dbPath` and re-running neo4j-import fixes the issue for this particular node, but I have not tested extensively whether other nodes get corrupted in this way.

Has anyone seen this before? The graph has on the order of 1 million nodes, with an average degree of 40.

Zongheng

Mattias Persson

Jun 11, 2015, 5:32:55 AM
to ne...@googlegroups.com
Hi, I'm one of the main authors of the import tool and I find this issue quite interesting.

Would you be able to share your dataset with me personally, for the single purpose of trying to find the root cause?

Mattias Persson

Jun 15, 2015, 8:23:24 AM
to ne...@googlegroups.com
Hello again, I'm quite confident I know what's happening here. The problem is the misconception that your INTEGER ids defined in the csv files will map 1-to-1 to the neo4j node/relationship ids in the store. They will actually match in most cases, but that's merely a coincidence.

What you're seeing is the result of parallelism in the importer: batches of 10k nodes/relationships flow through different steps, and some steps may execute multiple batches in parallel without caring whether reordering happens. Ids are assigned at the end.

You're looking at the ids and see that they mismatch, but if you look at their data you should see that all relationships match the csv files. So please disregard the seemingly close match of neo4j node/relationship ids with the csv input ids as they are quite different in nature.
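
One way to picture the reordering described above is a toy model in Python. It is not the importer's actual code: the function name, the sequential id counter, and the assumption that exactly two adjacent batches finish out of order are all made up for illustration. Only the batch size of 10k comes from the explanation above. Under those assumptions, swapping batches 37 and 38 reproduces the shifted ids from the original report:

```python
BATCH_SIZE = 10_000  # batch size mentioned in the thread


def assign_internal_ids(csv_ids, batch_order):
    """Toy model: hand out internal ids sequentially, in the order batches finish.

    csv_ids     -- ids as they appear in the node csv, in file order
    batch_order -- hypothetical order in which the batches reach the final step
    """
    batches = [csv_ids[i:i + BATCH_SIZE] for i in range(0, len(csv_ids), BATCH_SIZE)]
    internal_of = {}
    next_id = 0
    for b in batch_order:
        for csv_id in batches[b]:
            internal_of[csv_id] = next_id
            next_id += 1
    return internal_of


csv_ids = list(range(600_000))      # 60 batches of 10k nodes
order = list(range(60))
order[37], order[38] = order[38], order[37]  # assume two batches got reordered

internal_of = assign_internal_ids(csv_ids, order)
print(internal_of[370047])  # 380047
print(internal_of[379981])  # 389981
print(internal_of[538398])  # 538398 (its batch was not reordered)
```

In this model the node with csv id 370047 still exists and still has its edge from 538398. It simply lives at internal id 380047, which is why the data can match the csv files even though the ids do not.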

Zongheng Yang

Jun 15, 2015, 2:59:54 PM
to ne...@googlegroups.com
Hi Mattias,

Thanks for looking into this.  I understand the difference between Neo4j internal ids vs. the ids supplied in the csv. 

However, for a method such as GraphDatabaseService#getNodeById(long id), does it take the user-supplied ids or Neo4j's internal ids?

If it is the former, then the conceptual mismatch doesn't fully explain the problem (I queried the nodes/edges using user-supplied ids, so the internal ids should not affect the query results). If it is the latter, then how should users of the Java Core API obtain the correct internal ids, given that they only know the application-supplied ids?

Best,
Zongheng

Michael Hunger

Jun 15, 2015, 3:07:56 PM
to ne...@googlegroups.com
GraphDatabaseService#getNodeById(long id)

takes Neo4j internal ids.

Michael

Zongheng Yang

Jun 15, 2015, 3:31:48 PM
to ne...@googlegroups.com
I see.  Would setting the `--processors 1` flag for neo4j-import make internal ids and external ids match in my case?  (I understand this is an implementation detail and not a user-facing property.)

Michael Hunger

Jun 15, 2015, 3:34:18 PM
to ne...@googlegroups.com
No, but `--id-type actual` would. You then have to make sure your ids are globally unique and incrementing, without large holes in the distribution.

Zongheng Yang

Jun 15, 2015, 3:43:38 PM
to ne...@googlegroups.com
Fantastic, in my case the ids are exactly the sequence [0, 1, ..., N] without gaps, unique, and in that order.

Thanks both of you for the help!

Mattias Persson

Jun 16, 2015, 7:55:18 AM
to ne...@googlegroups.com
Yes, I agree --id-type ACTUAL will guarantee this constraint.

Zongheng Yang

Sep 12, 2015, 5:05:31 PM
to ne...@googlegroups.com
I think I got hit by this issue again, on a different dataset.

Mattias / Michael, could you clarify what "without large holes in the distribution" precisely means?

My node csv has one line per node, and line K (0-indexed) uniquely corresponds to the data of node K (0-indexed). There are exactly as many lines as there are nodes in the graph, so it should respect this property.

However, for the edge csv, does it have to satisfy any special property?
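
The node csv property described above (line K carries id K) can be checked before importing. The following is a minimal Python sketch, assuming the id sits in the first column and that any header line has already been skipped. The function name and the sample rows are hypothetical and not part of neo4j-import:

```python
import csv
import io


def first_id_mismatch(rows, id_column=0):
    """Return the first 0-indexed row K whose id is not K, or None if every row matches.

    If this returns None for N rows, the ids are exactly 0..N-1: unique,
    gap-free, and in file order.
    """
    for k, row in enumerate(rows):
        if int(row[id_column]) != k:
            return k
    return None


# Made-up sample data standing in for a real node csv (header already removed).
good = io.StringIO("0,alice\n1,bob\n2,carol\n")
bad = io.StringIO("0,alice\n2,bob\n1,carol\n")

print(first_id_mismatch(csv.reader(good)))  # None
print(first_id_mismatch(csv.reader(bad)))   # 1
```

This only covers the node file. It says nothing about what the edge csv must satisfy.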

Michael Hunger

Sep 12, 2015, 6:24:00 PM
to ne...@googlegroups.com
It means that you don't have ids that are huge, e.g. 100M or 5bn, while having only a few nodes. In that case the store file would grow to accommodate the huge record id.
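
As a rough illustration of that point, the Python sketch below compares the number of distinct ids with the number of record slots that would be needed if the store has to reach up to the largest id. That "one slot per id up to the maximum" model is an assumption drawn from the description above, and the function name is hypothetical:

```python
def id_space_usage(ids):
    """Return (slots, fraction_used) for a collection of non-negative integer ids.

    slots         -- max id + 1, i.e. how far the id space has to reach (assumed model)
    fraction_used -- share of those slots that hold an actual node
    """
    distinct = set(ids)
    if not distinct:
        raise ValueError("no ids given")
    slots = max(distinct) + 1
    return slots, len(distinct) / slots


print(id_space_usage(range(1_000_000)))          # (1000000, 1.0): dense, no holes
print(id_space_usage([0, 1, 2, 5_000_000_000]))  # 4 nodes, about 5bn slots
```

A fraction close to 1.0 corresponds to the gap-free case discussed earlier in the thread. A tiny fraction is the "few nodes, huge ids" situation.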

Which version are you on? Afaik Mattias fixed an issue in that area?

Michael

Zongheng Yang

Sep 13, 2015, 1:15:59 PM
to ne...@googlegroups.com
Michael, thanks for chiming in.

This turned out to be a mistake in my ETL process, which was using outdated input. I'm using 2.2.2; are there any critical fixes in newer versions?

Michael Hunger

Sep 13, 2015, 2:15:48 PM
to ne...@googlegroups.com
Yes, see the release notes: http://neo4j.com/release-notes