Importing Paradise Papers relationships CSV file


leet.h...@gmail.com

Nov 22, 2017, 3:32:17 PM
to Neo4j
Hi! Has anyone here worked with the Paradise Papers CSV dataset? (https://offshoreleaks.icij.org/pages/database) The ICIJ used Neo4j for their graph DB and offer the CSV files of the data at that link. I was able to create the nodes for the graph, but I'm having a tough time creating the relationships from the edges CSV. The import is running right now (~4 hours so far), but I'm hoping there is a better way out there than how I did it!

The difficulty for me, apart from being new to Neo4j, is that the edges CSV contains all the relationships (5 different types), each specified by the node_id of its source and target node. A node_id is unique to a node, which is one of 5 node types. So I figured that I could write a statement (ignoring properties) that would read each CSV row as 'line' and then:

MATCH (n1 {node_id: line.`node_1`}), (n2 {node_id: line.`node_2`})
CREATE (n1)-[:line.`rel_type`]->(n2);

The problem with this is that, as far as I can tell, you can't programmatically specify the relationship type. So I came up with the following:

MATCH (n1 {node_id: line.`node_1`}), (n2 {node_id: line.`node_2`})
FOREACH(ignoreMe IN CASE WHEN line.`rel_type`='registered_address' THEN [1] ELSE [] END |
  MERGE (n1)-[:REGISTERED_ADDRESS]->(n2)
)
<Other FOREACH statements, one for each type of relationship> ...

That last idea works, but really slowly, even with indexes on node_id for each node type. It was creating about 25 relationships every 10 seconds, which wasn't going to work for ~400,000 relationships.

What I ended up doing was dumping the CSVs into a MySQL DB and, through a multi-join query, 'selecting' an individual CREATE statement for every relationship. I saved these to a file, installed APOC, granted permissions, and then ran the file using runFile. It is faster now (probably going to take 4-5 hours) but seems overly complicated. I'm hoping someone has a better way of doing it!

Ideas? :)

Michael Hunger

Nov 22, 2017, 4:21:27 PM
to ne...@googlegroups.com
I have an import script here: https://www.dropbox.com/s/6wz3bjee6s4oy4p/import-offshoreleaks-neo4j.sh?dl=0
and then run this in cypher-shell / neo4j-shell: https://www.dropbox.com/s/tglph6hxro78v13/configure.cql?dl=0

There will also be a Neo4j database release really soon.

Cheers, Michael


--
You received this message because you are subscribed to the Google Groups "Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email to neo4j+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Michael Hunger

Nov 22, 2017, 4:24:12 PM
to ne...@googlegroups.com
You need to have a label in your MATCH statements; without one, Cypher cannot use an index on node_id and has to scan all nodes.
You can use a generic "Node" label for this:

Add that label when you create the nodes.

CREATE CONSTRAINT ON (n:Node) ASSERT n.node_id IS UNIQUE;
...
MATCH (n1:Node {node_id: line.`node_1`}), (n2:Node {node_id: line.`node_2`})

Michael
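
For anyone reading along, here is a minimal sketch of how the suggestions in this thread might fit together. It is not the linked import script. It assumes that every node already carries the generic :Node label, that node_id was stored as a string (LOAD CSV reads every field as a string), that the edges file is reachable as file:///edges.csv (a hypothetical name), and that APOC is installed, as it was for the original poster. The column names node_1, node_2 and rel_type are taken from the first post. apoc.create.relationship is one way around the fact that plain Cypher does not accept a relationship type supplied as data.

```cypher
// One-time setup: the uniqueness constraint also creates the index
// that the MATCH clauses below rely on.
CREATE CONSTRAINT ON (n:Node) ASSERT n.node_id IS UNIQUE;

// Load the edges in batches so the transaction state stays small.
// 'file:///edges.csv' is a placeholder for wherever the edges CSV lives.
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///edges.csv' AS line
MATCH (n1:Node {node_id: line.node_1})
MATCH (n2:Node {node_id: line.node_2})
// APOC procedure: the relationship type comes from the CSV column as-is
// (for example 'registered_address'); {} means no relationship properties.
CALL apoc.create.relationship(n1, line.rel_type, {}, n2) YIELD rel
RETURN count(rel);
```

Note that this creates a relationship for every row, like CREATE rather than MERGE, so running it twice would duplicate the edges.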


Jon Forsyth

Nov 22, 2017, 4:50:59 PM
to Neo4j
Using the `neo4j-admin import` command on a CSV file you have already downloaded, rather than doing this over the web, will be your best bet. Run that command and it will print out its usage.
 
-Jon

leet.h...@gmail.com

Nov 23, 2017, 11:44:44 AM
to Neo4j
Thanks for this! I'm guessing you helped me out on Slack as well :) The scripts worked well, though I had to make a slight change to the sed commands for Linux.

There were also about 1100 'bad' log entries; they all seemed to come from nodes already existing, but 1100 out of the whole dataset seems OK.

Thanks again!

Michael Hunger

Nov 23, 2017, 3:47:56 PM
to ne...@googlegroups.com
Looking forward to your findings

Sent from my iPhone

greg bahde

Nov 24, 2017, 5:33:26 AM
to Neo4j
Hello,
Thanks for the scripts. I guess I have to change the sed command too, since it doesn't work for me.
What did you change? I can't seem to make it work.

Regards

Michael Hunger

Nov 24, 2017, 6:10:23 AM
to ne...@googlegroups.com
What is your error message?


Kevin Burton

Jan 25, 2018, 6:55:01 AM
to Neo4j
Is the Neo4j database release available yet?

Michael Hunger

Jan 26, 2018, 5:47:56 AM
to ne...@googlegroups.com



Sent from my iPhone

Kevin Burton

Jan 26, 2018, 8:49:51 AM
to ne...@googlegroups.com
The link doesn’t seem to work.
