Import Data From Oracle to Neo4J


Jesse Liu

Feb 3, 2015, 3:18:39 AM2/3/15
to ne...@googlegroups.com
Hi, All,

I'm a beginner with the graph database Neo4j.
Now I need to import data from Oracle into Neo4j.

First, I'll describe my application scenario.

I have just one Oracle table with more than 100 million rows.
The table description is:
id1 varchar, id2 varchar, relation_property int.

id1 and id2 together form the primary key.

The oracle server and Neo4J server are set up on the same machine.

Now, how can I create a node for each id and one directed relationship between id1 and id2 for each row?

As far as I know, there are three ways to do this:
1. Java REST/JDBC API
I've written a code demo and found it's too slow: 100,00 rows per minute.
Besides, it's not easy to establish a Java environment in …

2. Python embedded
I haven't written test code yet, but I don't think it would be better than Java.

3. Batch insert
Export the data from Oracle as CSV files;
Import the CSV data into Neo4j using Cypher.
I believe this is the fastest way to import the data. However, I don't know how to do it. Every demo I've seen on the Internet adds nodes, but none adds relationships with specific properties.
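The demos I've found stop at something roughly like this (the file path is just a placeholder), with nothing about relationships or their properties:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///nodes.csv' AS row
CREATE (:User {id: row.id});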

Has anybody here encountered such a scenario? Can you give me some advice, or is there a better way to import the data?

Thank you very much!

Jesse
Feb 3rd, 2015

Liliana Ziolek

Feb 3, 2015, 6:13:09 PM2/3/15
to ne...@googlegroups.com
Can you use the latest version of Neo4j (i.e. 2.2.0-M03)? It's not officially released yet, but if you don't need it to be production-ready, it should do.
Importing with it is a real doddle:
Basically, you prepare CSV files with headers - (at least) one file for nodes and one file for relationships. The nodes (even if they carry nothing but an id) have to be created for the relationships, otherwise you'll get errors. Then you run the files through a command-line tool which is distributed as part of the Neo4j release. Sorted. Oh, and it's SUPER fast too. :)
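For illustration only (the file names, the User label, the id and value properties and the LINKS_TO type below are placeholders), the two files and the command could look roughly like this:

nodes.csv:
id:ID,:LABEL
ABC,User
XYZ,User

relationships.csv:
:START_ID,:END_ID,value:int,:TYPE
ABC,XYZ,10,LINKS_TO

bin/neo4j-import --into graph.db --nodes nodes.csv --relationships relationships.csv --id-type string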

Hope this helps.

Michael Hunger

Feb 3, 2015, 6:43:45 PM2/3/15
to ne...@googlegroups.com
Hi Jesse,

there are some tips on the website, http://neo4j.com/developer/guide-import-csv/

Do you know how to create a CSV from your relational table?

I agree, the batch-importer makes most sense there.

based on the table

id1 varchar, id2 varchar, rel_property int

If you create a csv file for the nodes

select id1 as "id:ID", "User" as ":LABEL" from table
union
select id2 as "id:ID", "User" as ":LABEL" from table

and for the relationships a csv

select id1 as ":START_ID", id2 as ":END_ID", rel_property as "value:INT", "LINKS_TO" as ":TYPE" from table

and then use the new batch-importer that comes with neo4j 2.2

bin/neo4j-import --nodes nodes.csv --relationships relationships.csv --id-type string --into test.db


If you can't use it, I suggest something like my groovy script here:


Paul Shoemaker

Feb 5, 2015, 3:50:38 PM2/5/15
to ne...@googlegroups.com
For what it's worth, I ended up using embedded Java to write directly to the graph while the server is detached. This gave me the fastest performance, as I found the REST interface way too slow for large data sets (> 1M records).

I'm still not really happy with the performance, but I was able to achieve 20-25 atomic transactions per second, with each transaction creating 6 nodes (with indexes) and 6 relationships. On 5 of the nodes there was an indexed lookup step (Index object), as those nodes needed to be unique (they were location nodes - city, state, zip, etc.). For 1.4M nodes total, or approximately 1.3M Postgres DB records, the process took around 16 hours.

With the REST API, I saw approximately 30-90ms for each node creation, which would have taken roughly 24 hours on the low end and 36 hours on the high end to insert.

Does my performance seem consistent with reality or is there something obvious that I'm missing?

I'm going to run a test of something like 50 - 100 concurrent REST transactions against the server to see if I can speed that up.  I typically use the multiprocessing module in python or a rabbitmq exchange for such an operation.

It's unfortunate that the new import tool included with 2.2 can only write to a new graph db store.  Our use case is graph-assisted data analysis to a unified store (with logical separation of domains by a root node), so we need to take advantage of the additive nature of the graph when batch loading data.

Paul

Michael Hunger

Feb 6, 2015, 3:58:53 AM2/6/15
to ne...@googlegroups.com
It should be much much faster.

1. Use larger transactions (10k elements) to _batch_ your inserts.
2. 2.2 scales much better with concurrent/smaller transactions, e.g. I created 10M nodes in 40s with concurrent small transactions (2 nodes, 1 rel).

If you can share your code, we can have a look. Index lookups do hurt somewhat, true.
Also share your config (heap, mmio settings, etc.); best of all would be your graph.db/messages.log.
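A minimal sketch of what point 1 can look like with the 2.2 embedded API (store path, label, property names and data volumes below are placeholders, not anything specific to your setup):

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class BatchedInsert {
    private static final int BATCH_SIZE = 10000;       // commit every 10k creates

    public static void main(String[] args) {
        GraphDatabaseService db =
                new GraphDatabaseFactory().newEmbeddedDatabase("graph.db");
        Transaction tx = db.beginTx();
        try {
            for (int i = 0; i < 1000000; i++) {
                Node n = db.createNode(DynamicLabel.label("User"));
                n.setProperty("id", "user-" + i);       // placeholder data
                if ((i + 1) % BATCH_SIZE == 0) {
                    tx.success();                       // mark the batch as OK
                    tx.close();                         // commit it
                    tx = db.beginTx();                  // start the next batch
                }
            }
            tx.success();                               // commit the remainder
        } finally {
            tx.close();
            db.shutdown();
        }
    }
}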

Cheers, Michael

Paul Shoemaker

Feb 6, 2015, 10:20:22 AM2/6/15
to ne...@googlegroups.com
Hi Michael,

Thank you so much for taking valuable time to assist.  I have attached the relevant code and messages.log.

I will admit that this is not particularly elegant, but it is more POC than anything. You will see that I am using one transaction per loop iteration while walking through my DB result set. If that is a particularly expensive operation (I suspect it might be), I could relieve that pressure by batching 10k operations per transaction. Would I benefit from using the BatchInserter process? I am going to try this today and see if I get better results.

I would love to use the new batch importer, but unfortunately, as I understand it, it requires a clean graph and cannot append to one that already exists. For our needs, we will be performing large data imports into the same graph, and they will come at asynchronous times. Perhaps I could use the new concurrency in 2.2. Can you please point me to how you insert concurrently into the same graph store? My tests have all given me issues because once the store has been opened it is locked and cannot be accessed by another process. Or perhaps I should open the store once and thread out the transactional operations? I will try this today as well.

Thanks again!

Paul


messages.log
Graph.java

Paul Shoemaker

Feb 9, 2015, 5:43:12 PM2/9/15
to ne...@googlegroups.com
Hi Michael,

Just wanted to give a quick update: I have improved heavily on the code I submitted and achieved much higher throughput on a single thread. Thank you again for your suggestion to batch 10k operations per transaction. Now on to a multi-threaded approach :-)

Paul

Michael Hunger

Feb 9, 2015, 5:51:53 PM2/9/15
to ne...@googlegroups.com
Good luck, ping me when you've worked it out :)

Cheers, Michael

Jesse Liu

Feb 9, 2015, 10:39:43 PM2/9/15
to ne...@googlegroups.com
Hi, All,
 
Thanks for all the help!

 

I'm now going with Liliana Ziolek's suggestion: use the latest version of Neo4j and its import tool.
Following the import tool examples, I've also updated my scenario, as described below.
 
Actually, I want to build social network circles.
I have six months of data in the Oracle database, stored as six tables named TH_07, TH_08, TH_09, TH_10, TH_11 and TH_12 respectively.
Every table has the same schema:
id1 varchar, id2 varchar, relationship_property int, with {id1, id2} as the primary key.
P.S. Exactly the same {id1, id2} pair may appear in different tables, but with a different relationship_property, e.g. there is one and only one record {ABC, XYZ, 10} in TH_07, and one and only one record {ABC, XYZ, <int>} in each of the other tables such as TH_09.
Each table has about 80-90 million rows!
 
By the way, I set up the Neo4j database and Oracle on exactly the same machine, with 256GB RAM and a 64-core CPU.
 
I want to build a graph database in which each id1 and id2 is a node, and for every record (id1, id2, relation_property) in Oracle there is a relationship between id1 and id2 carrying relation_property.
 
The First Question:
According to http://neo4j.com/docs/2.2.0-M03/import-tool-examples.html, first of all I should export the nodes from Oracle into a CSV file.
I need unique nodes by id, so I have three choices:
1. use DISTINCT in Oracle, but I have six tables, so that's very hard;
2. use MERGE in Cypher, but it's too slow! I cannot stand the low efficiency;
3. use Python to connect to Oracle and preprocess the data there (since I have 256GB RAM, it's possible to process data this big).
Is it possible to import 7.5 billion nodes at once from CSV files?
 
The Second Question:
How can I update the relationship_property? For example, I have {ABC, XYZ, 10} in table TH_07 and {ABC, XYZ, 20} in table TH_08, so for simplicity I would like the relationship between ABC and XYZ to have the value 10 + 20 = 30.
1. Should I also process this in Python?
2. Or can I do it in Cypher?
 
The Third Question:
I've tried LOAD CSV in the Neo4j 2.1.6 community version.
The Cypher is exactly as shown below:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'FILEPATH' AS row
CREATE (n:User {id: row.id1})
CREATE (m:User {id: row.id2})

However, during processing I encountered an error along the lines of "Kernel error, please restart or recover" (sorry, I did not record the exact message).
 
The Last Question:
How should I configure the Neo4j server? As you know, I have 7.5 billion nodes and about 100 billion relationships. After importing the data, I need to run computations such as degree centrality, betweenness centrality, closeness centrality and the like.
How can I use my machine efficiently?
 
Thank you!
 
Yours, Jesse

 
 
 
 

On Tuesday, February 3, 2015 at 4:18:39 PM UTC+8, Jesse Liu wrote:

Chris Vest

Feb 10, 2015, 4:14:55 AM2/10/15
to ne...@googlegroups.com
That’s a nice big dataset.

I would output the tables to CSV files in sorted order by {id1, id2}. Then I'd preprocess them with Python and generate CSV files that the import tool can eat. I'd probably keep the files compressed on disk at all times – the import tool in 2.2-M03 can eat .csv.zip and .csv.gz files just fine. A big portion of this import is going to be IO bound, so compression will speed that up. You have plenty of CPU cores anyway.

But before that, because I know the 6 table files are sorted by {id1, id2}, I can process them in parallel in a merge-sort-like way: I grab the input line with the least key (assuming ascending order) to process, and if more than one file is currently at that key, I merge the inputs as you described.
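A rough Python sketch of that merge step (illustrative only; the file names, the column order and the assumption of header-less, pre-sorted Oracle exports are all placeholders):

import csv
import heapq
from itertools import groupby

# The six monthly exports, assumed to be plain CSV with the columns
# id1,id2,relation_property, no header row, sorted by (id1, id2).
MONTH_FILES = ["TH_%02d.csv" % m for m in range(7, 13)]  # TH_07.csv .. TH_12.csv

def rows(path):
    """Yield ((id1, id2), value) pairs from one sorted export."""
    with open(path, newline="") as f:
        for id1, id2, value in csv.reader(f):
            yield (id1, id2), int(value)

def merged_relationships():
    """Merge all files by key and sum the values of identical (id1, id2) pairs."""
    streams = [rows(p) for p in MONTH_FILES]
    for key, group in groupby(heapq.merge(*streams), key=lambda kv: kv[0]):
        yield key[0], key[1], sum(v for _, v in group)

# Write one relationship file in the header format the import tool expects.
with open("relationships.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow([":START_ID", ":END_ID", "value:INT", ":TYPE"])
    for id1, id2, total in merged_relationships():
        writer.writerow([id1, id2, total, "LINKS_TO"])

Because everything is streamed through generators, the six 80-90 million row files never need to fit in memory at once.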

Another trick I'd try pulling, since we are pre-processing the data anyway, is to assign the generated node and relationship records unique integer IDs. This would allow you to tell import-tool to use --id-type ACTUAL, which means it won't have to spend time maintaining a mapping between internal record IDs and the varchar IDs in your dataset. This will speed up the import.

If for every {ABC, XYZ, ?} record there's also a {XYZ, ABC, ?} record, then the node IDs will be easy to generate, as they can just be a count of how many different id1 values you've seen. If that assumption does not hold, however, then you also need to look at all the id2 values to generate node records, which is annoying since they don't come in sorted order. That in turn means you need to somehow filter out values you've already seen (or otherwise deterministically compute a distinct integer from the value). But all this only matters if you want to use --id-type ACTUAL; I don't know if that's possible for you. Otherwise the import will just take a bit longer.

One other thing: you can’t put more than 2 billion nodes in a schema index, so using LOAD CSV with MERGE won’t work for a dataset this big. This is a limitation of the version of Lucene that we use. (I don’t know if newer versions of Lucene lift this limit, but we plan on addressing it in the future regardless.)

--
Chris Vest
System Engineer, Neo Technology
[ skype: mr.chrisvest, twitter: chvest ]


Maryam Gerami

Feb 27, 2016, 7:43:50 PM2/27/16
to Neo4j
Hi

I have a medium-sized dataset with 100,000 rows, and I use this statement to import data from a CSV file into the graph database:

LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
MERGE (c:Country {Name:row.Country})
MERGE (a:Actor {Name: row.ActorName, Aliases: row.Aliases, Type: row.ActorType})
MERGE (o:Organization {Name: row.AffiliationTo})
MERGE (a)-[:AFFILIATED_TO {Start: row.AffiliationStartDate, End: row.AffiliationEndDate}]->(o)
MERGE(c)<-[:IS_FROM]-(a);

My PC has 8GB RAM

Is it normal for this to take more than 6 hours to create 42,000 nodes and 100,000 relationships? If not, could you please help me figure out how to fix this and import the data from the CSV file faster?

Thanks in advance
--
Maryam Gerami
R&D department, Informatics services corporation

Michael Hunger

Feb 27, 2016, 7:58:31 PM2/27/16
to ne...@googlegroups.com
Hey, 

It should just take a few seconds.

I presume:

you use Neo4j 2.3.2 ?
you created indexes / constraints for the things you merge on ?
you configured your neo4j instance to run with at least 4G of heap?
you are using PERIODIC COMMIT ?

I suggest that you run a profile on your statement to see where the biggest issues show up.
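(Profiling here just means running the statement with a PROFILE prefix and checking the db hits reported per operator; the statement itself is unchanged from the one above:)

PROFILE
LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
MERGE (c:Country {Name:row.Country})
MERGE (a:Actor {Name: row.ActorName, Aliases: row.Aliases, Type: row.ActorType})
MERGE (o:Organization {Name: row.AffiliationTo})
MERGE (a)-[:AFFILIATED_TO {Start: row.AffiliationStartDate, End: row.AffiliationEndDate}]->(o)
MERGE (c)<-[:IS_FROM]-(a);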

In any case, I'd strongly recommend splitting it up.

e.g. like this:

CREATE CONSTRAINT ON (c:Country) ASSERT c.Name IS UNIQUE;
CREATE CONSTRAINT ON (o:Organization) ASSERT o.Name IS UNIQUE;
CREATE CONSTRAINT ON (a:Actor) ASSERT a.Name IS UNIQUE;


LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
WITH distinct row.Country as Country
MERGE (c:Country {Name:Country});

LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
WITH distinct row.AffiliationTo as AffiliationTo
MERGE (o:Organization {Name: AffiliationTo});

LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
MERGE (a:Actor {Name: row.ActorName}) ON CREATE SET a.Aliases=row.Aliases, a.Type=row.ActorType;

LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
WITH distinct row.Country as Country, row.ActorName as ActorName
MATCH (c:Country {Name:Country})
MATCH (a:Actor {Name:ActorName})
MERGE(c)<-[:IS_FROM]-(a);

LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
MATCH (o:Organization {Name: row.AffiliationTo})
MATCH (a:Actor {Name: row.ActorName})
MERGE (a)-[r:AFFILIATED_TO]->(o) 
  ON CREATE SET r.Start=row.AffiliationStartDate, r.End=row.AffiliationEndDate;