Batch Insertion with Neo4j

Maaz

Dec 11, 2012, 11:27:48 AM

to ne...@googlegroups.com

I am importing 2.3 billion relationships from a table. The import is not very fast: I am getting a speed of 5 million per hour, so the migration will take about 20 days to complete. I have heard about the Neo4j batch inserter and the batch-import utility. The utility does interesting stuff by importing from a CSV file, but the latest code is somehow broken and does not run.

I already have about 100M relationships in Neo4j, and I have to check all of them so that no duplicate relationship is created.

How can I speed things up in Neo4j?

My current code looks like this:

begin transaction
  for each of 50K relationships
    create or get the user node for user A
    create or get the user node for user B
    check whether a KNOW relationship exists from A to B; if not, create the relationship
end transaction

I have also read the following:

http://docs.neo4j.org/chunked/milestone/batchinsert.html
http://stackoverflow.com/questions/13686850/how-to-speed-up-insertion-in-neo4j-from-mysql

Max De Marzi Jr.

Dec 11, 2012, 5:08:23 PM

to ne...@googlegroups.com

Change the POM to use <neo4j.version>1.8</neo4j.version> and it should work.

Michael Hunger

Dec 11, 2012, 6:08:52 PM

to ne...@googlegroups.com

Hi Maaz

You should really use the batch-inserter for this kind of initial data import.

I pushed a fix for the pom of the batch-inserter. There were API changes in Neo4j-SNAPSHOT which caused it to fail.

It should run now for you. There is also a parallel version which is much faster, see the readme. I used it to import several billion (5BN) relationships into neo4j.

Michael

On 11.12.2012 at 17:27, Maaz wrote:

> I am importing 2.3 Billion relationship from a table, The import is not very fast getting a speed on 5Million per hour that will take 20 days to complete the migration. I have heard about the neo4j batch insert and and batch insert utility. The utility do interesting stuff by importing from a csv file but the latest code is some how broken and not running.
> I have about 100M relations in neo4j and I have to all check that there shall be no duplicate relationship.

What do you mean by "I have to all check that there shall be no duplicate relationship"? What is your actual use case?

Thanks

Michael

--

Maaz

Dec 12, 2012, 2:30:06 AM

to ne...@googlegroups.com

Thanks Michael, it works, but how can I use it if I am running a fresh install and have 100M relationships?

Michael Hunger

Dec 12, 2012, 2:40:30 AM

to ne...@googlegroups.com

What do you mean? The parallel importer?

It needs a few more parameters and a prerequisite:

rels.csv has to be pre-sorted by min(start,end)

data/dir nodes.csv relationships.csv #nodes #max-props-per-node #usual-rels-pernode #max-rels-per-node #max-props-per-rel rel,types

e.g.

data/graph.db nodes.csv relationships.csv 500000000 2 10 50 1 FOO,BAR

You can run it directly from Maven like this:

MEMORY_OPTS="-Xmx50G -Xms50G -server -d64 -Xmn3g -XX:SurvivorRatio=2"

GC_OPTS="-XX:+UseConcMarkSweepGC -XX:+UseParNewGC"

export MAVEN_OPTS="$MEMORY_OPTS $GC_OPTS"

mvn clean compile exec:java -Dexec.mainClass=org.neo4j.batchimport.ParallelImporter -Dexec.args="/mnt/foo.db /mnt/foo-data/nodes.csv /mnt/foo-data/rels.csv 226994686 2 20 50 1 FOO,BAR"

If you have two csv files with nodes and rels that contain that many lines you will get 100M relationships.
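
The pre-sorting prerequisite mentioned above (rels.csv ordered by min(start,end)) can be illustrated with a minimal Python sketch. It assumes a tab-separated file with a header line and numeric ids in the first two columns; the function names are made up for illustration and are not part of the importer. It sorts in memory, so for billions of rows an external sort would be needed instead.

```python
import csv
import io


def sort_key(row):
    """Sort key: the smaller of the two node ids in a (start, end, type, ...) row."""
    return min(int(row[0]), int(row[1]))


def presort_rels(src, dst):
    """Copy a tab-separated rels file from src to dst, sorted by min(start, end)."""
    reader = csv.reader(src, delimiter="\t")
    header = next(reader)  # keep the header line on top
    rows = sorted((row for row in reader if row), key=sort_key)
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


if __name__ == "__main__":
    # Tiny in-memory sample instead of a real rels.csv
    sample = "start\tend\ttype\n7\t3\tFOO\n1\t9\tBAR\n5\t2\tFOO\n"
    out = io.StringIO()
    presort_rels(io.StringIO(sample), out)
    print(out.getvalue())
```

With real files, the two arguments would be the opened input and output files rather than `io.StringIO` objects.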

Michael

--

Maaz Bin Tariq

Dec 12, 2012, 3:10:07 AM

to ne...@googlegroups.com

Sorry about my last message.
I am using Neo4j and currently have 1M indexed nodes with 2 properties each and 100M relationships with no properties.
I want to import 300M nodes, which should also be indexed, and 2.3B relationships.

I do not want to duplicate the 1M nodes and the 100M relationships that I already have.

Node Structure
id: system generated
Properties
Node_type: abc
Node_id: 1771

Index: key is node_type, value is node_id

How can I use the import utility to build all this?

Thanks

--

Michael Hunger

Dec 12, 2012, 3:32:31 AM

to ne...@googlegroups.com

Hi,

Right now that is not possible out of the box.

The batch importer assumes that it is creating a new database.

I would probably create the new database from the new data and then re-add the 1M nodes and 100M rels after the fact.

Perhaps you can also just export them as csv and append them to the nodes.csv and rels.csv
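
A rough Python sketch of the "export and append" idea, with the duplicate check done before the import rather than inside Neo4j. It assumes each node row is identified by its first two columns (Node_type, Node_id, as in the node structure described earlier in the thread); `merge_nodes` is a hypothetical helper, not part of the batch importer. The set of seen keys lives in memory, which may not be practical for 300M nodes without more care.

```python
from itertools import chain


def merge_nodes(existing_rows, new_rows):
    """Return rows from both sources, keeping the first row per (type, id) key."""
    seen = set()
    merged = []
    for row in chain(existing_rows, new_rows):
        key = (row[0], row[1])  # (Node_type, Node_id)
        if key not in seen:
            seen.add(key)
            merged.append(row)
    return merged


if __name__ == "__main__":
    existing = [("abc", "1771")]                 # exported from the current database
    incoming = [("abc", "1771"), ("abc", "2347")]  # rows from the new data
    for node_type, node_id in merge_nodes(existing, incoming):
        print(f"{node_type}\t{node_id}")
```

Because existing rows come first, their relative order is preserved in the merged file.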

You need 3 files for the import:

nodes.csv, single tab separated

Node_type Node_id

abc 1771

...

rels.csv, single tab separated

start end type

1 2 FOO

....

nodes_index.csv

node type1 type2 type3

1 1771

2 2347

3 347839

...

see: https://github.com/jexp/batch-import#file-format

and: https://github.com/jexp/batch-import#indexing

java -server -Xmx30G -jar ../batch-import/target/batch-import-jar-with-dependencies.jar neo4j/data/graph.db nodes.csv rels.csv node_index index exact nodes_index.csv

Make sure to have enough memory for the heap and configure the mmio settings in batch.properties to take almost all of it, e.g.

dump_configuration=false
cache_type=none
use_memory_mapped_buffers=true
neostore.propertystore.db.index.keys.mapped_memory=5M
neostore.propertystore.db.index.mapped_memory=5M
neostore.nodestore.db.mapped_memory=5G
neostore.relationshipstore.db.mapped_memory=20G
neostore.propertystore.db.mapped_memory=5G
neostore.propertystore.db.strings.mapped_memory=100M
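
To see how much memory the settings above add up to, here is a small Python sketch that sums every `*.mapped_memory` entry. It assumes the K/M/G suffixes mean powers of 1024, which is an assumption about how the values are interpreted; `parse_size` and `total_mapped_memory` are illustrative names only.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

BATCH_PROPERTIES = """\
dump_configuration=false
cache_type=none
use_memory_mapped_buffers=true
neostore.propertystore.db.index.keys.mapped_memory=5M
neostore.propertystore.db.index.mapped_memory=5M
neostore.nodestore.db.mapped_memory=5G
neostore.relationshipstore.db.mapped_memory=20G
neostore.propertystore.db.mapped_memory=5G
neostore.propertystore.db.strings.mapped_memory=100M
"""


def parse_size(text):
    """Convert a value such as '5M' or '20G' to bytes (suffix = power of 1024, assumed)."""
    text = text.strip()
    unit = text[-1].upper()
    if unit in UNITS:
        return int(text[:-1]) * UNITS[unit]
    return int(text)


def total_mapped_memory(properties_text):
    """Sum all settings whose key ends in '.mapped_memory', in bytes."""
    total = 0
    for line in properties_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().endswith(".mapped_memory"):
            total += parse_size(value)
    return total


if __name__ == "__main__":
    total = total_mapped_memory(BATCH_PROPERTIES)
    print(f"total mapped memory: {total / 1024 ** 3:.2f} GiB")
```

The total can then be compared against the memory actually available on the import machine.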

HTH

What is your use-case?

Michael

--

Maaz Bin Tariq

Dec 12, 2012, 5:55:28 AM

to ne...@googlegroups.com

I have 2 types of nodes, people and places. After giving it a try, I think the utility is also not suitable for building my relationships because of an id problem: it depends on the graph node id, whereas I currently look up the graph node id in the index using node_type and node_mysql_id.

How do I make a relationship between a person and a place in relationships.csv if I am not sure about the node id?

nodes.csv

Node_type Node_mysql_id

people 1

place 1

Thanks
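
One possible way around the node id question, sketched in Python under a loudly stated assumption: that the importer numbers nodes by their row order in nodes.csv, starting at `first_id` (the examples earlier in the thread use ids 1, 2, 3, but the actual numbering should be checked against the batch-import README). The helper names and the `VISITED` relationship type are hypothetical. The sketch also drops repeated relationships, so no duplicates reach the import.

```python
def assign_ids(node_keys, first_id=1):
    """Map each (node_type, mysql_id) key to its assumed node id (row order)."""
    return {key: first_id + offset for offset, key in enumerate(node_keys)}


def build_rels(pairs, ids, rel_type):
    """Translate key pairs into (start, end, type) rows, skipping repeats."""
    seen = set()
    rows = []
    for start_key, end_key in pairs:
        row = (ids[start_key], ids[end_key], rel_type)
        if row not in seen:
            seen.add(row)
            rows.append(row)
    return rows


if __name__ == "__main__":
    # Same order as the rows written to nodes.csv
    nodes = [("people", "1"), ("place", "1")]
    ids = assign_ids(nodes)
    # Source pairs as they might come out of MySQL, including a repeat
    pairs = [
        (("people", "1"), ("place", "1")),
        (("people", "1"), ("place", "1")),
    ]
    for start, end, rel_type in build_rels(pairs, ids, "VISITED"):
        print(f"{start}\t{end}\t{rel_type}")
```

The key point is that the lookup table is built from the same row order that is written to nodes.csv, so the Neo4j index is not needed while generating the relationships file.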

--
