Creating graph with Neo4j graph database takes too long


Sotiris Beis

Jan 30, 2014, 6:04:35 AM1/30/14
to ne...@googlegroups.com
I use the following code to create a graph with Neo4j Graph Database:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.index.IndexHits;
import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.index.lucene.unsafe.batchinsert.LuceneBatchInserterIndexProvider;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserterIndex;
import org.neo4j.unsafe.batchinsert.BatchInserterIndexProvider;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class Neo4jMassiveInsertion implements Insertion {

    private BatchInserter inserter = null;
    private BatchInserterIndexProvider indexProvider = null;
    private BatchInserterIndex nodes = null;

    private static enum RelTypes implements RelationshipType {
        SIMILAR
    }

    public static void main(String[] args) {
        Neo4jMassiveInsertion test = new Neo4jMassiveInsertion();
        test.startup("data/neo4j");
        test.createGraph("data/youtubeEdges.txt");
        test.shutdown();
    }

    /**
     * Start the Neo4j database and configure it for massive insertion.
     * @param neo4jDBDir
     */
    public void startup(String neo4jDBDir) {
        System.out.println("The Neo4j database is now starting . . . .");
        Map<String, String> config = new HashMap<String, String>();
        config.put("cache_type", "none");
        config.put("use_memory_mapped_buffers", "true");
        config.put("neostore.nodestore.db.mapped_memory", "200M");
        config.put("neostore.relationshipstore.db.mapped_memory", "1000M");
        config.put("neostore.propertystore.db.mapped_memory", "250M");
        config.put("neostore.propertystore.db.strings.mapped_memory", "250M");
        inserter = BatchInserters.inserter(neo4jDBDir, config);
        indexProvider = new LuceneBatchInserterIndexProvider(inserter);
        nodes = indexProvider.nodeIndex("nodes", MapUtil.stringMap("type", "exact"));
    }

    public void shutdown() {
        System.out.println("The Neo4j database is now shutting down . . . .");
        if (inserter != null) {
            indexProvider.shutdown();
            inserter.shutdown();
            indexProvider = null;
            inserter = null;
        }
    }

    public void createGraph(String datasetDir) {
        System.out.println("Creating the Neo4j database . . . .");
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetDir)));
            String line;
            int lineCounter = 1;
            Map<String, Object> properties;
            IndexHits<Long> cache;
            long srcNode, dstNode;
            while ((line = reader.readLine()) != null) {
                if (lineCounter > 4) {
                    String[] parts = line.split("\t");
                    cache = nodes.get("nodeId", parts[0]);
                    if (cache.hasNext()) {
                        srcNode = cache.next();
                    } else {
                        properties = MapUtil.map("nodeId", parts[0]);
                        srcNode = inserter.createNode(properties);
                        nodes.add(srcNode, properties);
                        nodes.flush();
                    }
                    cache = nodes.get("nodeId", parts[1]);
                    if (cache.hasNext()) {
                        dstNode = cache.next();
                    } else {
                        properties = MapUtil.map("nodeId", parts[1]);
                        dstNode = inserter.createNode(properties);
                        nodes.add(dstNode, properties);
                        nodes.flush();
                    }
                    inserter.createRelationship(srcNode, dstNode, RelTypes.SIMILAR, null);
                }
                lineCounter++;
            }
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Compared with other graph database technologies (Titan, OrientDB), it needs far too much time, so maybe I am doing something wrong. Is there a way to speed up the procedure? I use Neo4j 1.9.5, and my machine has a 2.3 GHz CPU (i5), 4 GB of RAM, and a 320 GB disk, running Mac OS X Mavericks (10.9). My heap size is 2 GB.

I have tried replacing the Lucene index with a List or a HashMap to look up my nodes, but this made things much worse. To be more specific, I used the following test code:
public void createGraph(String datasetDir) {
    System.out.println("Creating the Neo4j database . . . .");
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetDir)));
        String line;
        int lineCounter = 1;
        Map<String, Object> properties;
        List<Long> index = new ArrayList<Long>();
        long srcNode, dstNode;
        while ((line = reader.readLine()) != null) {
            if (lineCounter > 4) {
                String[] parts = line.split("\t");

                if (index.contains(Long.valueOf(parts[0]))) {
                    srcNode = Long.valueOf(parts[0]);
                } else {
                    properties = MapUtil.map("nodeId", parts[0]);
                    srcNode = inserter.createNode(properties);
                    //nodes.add(srcNode, properties);
                    index.add(srcNode);
                }

                if (index.contains(Long.valueOf(parts[1]))) {
                    dstNode = Long.valueOf(parts[1]);
                } else {
                    properties = MapUtil.map("nodeId", parts[1]);
                    dstNode = inserter.createNode(properties);
                    //nodes.add(dstNode, properties);
                    index.add(dstNode);
                }

                inserter.createRelationship(srcNode, dstNode, RelTypes.SIMILAR, null);
            }
            lineCounter++;
        }
        reader.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

But it needs twice the time.

Thanks in advance,

Sotiris


Michael Hunger

Jan 30, 2014, 6:14:41 AM1/30/14
to ne...@googlegroups.com
Hi Sotiris,

Any chance to share your input file?

I don't understand your index usage. ArrayList has O(n) lookup complexity; why would you do that?
Also, I think you'll never find anything in your index if your database is not empty initially and your data set input is totally ordered by min(start, end).

I would also recommend renaming nodeId to "id", "youtubeId", or similar, to reduce confusion.

I recommend:

Map<String, Long> cache = new HashMap<>();

private long getOrCreateNode(String value) {
    Long id = cache.get(value);
    if (id == null) {
        Map props = map("nodeId", value);
        id = inserter.createNode(props);
        index.add(id, props);
        cache.put(value, id);
    }
    return id;
}

You can still add your nodeId to the Lucene index for later lookup mechanisms, but just don't flush on every insert, and don't read from Lucene during a high-performance insert.
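The pattern Michael describes can be sketched as a self-contained snippet. Here the BatchInserter call is stubbed out with a counter so the sketch runs without Neo4j, and the class and field names (NodeCache, nextNodeId) are illustrative, not from the thread:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the HashMap-cache pattern: keep the nodeId -> internal-node-id
// mapping in memory (O(1) expected lookup) and never read back from Lucene
// during the insert. The real inserter.createNode(props) call is replaced
// by a counter so this runs standalone.
public class NodeCache {

    private final Map<String, Long> cache = new HashMap<String, Long>();
    private long nextNodeId = 0; // stand-in for inserter.createNode(props)

    public long getOrCreateNode(String value) {
        Long id = cache.get(value);
        if (id == null) {
            // In the real code: id = inserter.createNode(map("nodeId", value));
            // and optionally nodes.add(id, props) for later lookups -- without flush().
            id = nextNodeId++;
            cache.put(value, id);
        }
        return id;
    }
}
```

Compared with the ArrayList.contains approach, each lookup drops from O(n) to O(1) expected time, which is where most of the speedup comes from.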

Michael

--
You received this message because you are subscribed to the Google Groups "Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email to neo4j+un...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Sotiris Beis

Jan 30, 2014, 7:57:11 AM1/30/14
to ne...@googlegroups.com
Great Michael,

Following your suggestions, I am now able to load my data in 7 sec.

Thanks,
Sotiris

Michael Hunger

Jan 30, 2014, 8:03:09 AM1/30/14
to ne...@googlegroups.com
How big is the file?
How does this compare to the other dbs?

Michael

Sotiris Beis

Jan 30, 2014, 8:14:27 AM1/30/14
to ne...@googlegroups.com
This is my test dataset http://snap.stanford.edu/data/amazon0601.html
Titan needs 26 sec and OrientDB needs over 2 minutes. Of course, I still have to look into whether there are more performance tunings I should make. The results will probably be published, so if you are interested I can share the paper when it is published.

Sotiris

Sotiris Beis

Feb 3, 2014, 9:40:01 AM2/3/14
to ne...@googlegroups.com
In addition to my previous question: are there any suggestions to improve the performance of the following code? I use it to simulate the time needed to create a graph when it is built by single insertions (incrementally, not in batch). So I commit every transaction and measure the time each block needs to be inserted, where a block consists of 1,000 nodes and their edges. Here is the code:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.graphdb.factory.GraphDatabaseSetting;
import org.neo4j.graphdb.index.Index;

public class Neo4jSingleInsertion implements Insertion {

    public static String INSERTION_TIMES_OUTPUT_PATH = "data/neo4j.insertion.times";

    private static int count;

    private GraphDatabaseService neo4jGraph = null;
    private Index<Node> nodeIndex = null;

    private static enum RelTypes implements RelationshipType {
        SIMILAR
    }

    public static void main(String[] args) {
        Neo4jSingleInsertion test = new Neo4jSingleInsertion();
        test.startup("data/neo4j");
        test.createGraph("data/enronEdges.txt");
        test.shutdown();
    }

    public void startup(String neo4jDBDir) {
        System.out.println("The Neo4j database is now starting . . . .");
        neo4jGraph = new GraphDatabaseFactory().newEmbeddedDatabase(neo4jDBDir);
        nodeIndex = neo4jGraph.index().forNodes("nodes");
    }

    public void shutdown() {
        System.out.println("The Neo4j database is now shutting down . . . .");
        if (neo4jGraph != null) {
            neo4jGraph.shutdown();
            nodeIndex = null;
        }
    }

    public void createGraph(String datasetDir) {
        count++;
        System.out.println("Incrementally creating the Neo4j database . . . .");
        List<Double> insertionTimes = new ArrayList<Double>();

        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetDir)));
            String line;

            int nodesCounter = 0;
            int lineCounter = 1;
            Transaction tx = null;
            long start = System.currentTimeMillis();
            long duration;

            while ((line = reader.readLine()) != null) {
                if (lineCounter > 4) {
                    String[] parts = line.split("\t");

                    Node srcNode = nodeIndex.get("nodeId", parts[0]).getSingle();
                    if (srcNode == null) {
                        tx = neo4jGraph.beginTx();
                        srcNode = neo4jGraph.createNode();
                        srcNode.setProperty("nodeId", parts[0]);
                        nodeIndex.add(srcNode, "nodeId", parts[0]);
                        tx.success();
                        tx.finish();
                        nodesCounter++;
                    }

                    if (nodesCounter == 1000) {
                        duration = System.currentTimeMillis() - start;
                        insertionTimes.add((double) duration);
                        nodesCounter = 0;
                        start = System.currentTimeMillis();
                    }

                    Node dstNode = nodeIndex.get("nodeId", parts[1]).getSingle();
                    if (dstNode == null) {
                        tx = neo4jGraph.beginTx();
                        dstNode = neo4jGraph.createNode();
                        dstNode.setProperty("nodeId", parts[1]);
                        nodeIndex.add(dstNode, "nodeId", parts[1]);
                        tx.success();
                        tx.finish();
                        nodesCounter++;
                    }

                    tx = neo4jGraph.beginTx();
                    srcNode.createRelationshipTo(dstNode, RelTypes.SIMILAR);
                    tx.success();
                    tx.finish();

                    if (nodesCounter == 1000) {
                        duration = System.currentTimeMillis() - start;
                        insertionTimes.add((double) duration);
                        nodesCounter = 0;
                        start = System.currentTimeMillis();
                    }
                }
                lineCounter++;
            }
            duration = System.currentTimeMillis() - start;
            insertionTimes.add((double) duration);

            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }

        Utils utils = new Utils();
        utils.writeTimes(insertionTimes, Neo4jSingleInsertion.INSERTION_TIMES_OUTPUT_PATH + "." + count);
    }
}

Thanks,
Sotiris

Michael Hunger

Feb 3, 2014, 11:41:18 AM2/3/14
to ne...@googlegroups.com
Why do you do individual inserts when you have blocks of data?

You can often aggregate events on the application level to be inserted as a bigger batch.

Otherwise you can also release the force-write-to-log constraint and use

Transaction tx = ((GraphDatabaseAPI)neo4jGraph).tx().unforced().begin();

Instead of neo4jGraph.beginTx();


Sotiris Beis

Feb 4, 2014, 3:16:11 AM2/4/14
to ne...@googlegroups.com
I don't have blocks of data; I measure the insertion time of 1,000 nodes and their edges (which I call a block). I am doing that because I want to simulate the creation of a graph by single-element insertion.

This
Transaction tx = ((GraphDatabaseAPI)neo4jGraph).tx().unforced().begin();
does the job, but the tx() function says it's deprecated. Is this going to be a problem?

Sotiris

Michael Hunger

Feb 4, 2014, 3:38:13 AM2/4/14
to ne...@googlegroups.com
No, it just indicates that the API might change in the future.

Sotiris Beis

Feb 4, 2014, 3:39:08 AM2/4/14
to ne...@googlegroups.com
Thank you Michael.

Debajyoti Roy

Feb 6, 2014, 2:06:09 PM2/6/14
to ne...@googlegroups.com
((GraphDatabaseAPI)neo4jGraph).tx().unforced().begin() is awesome but are there any down sides to doing this?

Michael Hunger

Feb 6, 2014, 5:09:46 PM2/6/14
to ne...@googlegroups.com
As the tx log is not forced to disk, you might lose a few seconds of inserted data when it crashes.

But the data on disk is still consistent.

Sent from mobile device

Debajyoti Roy

Feb 6, 2014, 5:12:03 PM2/6/14
to ne...@googlegroups.com
Thanks Michael, that makes it crystal clear (i am totally going for it :) )

Sotiris Beis

Feb 24, 2014, 9:47:19 AM2/24/14
to ne...@googlegroups.com
If I set keep_logical_logs to false, is that the same as Transaction tx = ((GraphDatabaseAPI)neo4jGraph).tx().unforced().begin(); ?

Michael Hunger

Feb 24, 2014, 9:51:08 AM2/24/14
to ne...@googlegroups.com
Just DON'T use unforced(); it is not suited or intended for live usage anyway.

Create sensible batches of operations that you execute at once, that's the best solution for now.
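The batching Michael suggests can be sketched as follows. Transaction handling is stubbed with a commit counter so the sketch runs without Neo4j; the names (BatchCommitter, BATCH_SIZE) are illustrative, not from the thread:

```java
// Sketch of grouping insert operations into transactions of BATCH_SIZE,
// instead of opening and committing one transaction per node/relationship.
// Commits are counted rather than performed, so this runs standalone.
public class BatchCommitter {

    static final int BATCH_SIZE = 1000;
    private int opsInTx = 0;
    private int commits = 0;

    public void applyOperation() {
        // In the real code: create a node or relationship inside the open tx.
        opsInTx++;
        if (opsInTx == BATCH_SIZE) {
            commit();
        }
    }

    public void finish() {
        if (opsInTx > 0) commit(); // commit the trailing partial batch
    }

    private void commit() {
        // In the real code: tx.success(); tx.finish(); tx = graph.beginTx();
        commits++;
        opsInTx = 0;
    }

    public int getCommits() { return commits; }
}
```

The point is that the per-transaction overhead (including the forced log flush) is paid once per thousand operations instead of once per operation.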

Michael

Sotiris Beis

Feb 24, 2014, 9:53:44 AM2/24/14
to ne...@googlegroups.com
I don't want to insert my data in batch mode, because of the needs of the experiment I want to conduct, as I have explained.

Javad Karabi

Feb 24, 2014, 5:08:26 PM2/24/14
to ne...@googlegroups.com
 Sotiris Beis, check out my project:
github.com/karabijavad/cadet
an example is at:

so, for example, with cadet:

db = Cadet::BatchInserter::Session.open("neo4j-community-2.0.1/data/graph.db")

db.constraint :Legislator, :name

l = db.get_node(:Legislator, :thomas_id, leg["id"]["thomas"].to_i)
gender = db.get_node(:Gender, :name, leg["bio"]["gender"])
l.outgoing(:gender) << gender

db.close

i implement an index in ruby, so you can still find nodes based on label/key/value.

personally, i get about 2k rows a second for importing my csv data, where each row can then create up to 10 other nodes and 10 other rels.

hope this helps

Michael Hunger

Feb 24, 2014, 5:42:08 PM2/24/14
to ne...@googlegroups.com
With the batch inserter you should get up to 1M nodes per second and, depending on your memory-mapping (mmio) settings for the rel file, between 100k and 500k rels/second.

That is, without doing index lookups.

Michael

Javad Karabi

Feb 24, 2014, 5:45:20 PM2/24/14
to ne...@googlegroups.com
ah wow, so looks like i probably have a lot more optimization i can still squeeze out of neo4j.
michael, i am importing on my laptop, which has a ton of resources available. 
can you suggest some neo4j configuration settings which do not care about other services running on the system, but can be used to give as much resources as possible to neo4j? thanks

Michael Hunger

Feb 24, 2014, 5:52:18 PM2/24/14
to ne...@googlegroups.com
For batch-insertion:

#1 It doesn't need loads of heap, just enough to pull the data through (and if you have a cache for node lookups, that has to be accommodated too).
#2 Leave enough RAM for the OS and fs caches, e.g. 2-4G (depending on the total RAM).
#3 Put all other memory into the mmio settings; make sure the node file is fully mapped (#nodes * 14 bytes) and the rel file as much as possible (#rels * 33 bytes); for properties and strings, 500MB per file is good enough.
#4 Make sure your disk is fast enough (SSD) and has the correct settings, e.g. the scheduler on Linux.
#5 Configure cache_type=none.
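The sizing rule of thumb in #3 (14 bytes per node record, 33 bytes per relationship record) lends itself to a quick back-of-the-envelope calculation. The helper below is purely illustrative, and the node/edge counts in main are rough approximations of the amazon0601 dataset mentioned earlier, not exact figures:

```java
// Rough mmio sizing from the rule of thumb above: the node store needs
// about #nodes * 14 bytes and the relationship store about #rels * 33 bytes.
public class MmioSizing {

    static final int NODE_RECORD_BYTES = 14;
    static final int REL_RECORD_BYTES = 33;

    static long nodeStoreBytes(long nodeCount) {
        return nodeCount * NODE_RECORD_BYTES;
    }

    static long relStoreBytes(long relCount) {
        return relCount * REL_RECORD_BYTES;
    }

    public static void main(String[] args) {
        // Approximate figures for a graph of ~400k nodes and ~3.4M edges:
        System.out.printf("nodestore: ~%d MB%n", nodeStoreBytes(400_000L) / (1024 * 1024));
        System.out.printf("relstore:  ~%d MB%n", relStoreBytes(3_400_000L) / (1024 * 1024));
    }
}
```

For a graph of that size the node store needs only a few MB of mapped memory, while the relationship store wants on the order of 100 MB, which is why #3 prioritizes mapping the rel file.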

Javad Karabi

Feb 24, 2014, 6:20:59 PM2/24/14
to ne...@googlegroups.com
awesome! this is perfect!
one more thing, how can i query the batch inserter database to make sure the configuration was accepted and set?

Michael Hunger

Feb 24, 2014, 6:35:37 PM2/24/14
to ne...@googlegroups.com
add dump_configuration=true and it should output the config to messages.log 

Michael

Javad Karabi

Feb 24, 2014, 6:37:22 PM2/24/14
to ne...@googlegroups.com
im trying to programmatically test that it accepted the configuration, though.

Javad Karabi

Feb 24, 2014, 7:09:53 PM2/24/14
to ne...@googlegroups.com
for example:

i just want to test that neo4j is accepting it, because when i put garbage in the hash, neo4j doesn't complain, so i don't know if it's ignoring it or what