Creating graph with Neo4j graph database takes too long


Sotiris Beis

Jan 30, 2014, 6:04:35 AM1/30/14
to ne...@googlegroups.com
I use the following code to create a graph with Neo4j Graph Database:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.index.IndexHits;
import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.index.lucene.unsafe.batchinsert.LuceneBatchInserterIndexProvider;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserterIndex;
import org.neo4j.unsafe.batchinsert.BatchInserterIndexProvider;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class Neo4jMassiveInsertion implements Insertion {

    private BatchInserter inserter = null;
    private BatchInserterIndexProvider indexProvider = null;
    private BatchInserterIndex nodes = null;

    private static enum RelTypes implements RelationshipType {
        SIMILAR
    }

    public static void main(String[] args) {
        Neo4jMassiveInsertion test = new Neo4jMassiveInsertion();
        test.startup("data/neo4j");
        test.createGraph("data/youtubeEdges.txt");
        test.shutdown();
    }

    /**
     * Start the Neo4j database and configure it for massive insertion.
     * @param neo4jDBDir
     */
    public void startup(String neo4jDBDir) {
        System.out.println("The Neo4j database is now starting . . . .");
        Map<String, String> config = new HashMap<String, String>();
        config.put("cache_type", "none");
        config.put("use_memory_mapped_buffers", "true");
        config.put("neostore.nodestore.db.mapped_memory", "200M");
        config.put("neostore.relationshipstore.db.mapped_memory", "1000M");
        config.put("neostore.propertystore.db.mapped_memory", "250M");
        config.put("neostore.propertystore.db.strings.mapped_memory", "250M");
        inserter = BatchInserters.inserter(neo4jDBDir, config);
        indexProvider = new LuceneBatchInserterIndexProvider(inserter);
        nodes = indexProvider.nodeIndex("nodes", MapUtil.stringMap("type", "exact"));
    }

    public void shutdown() {
        System.out.println("The Neo4j database is now shutting down . . . .");
        if (inserter != null) {
            indexProvider.shutdown();
            inserter.shutdown();
            indexProvider = null;
            inserter = null;
        }
    }

    public void createGraph(String datasetDir) {
        System.out.println("Creating the Neo4j database . . . .");
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetDir)));
            String line;
            int lineCounter = 1;
            Map<String, Object> properties;
            IndexHits<Long> cache;
            long srcNode, dstNode;
            while ((line = reader.readLine()) != null) {
                if (lineCounter > 4) {
                    String[] parts = line.split("\t");
                    cache = nodes.get("nodeId", parts[0]);
                    if (cache.hasNext()) {
                        srcNode = cache.next();
                    } else {
                        properties = MapUtil.map("nodeId", parts[0]);
                        srcNode = inserter.createNode(properties);
                        nodes.add(srcNode, properties);
                        nodes.flush();
                    }
                    cache = nodes.get("nodeId", parts[1]);
                    if (cache.hasNext()) {
                        dstNode = cache.next();
                    } else {
                        properties = MapUtil.map("nodeId", parts[1]);
                        dstNode = inserter.createNode(properties);
                        nodes.add(dstNode, properties);
                        nodes.flush();
                    }
                    inserter.createRelationship(srcNode, dstNode, RelTypes.SIMILAR, null);
                }
                lineCounter++;
            }
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Compared with other graph database technologies (Titan, OrientDB), it needs far too much time, so maybe I am doing something wrong. Is there a way to speed up the procedure? I use Neo4j 1.9.5, and my machine has a 2.3 GHz CPU (i5), 4 GB of RAM, and a 320 GB disk, running Mac OS X Mavericks (10.9). My heap size is 2 GB.

I have tried replacing the Lucene index with a List or a HashMap to look up my nodes, but this made things much worse. To be more specific, I used the following test code:
public void createGraph(String datasetDir) {
    System.out.println("Creating the Neo4j database . . . .");
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetDir)));
        String line;
        int lineCounter = 1;
        Map<String, Object> properties;
        List<Long> index = new ArrayList<Long>();
        long srcNode, dstNode;
        while ((line = reader.readLine()) != null) {
            if (lineCounter > 4) {
                String[] parts = line.split("\t");

                if (index.contains(Long.valueOf(parts[0]))) {
                    srcNode = Long.valueOf(parts[0]);
                } else {
                    properties = MapUtil.map("nodeId", parts[0]);
                    srcNode = inserter.createNode(properties);
                    //nodes.add(srcNode, properties);
                    index.add(srcNode);
                }

                if (index.contains(Long.valueOf(parts[1]))) {
                    dstNode = Long.valueOf(parts[1]);
                } else {
                    properties = MapUtil.map("nodeId", parts[1]);
                    dstNode = inserter.createNode(properties);
                    //nodes.add(dstNode, properties);
                    index.add(dstNode);
                }

                inserter.createRelationship(srcNode, dstNode, RelTypes.SIMILAR, null);
            }
            lineCounter++;
        }
        reader.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

But it needs twice the time.

Thanks in advance,

Sotiris


Michael Hunger

Jan 30, 2014, 6:14:41 AM1/30/14
to ne...@googlegroups.com
Hi Sotiris,

Any chance to share your input file?

I don't understand your index usage. ArrayList has O(n) lookup complexity; why would you do that?
Also, I think you'll never find anything in your index if your database is not empty initially and your data set input is totally ordered by min(start, end).

I would also recommend renaming nodeId to "id", "youtubeId", or similar, to reduce confusion.

I recommend:

Map<String, Long> cache = new HashMap<>();

private long getOrCreateNode(String value) {
    Long id = cache.get(value);
    if (id == null) {
        Map props = map("nodeId", value);
        id = inserter.createNode(props);
        index.add(id, props);
        cache.put(value, id);
    }
    return id;
}

You can still add your nodeId to the Lucene index for later lookup mechanisms, but just don't flush on every insert, and don't read from Lucene during a high-performance insert.
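The pattern Michael describes can be sketched as a self-contained snippet. Here the BatchInserter call is stubbed out with a counter so the sketch runs without Neo4j, and the class and field names (NodeCache, nextNodeId) are illustrative, not from the thread:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the HashMap-cache pattern: keep the nodeId -> internal-node-id
// mapping in memory (O(1) expected lookup) and never read back from Lucene
// during the insert. The real inserter.createNode(props) call is replaced
// by a counter so this runs standalone.
public class NodeCache {

    private final Map<String, Long> cache = new HashMap<String, Long>();
    private long nextNodeId = 0; // stand-in for inserter.createNode(props)

    public long getOrCreateNode(String value) {
        Long id = cache.get(value);
        if (id == null) {
            // In the real code: id = inserter.createNode(map("nodeId", value));
            // and optionally nodes.add(id, props) for later lookups -- without flush().
            id = nextNodeId++;
            cache.put(value, id);
        }
        return id;
    }
}
```

Compared with the ArrayList.contains approach, each lookup drops from O(n) to O(1) expected time, which is where most of the speedup comes from.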

Michael

--
You received this message because you are subscribed to the Google Groups "Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email to neo4j+un...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Sotiris Beis

Jan 30, 2014, 7:57:11 AM1/30/14
to ne...@googlegroups.com
Great Michael,

Following your suggestions, I am now able to load my data in 7 sec.

Thanks,
Sotiris

Michael Hunger

Jan 30, 2014, 8:03:09 AM1/30/14
to ne...@googlegroups.com
How big is the file?
How does this compare to the other dbs?

Michael

Sotiris Beis

Jan 30, 2014, 8:14:27 AM1/30/14
to ne...@googlegroups.com
This is my test dataset http://snap.stanford.edu/data/amazon0601.html
Titan needs 26 sec and OrientDB needs over 2 minutes. Of course, I still have to look into whether there are more performance tunings I should make. The results will probably be published, so if you are interested I can share the paper when it is published.

Sotiris

Sotiris Beis

Feb 3, 2014, 9:40:01 AM2/3/14
to ne...@googlegroups.com
In addition to my previous question: are there any suggestions to improve the performance of the following code? I use it to simulate the time needed to create a graph when it is built by single insertions (incrementally, not in batch). So I commit every transaction and measure the time each block needs to be inserted, where a block consists of 1,000 nodes and their edges. Here is the code:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.graphdb.factory.GraphDatabaseSetting;
import org.neo4j.graphdb.index.Index;

public class Neo4jSingleInsertion implements Insertion {

    public static String INSERTION_TIMES_OUTPUT_PATH = "data/neo4j.insertion.times";

    private static int count;

    private GraphDatabaseService neo4jGraph = null;
    private Index<Node> nodeIndex = null;

    private static enum RelTypes implements RelationshipType {
        SIMILAR
    }

    public static void main(String[] args) {
        Neo4jSingleInsertion test = new Neo4jSingleInsertion();
        test.startup("data/neo4j");
        test.createGraph("data/enronEdges.txt");
        test.shutdown();
    }

    public void startup(String neo4jDBDir) {
        System.out.println("The Neo4j database is now starting . . . .");
        neo4jGraph = new GraphDatabaseFactory().newEmbeddedDatabase(neo4jDBDir);
        nodeIndex = neo4jGraph.index().forNodes("nodes");
    }

    public void shutdown() {
        System.out.println("The Neo4j database is now shutting down . . . .");
        if (neo4jGraph != null) {
            neo4jGraph.shutdown();
            nodeIndex = null;
        }
    }

    public void createGraph(String datasetDir) {
        count++;
        System.out.println("Incrementally creating the Neo4j database . . . .");
        List<Double> insertionTimes = new ArrayList<Double>();

        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetDir)));
            String line;

            int nodesCounter = 0;
            int lineCounter = 1;
            Transaction tx = null;
            long start = System.currentTimeMillis();
            long duration;

            while ((line = reader.readLine()) != null) {
                if (lineCounter > 4) {
                    String[] parts = line.split("\t");

                    Node srcNode = nodeIndex.get("nodeId", parts[0]).getSingle();
                    if (srcNode == null) {
                        tx = neo4jGraph.beginTx();
                        srcNode = neo4jGraph.createNode();
                        srcNode.setProperty("nodeId", parts[0]);
                        nodeIndex.add(srcNode, "nodeId", parts[0]);
                        tx.success();
                        tx.finish();
                        nodesCounter++;
                    }

                    if (nodesCounter == 1000) {
                        duration = System.currentTimeMillis() - start;
                        insertionTimes.add((double) duration);
                        nodesCounter = 0;
                        start = System.currentTimeMillis();
                    }

                    Node dstNode = nodeIndex.get("nodeId", parts[1]).getSingle();
                    if (dstNode == null) {
                        tx = neo4jGraph.beginTx();
                        dstNode = neo4jGraph.createNode();
                        dstNode.setProperty("nodeId", parts[1]);
                        nodeIndex.add(dstNode, "nodeId", parts[1]);
                        tx.success();
                        tx.finish();
                        nodesCounter++;
                    }

                    tx = neo4jGraph.beginTx();
                    srcNode.createRelationshipTo(dstNode, RelTypes.SIMILAR);
                    tx.success();
                    tx.finish();

                    if (nodesCounter == 1000) {
                        duration = System.currentTimeMillis() - start;
                        insertionTimes.add((double) duration);
                        nodesCounter = 0;
                        start = System.currentTimeMillis();
                    }
                }
                lineCounter++;
            }
            duration = System.currentTimeMillis() - start;
            insertionTimes.add((double) duration);

            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }

        Utils utils = new Utils();
        utils.writeTimes(insertionTimes, Neo4jSingleInsertion.INSERTION_TIMES_OUTPUT_PATH + "." + count);
    }
}

Thanks,
Sotiris

Michael Hunger

Feb 3, 2014, 11:41:18 AM2/3/14
to ne...@googlegroups.com
Why do you do individual inserts when you have blocks of data?

You can often aggregate events on the application level to be inserted as a bigger batch.

Otherwise you can also release the force-write-to-log constraint and use

Transaction tx = ((GraphDatabaseAPI)neo4jGraph).tx().unforced().begin();

Instead of neo4jGraph.beginTx();


Sotiris Beis

Feb 4, 2014, 3:16:11 AM2/4/14
to ne...@googlegroups.com
I don't have blocks of data; I measure the insertion time of 1,000 nodes and their edges (which I call a block). I am doing that because I want to simulate the creation of a graph by single-element insertion.

This
Transaction tx = ((GraphDatabaseAPI)neo4jGraph).tx().unforced().begin();
does the job, but the tx() function says it's deprecated. Is this going to be a problem?

Sotiris

Michael Hunger

Feb 4, 2014, 3:38:13 AM2/4/14
to ne...@googlegroups.com
No, it just indicates that the API might change in the future.

Sotiris Beis

Feb 4, 2014, 3:39:08 AM2/4/14
to ne...@googlegroups.com
Thank you Michael.

Debajyoti Roy

Feb 6, 2014, 2:06:09 PM2/6/14
to ne...@googlegroups.com
((GraphDatabaseAPI)neo4jGraph).tx().unforced().begin() is awesome but are there any down sides to doing this?

Michael Hunger

Feb 6, 2014, 5:09:46 PM2/6/14
to ne...@googlegroups.com
As the tx log is not forced to disk, you might lose a few seconds of inserted data when it crashes.

But the data on disk is still consistent.

Sent from mobile device

Debajyoti Roy

Feb 6, 2014, 5:12:03 PM2/6/14
to ne...@googlegroups.com
Thanks Michael, that makes it crystal clear (i am totally going for it :) )

Sotiris Beis

Feb 24, 2014, 9:47:19 AM2/24/14
to ne...@googlegroups.com
If I set keep_logical_logs to false, is that the same as Transaction tx = ((GraphDatabaseAPI)neo4jGraph).tx().unforced().begin(); ?

Michael Hunger

Feb 24, 2014, 9:51:08 AM2/24/14
to ne...@googlegroups.com
Just DON'T use unforced(); it is not suited or intended for live usage anyway.

Create sensible batches of operations that you execute at once, that's the best solution for now.
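The batching Michael suggests can be sketched as follows. Transaction handling is stubbed with a commit counter so the sketch runs without Neo4j; the names (BatchCommitter, BATCH_SIZE) are illustrative, not from the thread:

```java
// Sketch of grouping insert operations into transactions of BATCH_SIZE,
// instead of opening and committing one transaction per node/relationship.
// Commits are counted rather than performed, so this runs standalone.
public class BatchCommitter {

    static final int BATCH_SIZE = 1000;
    private int opsInTx = 0;
    private int commits = 0;

    public void applyOperation() {
        // In the real code: create a node or relationship inside the open tx.
        opsInTx++;
        if (opsInTx == BATCH_SIZE) {
            commit();
        }
    }

    public void finish() {
        if (opsInTx > 0) commit(); // commit the trailing partial batch
    }

    private void commit() {
        // In the real code: tx.success(); tx.finish(); tx = graph.beginTx();
        commits++;
        opsInTx = 0;
    }

    public int getCommits() { return commits; }
}
```

The point is that the per-transaction overhead (including the forced log flush) is paid once per thousand operations instead of once per operation.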

Michael

Sotiris Beis

Feb 24, 2014, 9:53:44 AM2/24/14
to ne...@googlegroups.com
I don't want to insert my data in batch mode, because of the needs of the experiment I want to conduct, as I have explained.

Javad Karabi

Feb 24, 2014, 5:08:26 PM2/24/14
to ne...@googlegroups.com
 Sotiris Beis, check out my project:
github.com/karabijavad/cadet
an example is at:

so, for example, with cadet:

db = Cadet::BatchInserter::Session.open("neo4j-community-2.0.1/data/graph.db")

db.constraint :Legislator, :name

l = db.get_node(:Legislator, :thomas_id, leg["id"]["thomas"].to_i)
gender = db.get_node(:Gender, :name, leg["bio"]["gender"])
l.outgoing(:gender) << gender

db.close

i implement an index in ruby, so you can still find nodes based on label/key/value.

personally, i get about 2k rows a second for importing my csv data, where each row can then create up to 10 other nodes and 10 other rels.

hope this helps

Michael Hunger

Feb 24, 2014, 5:42:08 PM2/24/14
to ne...@googlegroups.com
With the batch inserter you should get up to 1M nodes per second and, depending on your memory-mapping (mmio) settings for the rel file, between 100k and 500k rels/second.

That is, without doing index lookups.

Michael

Javad Karabi

Feb 24, 2014, 5:45:20 PM2/24/14
to ne...@googlegroups.com
ah wow, so looks like i probably have a lot more optimization i can still squeeze out of neo4j.
michael, i am importing on my laptop, which has a ton of resources available. 
can you suggest some neo4j configuration settings which do not care about other services running on the system, but can be used to give as much resources as possible to neo4j? thanks

Michael Hunger

Feb 24, 2014, 5:52:18 PM2/24/14
to ne...@googlegroups.com
For batch-insertion:

#1 It doesn't need loads of heap, just enough to pull the data through (and if you have a cache for node lookups, that has to be accommodated too).
#2 Leave enough RAM for the OS and fs caches, e.g. 2-4G (depending on the total RAM).
#3 Put all other memory into the mmio settings; make sure the node file is fully mapped (#nodes * 14 bytes) and the rel file as much as possible (#rels * 33 bytes); for properties and strings, 500MB per file is good enough.
#4 Make sure your disk is fast enough (SSD) and has the correct settings, e.g. the scheduler on Linux.
#5 Configure cache_type=none.
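The sizing rule of thumb in #3 (14 bytes per node record, 33 bytes per relationship record) lends itself to a quick back-of-the-envelope calculation. The helper below is purely illustrative, and the node/edge counts in main are rough approximations of the amazon0601 dataset mentioned earlier, not exact figures:

```java
// Rough mmio sizing from the rule of thumb above: the node store needs
// about #nodes * 14 bytes and the relationship store about #rels * 33 bytes.
public class MmioSizing {

    static final int NODE_RECORD_BYTES = 14;
    static final int REL_RECORD_BYTES = 33;

    static long nodeStoreBytes(long nodeCount) {
        return nodeCount * NODE_RECORD_BYTES;
    }

    static long relStoreBytes(long relCount) {
        return relCount * REL_RECORD_BYTES;
    }

    public static void main(String[] args) {
        // Approximate figures for a graph of ~400k nodes and ~3.4M edges:
        System.out.printf("nodestore: ~%d MB%n", nodeStoreBytes(400_000L) / (1024 * 1024));
        System.out.printf("relstore:  ~%d MB%n", relStoreBytes(3_400_000L) / (1024 * 1024));
    }
}
```

For a graph of that size the node store needs only a few MB of mapped memory, while the relationship store wants on the order of 100 MB, which is why #3 prioritizes mapping the rel file.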

Javad Karabi

Feb 24, 2014, 6:20:59 PM2/24/14
to ne...@googlegroups.com
awesome! this is perfect!
one more thing, how can i query the batch inserter database to make sure the configuration was accepted and set?

Michael Hunger

Feb 24, 2014, 6:35:37 PM2/24/14
to ne...@googlegroups.com
add dump_configuration=true and it should output the config to messages.log 

Michael

Javad Karabi

Feb 24, 2014, 6:37:22 PM2/24/14
to ne...@googlegroups.com
im trying to programmatically test that it accepted the configuration, though.

Javad Karabi

Feb 24, 2014, 7:09:53 PM2/24/14
to ne...@googlegroups.com
for example:

i just want to test that neo4j is accepting it, because when i put garbage in the hash, neo4j doesn't complain, so i don't know if it's ignoring it or what