update properties on 10 million nodes with core api


Klemens Engelbrechtsmüller

Oct 28, 2015, 6:55:30 PM
to Neo4j

Hello! I use the Neo4j Java Core API and want to update 10 million nodes. I thought it would be better to do it with multithreading, but the performance is not good (35 minutes for setting the properties).

To explain: each "Person" node has at least one "POINTSREL" relationship to a "Point" node, which has the property "Points". I want to sum up the points from the "Point" nodes and set the total as a property on the "Person" node.


Here is my code:


Transaction transaction = service.beginTx();
ResourceIterator<Node> iterator = service.findNodes(Labels.person);
transaction.success();
transaction.close();

ExecutorService executor = Executors.newFixedThreadPool(5);

while(iterator.hasNext()){
    executor.execute(new MyJob(iterator.next()));
}

//wait until all threads are done
executor.shutdown();

try {
    executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
} catch (InterruptedException e) {
    e.printStackTrace();
}


And here is the runnable class:


private class MyJob implements Runnable {

    private Node node;

    /* collect useful parameters in the constructor */
    public MyJob(Node node) {
        this.node = node;
    }

    public void run() {
        Transaction transaction = service.beginTx();
        Iterable<org.neo4j.graphdb.Relationship> rel = this.node.getRelationships(RelationType.POINTSREL, Direction.OUTGOING);

        double sum = 0;
        for(org.neo4j.graphdb.Relationship entry : rel){
            try{
                sum += (Double)entry.getEndNode().getProperty("Points");
            } catch(Exception e){
                e.printStackTrace();
            }
        }
        sum = 0;
        for(org.neo4j.graphdb.Relationship entry : rel){
            try{
                sum += (Double)entry.getEndNode().getProperty("Points");
            } catch(Exception e){
                e.printStackTrace();
            }
        }
        this.node.setProperty("Sum", sum);

        transaction.success();
        transaction.close();
    }
}


Is there a better (faster) way to do that?


About my setup: an AWS instance with 8 CPUs and 32 GB RAM.


neo4j-wrapper.conf


# Java Heap Size: by default the Java heap size is dynamically
# calculated based on available system resources.
# Uncomment these lines to set specific initial and maximum
# heap size in MB.
wrapper.java.initmemory=16000
wrapper.java.maxmemory=16000


neo4j.properties


# The type of cache to use for nodes and relationships.
cache_type=soft
cache.memory_ratio=30.0
neostore.nodestore.db.mapped_memory=2G
neostore.relationshipstore.db.mapped_memory=7G
neostore.propertystore.db.mapped_memory=2G
neostore.propertystore.db.strings.mapped_memory=2G
neostore.propertystore.db.arrays.mapped_memory=512M


Do you have any ideas? Thank you

Michael Hunger

Oct 28, 2015, 9:05:17 PM
to ne...@googlegroups.com
Hi Klemens, sorry I missed your email.

Which version are you running?

You should include 1,000 to 10,000 node updates per transaction instead of doing them one by one.
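
Roughly like this (an untested sketch reusing your service, Labels and RelationType; the ids are collected first because the ResourceIterator is tied to the transaction it was opened in):

import java.util.ArrayList;
import java.util.List;
import org.neo4j.graphdb.*;

// collect the Person node ids in one read transaction
List<Long> personIds = new ArrayList<>();
try (Transaction tx = service.beginTx()) {
    ResourceIterator<Node> people = service.findNodes(Labels.person);
    while (people.hasNext()) {
        personIds.add(people.next().getId());
    }
    tx.success();
}

// then write the sums back, 10k updates per transaction
final int BATCH_SIZE = 10_000;
for (int start = 0; start < personIds.size(); start += BATCH_SIZE) {
    try (Transaction tx = service.beginTx()) {
        for (long id : personIds.subList(start, Math.min(start + BATCH_SIZE, personIds.size()))) {
            Node person = service.getNodeById(id);
            double sum = 0;
            for (Relationship rel : person.getRelationships(RelationType.POINTSREL, Direction.OUTGOING)) {
                sum += (Double) rel.getEndNode().getProperty("Points");
            }
            person.setProperty("Sum", sum);
        }
        tx.success();   // one commit covers the whole batch
    }
}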

If you are on 2.2+ you have to set dbms.pagecache.memory=12G instead of the mmio settings.

Why do you compute the sum twice?

You might even be faster if you separate the compute step from the update step.

I.e. put the sums into a double array keyed by node-id, have the threads in your pool fill that array, and have other threads pull the values out and write them back.
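
A rough sketch of that split (untested; it assumes the personIds list from above, and that the highest node id fits into an int-indexed array — otherwise a ConcurrentHashMap<Long, Double> works too):

import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.neo4j.graphdb.*;

final int CHUNK = 10_000;
final double[] sums = new double[Collections.max(personIds).intValue() + 1];

// phase 1: reader threads fill the array, one chunk of ids per job
ExecutorService pool = Executors.newFixedThreadPool(16);
for (int start = 0; start < personIds.size(); start += CHUNK) {
    final List<Long> chunk = personIds.subList(start, Math.min(start + CHUNK, personIds.size()));
    pool.execute(new Runnable() {
        public void run() {
            try (Transaction tx = service.beginTx()) {   // one read tx per chunk
                for (long id : chunk) {
                    double sum = 0;
                    for (Relationship rel : service.getNodeById(id)
                            .getRelationships(RelationType.POINTSREL, Direction.OUTGOING)) {
                        sum += (Double) rel.getEndNode().getProperty("Points");
                    }
                    sums[(int) id] = sum;   // safe: each id is written by exactly one job
                }
                tx.success();
            }
        }
    });
}
pool.shutdown();
try {
    pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}

// phase 2: pull the precomputed values out and write them, 10k per transaction
for (int start = 0; start < personIds.size(); start += CHUNK) {
    try (Transaction tx = service.beginTx()) {
        for (long id : personIds.subList(start, Math.min(start + CHUNK, personIds.size()))) {
            service.getNodeById(id).setProperty("Sum", sums[(int) id]);
        }
        tx.success();
    }
}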

But if it's only 10M nodes you might also just use Cypher (provided you keep the 16G heap) and batch it into 500k node updates:

MATCH (p:Person)
WHERE not exists(p.sum)
WITH p LIMIT 500000
MATCH (p)-[:POINTSREL]->(point)
WITH p, sum(point.Points) as sum
SET p.sum = sum;

You can also just use SKIP + LIMIT to page through the set of people and execute the Cypher concurrently.
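
E.g. something like this (untested; the ORDER BY keeps the pages stable, and the {skip} parameter is supplied per page as 0, 500000, 1000000, ...):

// run once per page, incrementing {skip} by 500000 each time
MATCH (p:Person)
WITH p ORDER BY id(p) SKIP {skip} LIMIT 500000
MATCH (p)-[:POINTSREL]->(point)
WITH p, sum(point.Points) as sum
SET p.sum = sum;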

Michael




Sun Yuhan

Nov 5, 2015, 8:27:18 PM
to Neo4j, kl.engelbre...@gmail.com
I hope the Neo4j Java API will help you. There is an interface called BatchInserter that can load large amounts of data quickly. But be careful, because it is not transactional: you cannot recover from errors.
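
A minimal sketch (untested; the database must be shut down while the BatchInserter has the store open, and the store path and personIds list are just placeholders):

import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;
import org.neo4j.unsafe.batchinsert.BatchRelationship;

// works directly on the store files, bypassing transactions entirely
BatchInserter inserter = BatchInserters.inserter("/path/to/graph.db");
try {
    for (long personId : personIds) {
        double sum = 0;
        for (BatchRelationship rel : inserter.getRelationships(personId)) {
            // only outgoing POINTSREL relationships count
            if (rel.getStartNode() == personId
                    && rel.getType().name().equals("POINTSREL")) {
                sum += (Double) inserter.getNodeProperties(rel.getEndNode()).get("Points");
            }
        }
        inserter.setNodeProperty(personId, "Sum", sum);
    }
} finally {
    inserter.shutdown();   // a clean shutdown is mandatory, or the store may be left corrupted
}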

Michael Hunger

Nov 5, 2015, 9:34:37 PM
to ne...@googlegroups.com
Also Klemens,

it is a bit confusing that you share Java code but refer to the server config.

Which Neo4j version do you use? The config seems to be for 2.1.x.
For 2.2.x there is dbms.pagecache.memory=8G
I'd use cache_type=none

Where and how do you run this code?

Usually Neo4j can update things in a multi-threaded way really quickly.

I can update 10M nodes in 20 seconds with 24 threads on a 6 core machine.

You probably also want to batch it up a bit, i.e. do 1,000 to 10,000 nodes per transaction and job.

If you have 8 cores you should use a thread pool of size 16 or 32.
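
E.g., size the pool off the machine instead of hard-coding it:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// roughly 2x the core count; tune for your workload
int threads = Runtime.getRuntime().availableProcessors() * 2;
ExecutorService pool = Executors.newFixedThreadPool(threads);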

Michael


Clark Richey

Nov 5, 2015, 10:18:25 PM
to ne...@googlegroups.com
It also seems like you are updating a single node per transaction. That's not efficient. You can do thousands of updates in a single transaction. 

Sent from my iPhone