Poor insert performance


Jason W

Dec 28, 2013, 6:30:35 PM
to ne...@googlegroups.com
Hi Everyone,
I'm relatively new to Neo4j and I'm running into slowness when trying to insert a batch of data. My strategy has been to write the batch of Cypher queries to a text file and then pipe that into the neo4j-shell. Here is a description of my data.

Start with a single "user" node.
Create (if not exists) many "attribute" nodes.
Create relationships between "user" node and "attribute" nodes.

In my benchmarking, I'm creating 10,000 attribute nodes and relationships from the user to the attributes. The caveat is that an attribute node may already exist, and if it does I want to use the existing one instead of creating a new one. My current approach uses the MERGE command to create the attribute nodes (or match the existing node if one already exists). My Cypher queries look something like this:

MERGE (a:Attribute {coordinate: '#{key}'}) WITH a 
MATCH (u:User {name: '#{user}'}) CREATE UNIQUE (u)-[r:HAS_ATTR]->(a)

Running 10,000 sequential queries like this to insert my data is quite slow; I'm getting somewhere around 20 inserts per second. Here are some things I've tried to optimize:

-Batch these into a large transaction in a text file, and pipe it into the neo4j-shell
-Batch these into a large single command in a text file, and pipe into neo4j-shell
-Break into parallel jobs and insert multi-threaded. Each query must run in its own transaction, otherwise the jobs block on each other's locks.
-Separate the MERGE commands into a batch, and the CREATE relationship commands into a separate batch
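The first attempt above (one large transaction piped into neo4j-shell) could be generated with a short script. This is only a sketch of that workflow; the file name, user name, and coordinate values are hypothetical placeholders, and the Cypher text mirrors the statement shown earlier in this message.

```python
# Sketch: generate one begin/commit transaction block for piping into
# neo4j-shell. File name, user, and coordinates are hypothetical placeholders.

def write_batch(path, user, coordinates):
    """Write one begin/commit block with a MERGE + CREATE UNIQUE per coordinate."""
    with open(path, "w") as f:
        f.write("begin\n")
        for key in coordinates:
            f.write(
                "MERGE (a:Attribute {coordinate: '%s'}) WITH a "
                "MATCH (u:User {name: '%s'}) "
                "CREATE UNIQUE (u)-[r:HAS_ATTR]->(a);\n" % (key, user)
            )
        f.write("commit\n")

write_batch("batch.cql", "jason", ["1:1", "1:2", "1:3"])
```

The resulting file would then be piped in with something like `bin/neo4j-shell -file batch.cql`.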

I've done the tooling benchmark to test file system performance (http://docs.neo4j.org/chunked/milestone/linux-performance-guide.html) and my results are great. I should be able to get upwards of 70k records/sec based on the benchmark.

Can anyone advise on the best strategy to import this type of data quickly?




This message contains confidential information and is intended only for the individual named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. If you are not the intended recipient you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited.

Michael Hunger

Dec 28, 2013, 7:51:37 PM
to ne...@googlegroups.com
Jason,

Usually you would use parameters to speed it up. The shell also supports parameters; you can use "export param=value",

e.g.
export key="#{key}"
export user="#{user}"
MERGE (a:Attribute {coordinate: {key}}) WITH a 
MATCH (u:User {name: {user}}) CREATE UNIQUE (u)-[r:HAS_ATTR]->(a);

Did you create a unique index for your MERGE command? (Or at least a normal index on :Attribute(coordinate)?)

What are the attributes for?

Also, combining around 20-50k elements in a single transaction would speed it up.

begin

export key="#{key}"
export user="#{user}"
MERGE (a:Attribute {coordinate: {key}}) WITH a 
MATCH (u:User {name: {user}}) CREATE UNIQUE (u)-[r:HAS_ATTR]->(a);
...
...
...
commit

Did you try to run bin/neo4j-shell -file file?

Are you running against a running server, or the shell with -path?
You probably want the former, so that it can use the memory config of the running server.
Otherwise it might make sense to configure the neo4j-shell script (if you edit it, there is a line like this; add some sensible memory config to it):
EXTRA_JVM_ARGUMENTS="-Xmx8G -Xms8G -Xmn1G"
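For the running-server case, the equivalent knobs live in the server's config files rather than the shell script. A rough, hedged sketch for the Neo4j 1.9/2.0 era follows; setting names and sensible values depend on your exact version and on your store file sizes, so treat these numbers as placeholders:

```
# conf/neo4j-wrapper.conf: JVM heap for the server process (values in MB)
wrapper.java.initmemory=8192
wrapper.java.maxmemory=8192

# conf/neo4j.properties: memory-mapped store files
neostore.nodestore.db.mapped_memory=1G
neostore.relationshipstore.db.mapped_memory=4G
neostore.propertystore.db.mapped_memory=2G
neostore.propertystore.db.strings.mapped_memory=1G
```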


For fast imports of CSV files with a single Cypher statement like yours, perhaps my neo4j-shell import tools would be helpful :)


HTH

Michael

--
You received this message because you are subscribed to the Google Groups "Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email to neo4j+un...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Jason W

Dec 29, 2013, 12:29:29 AM
to ne...@googlegroups.com
Michael,
Thanks for the reply. Your tool looks pretty interesting! It looks like it lets me use parameters by providing a CSV file of values. I'll give it a try.

To answer your questions..
I have created a unique index on :Attribute(coordinate). The attributes are simply nodes that need to be connected to the user. Different users will share some of these attributes, and I need to be able to query which ones are shared (or not shared) between various users. I was running by piping the Cypher queries to just "neo4j-shell" with a running server. Should I be using the "-file" option?

Jason W

Dec 29, 2013, 1:58:16 AM
to ne...@googlegroups.com
Michael,
I tried out your tool and I love the ease at which I was able to get going. Unfortunately, it hasn't really helped my performance issue.

Here's my command:
import-cypher -i input.csv -o output.csv MERGE (a:Attribute {coordinate: {coordinate}}) WITH a MATCH (u:User {name: 'jason'}) CREATE UNIQUE (u)-[r:HAS_ATTRIBUTE]->(a)

input.csv looks like this:
coordinate
1:1
1:2
1:3
etc..

Running a test with 1,000 attributes in input.csv took 230 seconds, which is a measly 4.3 inserts per second.

Jason W

Dec 29, 2013, 4:55:18 AM
to ne...@googlegroups.com
Just realized I didn't have the index set on the right property. Doh!

After adding the index, I was able to insert a batch of 1,000 in 3.2 seconds, which feels much better. When trying a larger batch, though, the performance does not scale linearly: a batch of 25,000 took almost 15 minutes. I can clearly see disk writes and garbage collection playing a role now, so I'm experimenting with batch sizes. I'm on a Linux server with 64 GB of memory, 64 cores, and software RAID 10 over 4 x 7,200 RPM disks. I'm using default settings on Neo4j.

Any tuning advice would be greatly appreciated!

Axel Morgner

Dec 29, 2013, 4:59:30 AM
to ne...@googlegroups.com
Jason, you might also try some of the tuning options described here: http://structr.org/blog/neo4j-performance-on-ext4

Best
Axel



--

Axel Morgner
CEO Structr (c/o Morgner UG) · Hanauer Landstr. 291a · 60314 Frankfurt · Germany
Twitter: @amorgner
Phone: +49 151 40522060
Skype: axel.morgner

Structr - Award-Winning Open Source CMS and Web Framework based on Neo4j
Structr Mailing List and Forum
Graph Database Usergroup "graphdb-frankfurt"

Michael Hunger

Dec 29, 2013, 5:15:37 AM
to ne...@googlegroups.com
Do you run it against the server, or with -path?

If the latter, please remember to set the memory options in the shell script.

I'll try your example later today

Sent from mobile device

Jason Wang

Dec 29, 2013, 11:19:13 AM
to ne...@googlegroups.com
I'm running against a running server, with mostly default settings.

Jason W

Dec 29, 2013, 4:26:53 PM
to ne...@googlegroups.com
Michael,
It looks like CREATE UNIQUE is the main bottleneck. If I use a plain CREATE instead of CREATE UNIQUE, the import speeds up by several orders of magnitude.

The only reason I need UNIQUE here is to make sure I'm not creating the same relationship more than once between two nodes. Is there some type of index I need to set up on the relationship to improve the performance? Is it possible to enforce a unique constraint on the relationship?



Michael Hunger

Dec 29, 2013, 6:05:36 PM
to ne...@googlegroups.com
Jason,

I did some more testing.

It seems to be something with reading transactional state for node-label+property; still investigating.

To test, I ran with 1M- and 100k-line CSV files, both with your statement and with one that uses MERGE twice,
and for comparison I also did a plain MERGE, a plain CREATE, a plain MATCH, and a "hand-coded" MERGE.

CREATE and MATCH are both super fast: 12 and 8 seconds for 100k nodes, as expected.

All the others take much longer than they should.

Usually you can create 30k nodes per second even with the transactional API.

If it works for you now, can I recommend that you satisfy uniqueness externally and just use a plain CREATE for your import?
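Satisfying uniqueness externally could look like the following sketch: deduplicate the CSV before import so the Cypher statement can use a plain CREATE. The file names and the small demonstration input are hypothetical placeholders.

```python
# Sketch: enforce attribute uniqueness outside the database by deduplicating
# the CSV before import. File names here are hypothetical placeholders.
import csv

def dedupe_csv(src, dst, key="coordinate"):
    """Copy src to dst, keeping only the first row seen for each value of `key`."""
    seen = set()
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row[key] not in seen:
                seen.add(row[key])
                writer.writerow(row)

# Small demonstration input containing one duplicate coordinate.
with open("input_raw.csv", "w") as f:
    f.write("coordinate\n1:1\n1:2\n1:1\n1:3\n")
dedupe_csv("input_raw.csv", "input_unique.csv")
```

The deduplicated file can then be fed to the importer with CREATE instead of MERGE, since every coordinate is guaranteed to appear only once.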

I continue to investigate.

Cheers

Michael


Here are some of my code snippets:

create data for coordinates and users

echo coordinate,user > input.csv; for a in `seq 1 100`; do for i in `seq 1 1000`; do echo "$a:$i,user$a" >> input.csv ; done; done

This is your code; note that it doesn't create users, so the more coordinates/attributes you put in, the more connections you'll create to a single user.

time bin/neo4j-shell -c "import-cypher -i input.csv MERGE (a:Attribute {coordinate: {coordinate}}) WITH a match (u:User {name: {user}}) CREATE UNIQUE (u)-[r:HAS_ATTRIBUTE]->(a)"

cleanup and initial index creation

rm -rf import100.db
bin/neo4j-shell -path import100.db -c "create index on :User(name); && create index on :Attribute(coordinate); && schema"

Query: MERGE (a:Attribute {coordinate: {coordinate}}) MERGE (u:User {name: {user}}) CREATE UNIQUE (u)-[r:HAS_ATTRIBUTE]->(a) infile input100.csv delim ',' quoted false outfile (none) batch-size 20000
Import statement execution created 0 rows of output.

for 1000 elements
real 0m39.419s
user 0m46.830s
sys 0m0.976s

CREATE CONSTRAINT ON (u:User) ASSERT u.name IS UNIQUE;
CREATE CONSTRAINT ON (a:Attribute) ASSERT a.coordinate IS UNIQUE;
schema

time bin/neo4j-shell -path import100.db -c "import-cypher -i input.csv MERGE (a:Attribute {coordinate: {coordinate}}) MERGE (u:User {name: {user}}) CREATE UNIQUE (u)-[r:HAS_ATTRIBUTE]->(a)"

Query: MERGE (a:Attribute {coordinate: {coordinate}}) MERGE (u:User {name: {user}}) CREATE UNIQUE (u)-[r:HAS_ATTRIBUTE]->(a) infile input100.csv delim ',' quoted false outfile (none) batch-size 20000
Import statement execution created 0 rows of output.

for 1000 elements

real 2m11.231s
user 2m9.106s
sys 0m22.003s

for 100k lines

"hand written" MERGE -> 7:30 min

time bin/neo4j-shell -path import100.db -c "import-cypher -i input.csv MATCH (a:Attribute {coordinate: {coordinate}}) WITH count(*) as c where c = 0 CREATE (a:Attribute {coordinate: {coordinate}}) return count(*)"

MERGE only -> 9:30 min

time bin/neo4j-shell -path import100.db -c "import-cypher -i input.csv MERGE (a:Attribute {coordinate: {coordinate}})"

CREATE only -> 12s

time bin/neo4j-shell -path import100.db -c "import-cypher -i input.csv CREATE (a:Attribute {coordinate: {coordinate}})"

MATCH only -> 8s

time bin/neo4j-shell -path import100.db -c "import-cypher -i input.csv MATCH (a:Attribute {coordinate: {coordinate}}) return count(*)"



Michael Hunger

Dec 30, 2013, 5:57:17 AM
to ne...@googlegroups.com
Weird,

I re-did the tests, now with 1M lines, i.e. 1000 users and 1000 attributes in total (i.e. max 1k connections per user), and got 20:40
with MERGE (a:Attribute {coordinate: {coordinate}}) MERGE (u:User {name: {user}}) CREATE UNIQUE (u)-[r:HAS_ATTRIBUTE]->(a)

Something else you can do is to change your approach: create the users and attributes first, using plain CREATE statements (which also makes sure they are added to the index; both passes take about 2s),
and then create the relationships using MATCH + CREATE (UNIQUE), i.e.

With that I could confirm your findings regarding CREATE UNIQUE (and MERGE, for that matter); some of this is expected, but probably not this much of an impact.

> MATCH (a:Attribute {coordinate: {coordinate}}), (u:User {name: {user}}) CREATE UNIQUE (u)-[r:HAS_ATTRIBUTE]->(a)

this takes 7:10
for 1k users 1k attributes and 1M rels

> MATCH (a:Attribute {coordinate: {coordinate}}), (u:User {name: {user}}) CREATE (u)-[r:HAS_ATTRIBUTE]->(a)

this takes 1:10
for 1k users 1k attributes and 1M rels

you should also be able to use MERGE for the relationship

> MATCH (a:Attribute {coordinate: {coordinate}}), (u:User {name: {user}}) MERGE (u)-[r:HAS_ATTRIBUTE]->(a)

this takes 8:30
for 1k users 1k attributes and 1M rels
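The two-phase approach suggested above (nodes first with plain CREATE, relationships second) implies splitting the combined CSV into per-pass files. A sketch of that preprocessing step follows; the file names, column names, and demonstration data are hypothetical placeholders.

```python
# Sketch: split a coordinate,user CSV into three import files -- unique users,
# unique attributes, and the relationship pairs -- so both node types can be
# loaded with plain CREATE before the relationship pass. Names hypothetical.
import csv

def split_for_two_phase(src):
    users, attrs, rels = set(), set(), []
    with open(src, newline="") as f:
        for row in csv.DictReader(f):
            users.add(row["user"])
            attrs.add(row["coordinate"])
            rels.append((row["coordinate"], row["user"]))
    with open("users.csv", "w") as f:
        f.write("name\n" + "".join(u + "\n" for u in sorted(users)))
    with open("attributes.csv", "w") as f:
        f.write("coordinate\n" + "".join(a + "\n" for a in sorted(attrs)))
    with open("rels.csv", "w") as f:
        f.write("coordinate,user\n" + "".join("%s,%s\n" % r for r in rels))

# Small demonstration input: two users sharing one attribute.
with open("pairs.csv", "w") as f:
    f.write("coordinate,user\n1:1,user1\n1:2,user1\n1:1,user2\n")
split_for_two_phase("pairs.csv")
```

Each output file would then drive one import-cypher pass: CREATE for users.csv and attributes.csv, then the MATCH + CREATE statement for rels.csv.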

Jason W

Dec 30, 2013, 10:35:43 PM
to ne...@googlegroups.com
Michael,
I did some more testing as well and found two very interesting things:
1. Creating the relationship using MERGE is nearly as fast as CREATE, but will not create the same relationship twice, which is effectively the same as CREATE UNIQUE. This is great.
2. When using CREATE INDEX instead of a unique constraint on the coordinate property, the performance is 4x better. Since I really want to enforce uniqueness at the DB level, that trade-off is unfortunate.

Thoughts?