Neo4j Insert Performance

Timothy Braun

Jan 19, 2012, 10:55:55 AM
to ne...@googlegroups.com
Hey Everyone,
   We are experimenting with using Neo4j for managing the social graph side of our site.  In this scenario, we use MongoDB as our primary datastore and synchronize lightweight nodes/relationships to Neo4j to allow for quick graph-based queries.

  Our initial tests were great, but now performance has become a major hindrance.  Simple inserts cause the CPU utilization of the EC2 instance to spike to ~100%, and inserting/updating 2,000 records takes ~300 seconds.  It is running on a small EC2 instance with 1.7 GB of RAM, and the instance is dedicated to Neo4j.  It currently has 810k nodes and 1.4 million relationships for a total physical DB size of 267 MB.  The Java heap size is currently set at 512 MB.  It should also be noted that read performance seems to be acceptable as long as no writes are in progress.

  We have looked around on the forums and we can't seem to find any similar situations.  Any help would be greatly appreciated.  As a side note, we use SOLR for indexing and run a similar synchronization process there.  Performance for the SOLR indexes is remarkable, so we believe this issue falls somewhere in the realm of Neo4j.  Also, if there is any additional information I can provide, please ask and I will do my best to get it for you.

  Thanks in advance,
  Timothy Braun
  Team ClubCreate

Timothy Braun

Jan 19, 2012, 10:57:23 AM
to ne...@googlegroups.com
I should also mention, it is running Neo4j 1.5 (latest stable release) and we are using the REST API to access the graph.
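For context, each record currently ends up as a couple of separate HTTP calls against the standard 1.5 endpoints, roughly like this (just a sketch with placeholder values, not our actual payloads):

    # create a node with its properties
    curl -X POST -H 'Content-Type: application/json' \
         -d '{"name": "some user", "created": 1326988555}' \
         http://localhost:7474/db/data/node

    # add that node to the 'users' index; the node URI comes from the
    # Location header / "self" field of the response above
    curl -X POST -H 'Content-Type: application/json' \
         -d '{"uri": "http://localhost:7474/db/data/node/123", "key": "name", "value": "some user"}' \
         http://localhost:7474/db/data/index/node/users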

Thanks,
Timothy Braun
Team ClubCreate

Peter Neubauer

Jan 19, 2012, 11:04:06 AM
to ne...@googlegroups.com
Timothy,
what do your inserts look like? How big are the batches you are doing?

Cheers,

/peter neubauer

Google:    neubauer.peter
Skype:     peter.neubauer
Phone:     +46 704 106975
LinkedIn:  http://www.linkedin.com/in/neubauer
Twitter:    @peterneubauer
Tungle:    tungle.me/peterneubauer

brew install neo4j && neo4j start
heroku addons:add neo4j

Timothy Braun

Jan 19, 2012, 11:21:38 AM
to ne...@googlegroups.com
Inserts consist of simple nodes of two types at the moment: a user node, which has 4 properties (3 of which are added to a 'users' index), and a song node, which has 5 properties (2 of which are added to a 'songs' index).

We've experimented with both batched and non-batched inserts, and we are getting better performance with non-batched inserts (in Neo4j terms, that is).  The process will currently max out at syncing 2,000 nodes in a single go.

Here is a sample of the sync of user nodes, nothing terribly complicated:


Timothy Braun

Jan 19, 2012, 11:22:43 AM
to ne...@googlegroups.com
Here's the link instead: https://gist.github.com/1640949

Peter Neubauer

Jan 19, 2012, 11:24:55 AM
to ne...@googlegroups.com
Timothy,
I suspect JSON parsing overhead is the cause of this. Is there any possibility you could write the insert in Java, put it into the server as a plugin, and only push over the data you need as a file, stream or compact format? Otherwise, contact me off-list and we can try to work out a good insertion scheme.
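Something along these lines would do it (a rough sketch only, with made-up class, method and parameter names, no error handling, and only the user nodes covered):

import java.util.ArrayList;
import java.util.List;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;
import org.neo4j.server.plugins.Description;
import org.neo4j.server.plugins.Name;
import org.neo4j.server.plugins.Parameter;
import org.neo4j.server.plugins.PluginTarget;
import org.neo4j.server.plugins.ServerPlugin;
import org.neo4j.server.plugins.Source;

@Description("Bulk insert of user nodes in one server-side transaction")
public class BulkUserInsert extends ServerPlugin {

    @Name("insert_users")
    @PluginTarget(GraphDatabaseService.class)
    public Iterable<Node> insertUsers(@Source GraphDatabaseService db,
            @Parameter(name = "names") String[] names) {
        List<Node> created = new ArrayList<Node>();
        Index<Node> users = db.index().forNodes("users");
        // one transaction for the whole batch instead of one per node
        Transaction tx = db.beginTx();
        try {
            for (String name : names) {
                Node node = db.createNode();
                node.setProperty("name", name);
                users.add(node, "name", name);
                created.add(node);
            }
            tx.success();
        } finally {
            tx.finish();
        }
        return created;
    }
}

You drop the jar into the server's plugins directory, and the method then shows up as an extension you can POST the whole payload to in one request, so the JSON handling and the transaction overhead are paid once per batch instead of once per node.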

Cheers,

/peter neubauer


Timothy Braun

Jan 19, 2012, 11:34:37 AM
to ne...@googlegroups.com
Peter,
Thanks again for the quick response, but I don't think it's the JSON parsing causing the issue. The requests will have finished and CPU utilization will still be spiked for quite some time afterwards (30-60 seconds). And if JSON parsing were causing the issue, wouldn't we see similar performance issues on our SOLR instances? (They use XML for data transfer, but the documents contain drastically more content per record than we are syncing to Neo4j.)

Thanks,
Tim

Peter Neubauer

Jan 19, 2012, 11:47:24 AM
to ne...@googlegroups.com
Well,
it could also be GC pauses, meaning the JVM needs some tuning. The JSON parsing produces quite a few objects, so that might be a cause. However, I think one needs to look a bit closer ...

Timothy Braun

Jan 19, 2012, 11:51:55 AM
to ne...@googlegroups.com
Peter,
Again, thanks for the quick replies, it is most appreciated.

Any suggestions on where to start?  Logs of some nature?  Neo4j is new to us, so we aren't yet familiar with its inner workings.

Thanks,
Tim

Michael Hunger

Jan 19, 2012, 12:15:09 PM
to ne...@googlegroups.com
Timothy could you by chance share your project + data-inserter with us to perform some profiling? That would be awesome.

Did you write the abstraction layer over the Neo4j REST API yourself, or do you use an existing binding?

Single node + single relationship inserts are one of the least performant ways to insert data. The batch API should really speed this up quite a lot.
Otherwise, looking into something like the geoff plugin for importing data in a consistent import format would also be a good way to go.
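With the REST batch endpoint, a whole sync run goes over the wire as a single request, something like this (property values and the relationship type are placeholders; "{0}" and "{2}" refer back to the nodes created by jobs 0 and 2):

curl -X POST -H 'Content-Type: application/json' http://localhost:7474/db/data/batch -d '
[
  {"id": 0, "method": "POST", "to": "/node",
   "body": {"name": "some user"}},
  {"id": 1, "method": "POST", "to": "/index/node/users",
   "body": {"uri": "{0}", "key": "name", "value": "some user"}},
  {"id": 2, "method": "POST", "to": "/node",
   "body": {"title": "some song"}},
  {"id": 3, "method": "POST", "to": "{0}/relationships",
   "body": {"to": "{2}", "type": "LIKES"}}
]'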


Peter can point you to a working version of this plugin to install in your server.

Is this just an initial import or an ongoing one?

EC2 disk performance is always an issue; do you use EBS volumes or local storage for the graph?

Could you enable GC logging in your neo4j-wrapper.conf?
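Roughly these lines in conf/neo4j-wrapper.conf (the index numbers just have to be unique among the wrapper.java.additional.* entries already in the file, and the log path is up to you):

wrapper.java.additional.3=-Xloggc:data/log/neo4j-gc.log
wrapper.java.additional.4=-XX:+PrintGCDetails
wrapper.java.additional.5=-XX:+PrintGCDateStamps
wrapper.java.additional.6=-XX:+PrintGCApplicationStoppedTime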

Timothy Braun

Jan 19, 2012, 12:28:24 PM
to ne...@googlegroups.com
Hey Michael,
I am more than happy to share anything that would help with the situation.  It would have to be privately though. 

The abstraction layer is neo4jphp, which you can find here: https://github.com/jadell/neo4jphp.  We are PHP-based, so we try to stay in the realm of PHP when possible, but we are more than capable of expanding into other areas.

I will take a look at geoff and see how it can assist us.  For batches, we have tried both single inserts and batch inserts, and single inserts are currently achieving the best performance, oddly enough.  Syncing 2,000 records (a mix of inserts and updates) takes ~196 seconds individually, compared to ~300 seconds as a batch.

This will be an ongoing process.  Like our SOLR sync, it will occur every 60 seconds or so, syncing the latest versions from our Mongo datastore.

All our EC2 instances are EBS based.

Absolutely.  I will turn on gc-logs and perform a couple of syncs later this afternoon.

Thanks,
Tim

Timothy Braun

Jan 19, 2012, 5:07:38 PM
to ne...@googlegroups.com
Here is the GC log.

From 146.736 to 357.085 (211 seconds) was a batch update of 2,000 records.
From 648.239 to 1006.387 (358 seconds) was a run of individual updates covering the same 2,000 records.

This was the first time we noticed the batch process taking less time than the individual inserts/updates, but it makes sense.  Still, ~105 ms per record to update 2,000 records isn't great.

Please let me know if there is any additional information I can provide.

Thanks again,
Tim
neo4j-gc.log

Timothy Braun

Jan 19, 2012, 5:12:22 PM
to ne...@googlegroups.com
Oops, I duplicated some of the content in that log; here's a corrected log.
neo4j-gc.log

Michael Hunger

Jan 19, 2012, 5:23:36 PM
to ne...@googlegroups.com
The GC log looks OK.

So it must be something else causing the CPU spikes; I would love to profile that.

Is it possible to run your PHP scripts from a PHP shell?

Michael


Timothy Braun

Jan 19, 2012, 5:26:01 PM
to ne...@googlegroups.com
These scripts are triggered via the PHP CLI, so yes, I would imagine so. What are you hoping for?

Michael Hunger

Jan 19, 2012, 5:28:27 PM
to ne...@googlegroups.com
To see in a profiler where the time is actually spent in the Neo4j Java process.

I don't know your level of expertise there, but perhaps we can set up an AWS instance running Neo4j with profiling enabled and you point your scripts at it?

What AWS region are you in, and did you use a public AMI to set up the server?

Michael

Timothy Braun

Jan 19, 2012, 5:38:13 PM
to ne...@googlegroups.com
We are running this instance in us-east (Virginia).

The AMI is a private one we use, but it's based on Ubuntu Server 9.04, I believe.

If you set up an instance, I would be more than happy to point my scripts at it as a test. I can even throw the DB up on S3 somewhere and give you access to it for the test.

Thanks,
Tim
