import substantial-ish graph into Neo4j

Friso van Vollenhoven

Feb 22, 2012, 4:10:33 AM
to Neo4j
Hi all,

I am new to Neo4j, so I thought I'd go here to possibly figure out
whether I am doing the right thing...

I have a graph that I'd like to import into Neo4j and subsequently
access through the REST API. Here are some numbers and facts:
- Graph is roughly 17M nodes and 85M edges (don't know about the
degree distribution, but I could figure that out, if it makes a
difference).
- Each node has four String properties, one of ~100 chars, three of <
10 chars
- Each edge has two array properties, both an array of longs (1 <
average length < 2)
- I am importing the graph through the REST batch api, doing about
1000 operations per request / transaction
- I enabled auto indexing on nodes for two properties (the 100 char
one and one of the smaller ones)
- I turned the node auto index into a fulltext one as per:
http://docs.neo4j.org/chunked/stable/rest-api-configurable-auto-indexes.html
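
For readers following along, here is a minimal sketch of how the "about 1000 operations per request" batching described above might look in Python. It only builds the JSON bodies intended for POST requests to the server's /db/data/batch endpoint and does not send anything. The property names, the sample data, and the `chunked` and `node_batch_payload` helpers are assumptions made for illustration, not the actual import script.

```python
import json
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def node_batch_payload(nodes):
    """Build the JSON body for one batch request that creates nodes.

    Each job description carries a method, a target path, a body and an id
    that is local to the request.
    """
    ops = [
        {"method": "POST", "to": "/node", "body": props, "id": i}
        for i, props in enumerate(nodes)
    ]
    return json.dumps(ops)


# Hypothetical node records; the property names are made up.
nodes = ({"name": "node-%d" % n, "kind": "a"} for n in range(2500))
payloads = [node_batch_payload(chunk) for chunk in chunked(nodes, 1000)]
```

Each entry in `payloads` would then be sent as one request, which the server executes as one transaction.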

So the question is: is this a sensible approach? Another option that
crossed my mind is to create a Java importer and then move the result
into the server for querying when it's done. Would that speed things up?

Also: what kind of memory settings do I need? I can tune -Xms and
-Xmx in the wrapper config, but does it make sense to do some more GC
tuning? Should I worry about which garbage collector the server uses?
I am not looking for low latency access / queries during import,
throughput matters most to me. I will only do queries once the whole
graph is loaded. Once loaded, the server will serve to one or two
users only and it will be read only, so not a lot of concurrency
there. I could even restart the server after the import with different
settings if that would help.

I am running version 1.6 on OS X and have 16GB of physical RAM.


Thanks for any advice! Cheers,
Friso

Michael Hunger

Feb 22, 2012, 8:14:23 AM
to ne...@googlegroups.com
Friso,

your approach sounds sensible. Really good choices and a good understanding of the possibilities.

Of course using a Java (batch) importer would be faster (probably by one to two orders of magnitude).

What you can do with the REST-API is employ multi-threading (to a sensible limit) to insert your data in parallel.
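
A rough sketch of what bounded parallel sending could look like in Python, assuming the batch payloads are independent of each other (for example, node-only batches). The `send_all` helper and the `fake_send` stand-in are hypothetical; a real `send` would POST one payload to the server and return its response.

```python
from concurrent.futures import ThreadPoolExecutor


def send_all(payloads, send, max_workers=4):
    """Send payloads with a small, fixed number of worker threads.

    Results come back in the same order as the payloads.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, payloads))


def fake_send(payload):
    """Stand-in for an HTTP POST; just reports the payload length."""
    return len(payload)


results = send_all(["[]", "[{}]"], fake_send, max_workers=2)
```

Keeping `max_workers` small matches the "sensible limit" mentioned above; relationship batches that refer to node IDs can only be sent once those nodes exist.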

During insert you might tweak the cache_type setting in neo4j.properties to weak.

-Xmx4G should be OK. It might be interesting to try a bigger heap, but then increase the young generation size to, say, 2-3G so that you don't run into long GC pauses.

After the import you should have a look at your data/graph.db directory and adjust the memory-mapping settings in neo4j.properties to match the file-sizes on disk so that neo4j is able to memory-map all the files completely.
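
One way to do that check is a small script that reads the store file sizes and prints matching settings. This is a sketch under assumptions: the store file names and the `<file>.mapped_memory` setting names follow the Neo4j 1.x conventions as far as I know them, the path `data/graph.db` is taken from the text above, and `suggest_mapped_memory` is a made-up helper. Verify the names against your own neo4j.properties before using the output.

```python
import math
import os

# Store files assumed to have a matching "<file>.mapped_memory" setting.
STORE_FILES = [
    "neostore.nodestore.db",
    "neostore.relationshipstore.db",
    "neostore.propertystore.db",
    "neostore.propertystore.db.strings",
    "neostore.propertystore.db.arrays",
]


def suggest_mapped_memory(store_dir):
    """Return one settings line per store file found, sized to cover the file."""
    lines = []
    for name in STORE_FILES:
        path = os.path.join(store_dir, name)
        if not os.path.isfile(path):
            continue
        # Round up to whole megabytes so the file fits completely.
        size_mb = max(1, math.ceil(os.path.getsize(path) / (1024 * 1024)))
        lines.append("%s.mapped_memory=%dM" % (name, size_mb))
    return lines


if __name__ == "__main__":
    for line in suggest_mapped_memory("data/graph.db"):
        print(line)
```

The total of the suggested values has to fit in RAM next to the JVM heap, since memory-mapped files live outside the heap.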

In the end most of the time will be spent in JSON String parsing and formatting and building up the results of the rest-batch-importer in memory.

Otherwise your data seems like a really good fit.

Will the server run on the mac in the end?

What language / library are you using to access the REST API?

HTH

Michael

Friso van Vollenhoven

Feb 22, 2012, 11:48:27 AM
to ne...@googlegroups.com
Hi Michael,

Thanks for answering!

If a Java based importer will give me a 10x bump, I'm definitely going to try that. As I understood, you just create an embedded DB from Java and later on copy that into the server installation, right? One more thing: can you create full text automatic indexes through the Java API (instead of exact ones)?

The import script is Python. The graph is built in Hadoop. My hacked-together Python code takes the output files from Hadoop and talks to the REST API directly (no library). I get two files out of the Hadoop job, one with a list of nodes and one with a list of edges. Nodes are identified by a domain specific ID. I create the nodes first and then keep an in-memory map of domain specific ID -> node ID, so that I don't have to look up the node IDs through an index or anything again when creating the edges. It would be involved to turn this into something multi-threaded. Rewriting in Java is a lot less work...
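
The two-pass scheme described above (nodes first, then edges resolved through an in-memory map) could be sketched roughly as follows. The `build_id_map` and `resolve_edges` helpers, the sample IDs, and the counter standing in for node creation are all assumptions for illustration; in the real script the node ID would come back from the server.

```python
import itertools


def build_id_map(domain_ids, create_node):
    """First pass: create every node and remember domain ID -> node ID."""
    id_map = {}
    for domain_id in domain_ids:
        id_map[domain_id] = create_node(domain_id)
    return id_map


def resolve_edges(edges, id_map):
    """Second pass: translate (source, target) domain IDs into node IDs."""
    for src, dst in edges:
        yield id_map[src], id_map[dst]


# Stand-in for node creation: hand out sequential IDs instead of calling a server.
counter = itertools.count()
id_map = build_id_map(["a", "b", "c"], lambda domain_id: next(counter))
resolved = list(resolve_edges([("a", "b"), ("c", "a")], id_map))
```

The map avoids one index lookup per edge endpoint, at the cost of holding every domain ID in memory for the duration of the import.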

For reading / querying, there is a simple HTML / JavaScript based UI on top of everything that talks directly to the REST API. I can enter a Cypher query and see the results, do some highlighting of paths and look into node properties. It's very basic. I am working on ad hoc / prototype stuff, so this will never become a production setup (famous last words), which is why I keep it on my Mac. As long as everything fits in RAM, it should be fine (IO is terrible on the Mac, especially with full disk encryption, which I have).


Thanks,
Friso

Michael Hunger

Feb 22, 2012, 4:17:35 PM
to ne...@googlegroups.com
On 22.02.2012 at 17:48, Friso van Vollenhoven wrote:

> Hi Michael,
>
> Thanks for answering!
>
> If a Java based importer will give me a 10x bump, I'm definitely going to try that. As I understood, you just create an embedded DB from Java and later on copy that into the server installation, right? One more thing: can you create full text automatic indexes through the Java API (instead of exact ones)?

Yes on both counts.


> The import script is Python. The graph is built in Hadoop. My hacked-together Python code takes the output files from Hadoop and talks to the REST API directly (no library). I get two files out of the Hadoop job, one with a list of nodes and one with a list of edges. Nodes are identified by a domain specific ID. I create the nodes first and then keep an in-memory map of domain specific ID -> node ID, so that I don't have to look up the node IDs through an index or anything again when creating the edges. It would be involved to turn this into something multi-threaded. Rewriting in Java is a lot less work...

Yep, that should be super fast with the batch inserter. See an example here: https://github.com/jexp/batch-import


> For reading / querying, there is a simple HTML / JavaScript based UI on top of everything that talks directly to the REST API. I can enter a Cypher query and see the results, do some highlighting of paths and look into node properties. It's very basic. I am working on ad hoc / prototype stuff, so this will never become a production setup (famous last words), which is why I keep it on my Mac. As long as everything fits in RAM, it should be fine (IO is terrible on the Mac, especially with full disk encryption, which I have).

Ouch to the disk encryption :) Do you have an SSD? That should help. Otherwise, since the graph is read-only, it would be interesting to set it up on a RAM disk (if you ever encounter issues). Pre-reading the store files once on startup should speed things up a lot as well:

for i in *; do dd if="$i" of=/dev/null bs=1000000; done
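
For anyone who would rather do the pre-read from Python, a rough equivalent of the dd loop is sketched below. `warm_page_cache` is a made-up helper name, and whether the data stays in the OS file cache afterwards depends on how much free RAM there is.

```python
import os


def warm_page_cache(directory, block_size=1000000):
    """Read every regular file in `directory` once and discard the data.

    Returns the total number of bytes read. Subdirectories are skipped.
    """
    total = 0
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                total += len(block)
    return total
```

Calling `warm_page_cache("data/graph.db")` before starting the server would touch the same top-level store files as the shell loop.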

Mattias Persson

Feb 23, 2012, 11:57:10 AM
to ne...@googlegroups.com


2012/2/22 Michael Hunger <michael...@neotechnology.com>


> On 22.02.2012 at 17:48, Friso van Vollenhoven wrote:
>
>> Hi Michael,
>>
>> Thanks for answering!
>>
>> If a Java based importer will give me a 10x bump, I'm definitely going to try that. As I understood, you just create an embedded DB from Java and later on copy that into the server installation, right? One more thing: can you create full text automatic indexes through the Java API (instead of exact ones)?
>
> Yes on both counts.



--
Mattias Persson, [mat...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com

Friso van Vollenhoven

Feb 23, 2012, 12:39:31 PM
to ne...@googlegroups.com
Hi All,
Thanks for all the help. I've rewritten the insertion in Java and it's indeed *a lot* faster.

Thanks,
Friso