batch import performance breakdown after 30M edges
From: Gergely Svigruha <sgerg...@gmail.com>
Date: Wed, 3 Oct 2012 05:23:42 -0700 (PDT)
Local: Wed, Oct 3 2012 8:23 am
Subject: batch import performance breakdown after 30M edges

Hi,

I use Neo4j Community 1.8 on a Linux 3.4.6-2.10-desktop machine (4 cores, 32G
RAM); my JDK version is 1.7.0_04. I start the JVM with a maximum of 4G RAM
(-Xmx4096m). I am inserting a graph with 1M vertices and 70M edges, stored in
two CSV files. I've observed that after inserting 30M edges
the performance breaks down. For the first 30M edges it takes about 2 seconds
on average to insert 1M edges; after that it takes 40-60 seconds per 1M edges.
What could be the cause, and how can I improve the performance?
Thanks!

Greg

*My code*

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.RelationshipType;
import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.kernel.impl.util.FileUtils;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class GraphImporter {

    private long nodeIdx = 0;

    // Maps the ID used in the CSV files to the Neo4j node ID.
    private Map<Long, Long> nodeMap = new HashMap<Long, Long>();

    enum RelType implements RelationshipType {
        KNOWS
    }

    // Creates a node for pnum unless one was already created.
    private void createNode(long pnum, BatchInserter db, Map<String, Object> prop) {
        if (!nodeMap.containsKey(pnum)) {
            nodeIdx++;
            nodeMap.put(pnum, nodeIdx);
            prop.put("Id", pnum);
            db.createNode(nodeIdx, prop);
        }
    }

    private long getNodeNum(long pnum) throws Exception {
        if (nodeMap.containsKey(pnum)) {
            return nodeMap.get(pnum);
        } else {
            throw new Exception("Missing person: " + pnum);
        }
    }

    public static void main(String[] args) {
        GraphImporter importer = new GraphImporter();
        importer.load(args[0], args[1], args[2]);
    }

    private void load(String vertexFile, String edgeFile, String dbpath) {
        BatchInserter db = null;
        BufferedReader reader = null;
        long timestmp = 0;
        try {
            // Start from an empty store directory.
            File graphDb = new File(dbpath);
            if (graphDb.exists()) {
                FileUtils.deleteRecursively(graphDb);
            }
            long nodes = 0;
            long errorRows = 0;
            Map<String, String> config = MapUtil.load(new File("batch.properties"));
            db = BatchInserters.inserter(dbpath, config);

            reader = new BufferedReader(new FileReader(new File(vertexFile)));
            System.out.println("Loading nodes..");
            reader.readLine(); // skip the first line (presumably the header row)
            String line = null;
            while ((line = reader.readLine()) != null) {
                String[] lineData = line.split(",");
                try {
                    Map<String, Object> prop = new HashMap<String, Object>(10);
                    prop.put("City", lineData[3].replace("\"", ""));
                    prop.put("Country", lineData[4].replace("\"", ""));
                    prop.put("Gender", lineData[5].replace("\"", ""));
                    createNode(Long.valueOf(lineData[0].replace("\"", "")), db, prop);
                } catch (NumberFormatException e) {
                    errorRows++;
                }
                nodes++;
                if (nodes % 1000000 == 0) {
                    System.out.println("Nodes: " + nodes + "(" + errorRows + "); " + nodeIdx);
                }
            }
            System.out.println("Total nodes: " + nodes);
            reader.close();

            reader = new BufferedReader(new FileReader(new File(edgeFile)));
            System.out.println("Loading edges..");
            long node1 = 0;
            long node2 = 0;
            reader.readLine(); // skip the first line (presumably the header row)
            long edges = 0;
            errorRows = 0;
            line = null;
            timestmp = System.currentTimeMillis();
            while ((line = reader.readLine()) != null) {
                String[] lineData = line.split(",");
                try {
                    node1 = getNodeNum(Long.valueOf(lineData[0].replace("\"", "")));
                    node2 = getNodeNum(Long.valueOf(lineData[1].replace("\"", "")));
                    db.createRelationship(node1, node2, RelType.KNOWS, null);
                } catch (NumberFormatException e) {
                    errorRows++;
                } catch (Exception e) {
                    e.printStackTrace();
                }
                edges++;
                // Report the elapsed seconds for every 1M edges.
                if (edges % 1000000 == 0) {
                    long currTimestmp = System.currentTimeMillis();
                    System.out.println("Edges: " + edges + " (" + errorRows + ")"
                            + " time: " + (currTimestmp - timestmp) / 1000);
                    timestmp = currTimestmp;
                }
            }
            System.out.println("Data successfully imported!");
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Throwable e) {
            e.printStackTrace();
        } finally {
            try {
                if (db != null) {
                    db.shutdown();
                }
                if (reader != null) {
                    reader.close();
                }
            } catch (Throwable e) {
                e.printStackTrace();
            }
        }
    }
}

*Batch properties*

remote_logging_host=127.0.0.1
forced_kernel_id=
read_only=false
neo4j.ext.udc.host=udc.neo4j.org
logical_log=nioneo_logical.log
online_backup_enabled=false
remote_logging_port=4560
gc_monitor_threshold=200ms
array_block_size=120
load_kernel_extensions=true
neostore.relationshipstore.db.mapped_memory=1000M
node_auto_indexing=false
intercept_committing_transactions=false
keep_logical_logs=true
dump_configuration=true
gc_monitor_wait_time=100ms
cache_type=none
intercept_deserialized_transactions=false
neostore.nodestore.db.mapped_memory=200M
neo4j.ext.udc.first_delay=600000
neo4j.ext.udc.reg=unreg
lucene_searcher_cache_size=2147483647
neo4j.ext.udc.interval=86400000
use_memory_mapped_buffers=true
rebuild_idgenerators_fast=true
neostore.propertystore.db.index.keys.mapped_memory=5M
neostore.propertystore.db.strings.mapped_memory=200M
neostore.propertystore.db.arrays.mapped_memory=130M
neo_store=neostore
logging.threshold_for_rotation=104857600
neostore.propertystore.db.index.mapped_memory=5M
backup_slave=false
neostore.propertystore.db.mapped_memory=2000M
gcr_cache_min_log_interval=60s
relationship_grab_size=100
relationship_auto_indexing=false
string_block_size=120
lucene_writer_cache_size=2147483647
node_cache_array_fraction=1.0
grab_file_lock=true
remote_logging_enabled=false
allow_store_upgrade=false
neo4j.ext.udc.enabled=true
execution_guard_enabled=false
relationship_cache_array_fraction=1.0
online_backup_port=6362


 
From: Gergely Svigruha <sgerg...@gmail.com>
Date: Wed, 3 Oct 2012 21:08:10 -0700 (PDT)
Local: Thurs, Oct 4 2012 12:08 am
Subject: Re: batch import performance breakdown after 30M edges

...when I set "neostore.relationshipstore.db.mapped_memory=4000M" it
remains fast (the load finishes in 5 minutes), so this solves it...

By the way, do you have any estimate of how long it would take (hours, days,
weeks) to load a considerably larger graph (100M nodes, 10B edges) on a
Linux server with 32G RAM, a 4 core I9 processor and HDD disks?

Greg

On Wednesday, October 3, 2012 at 19:23:42 UTC+7, Gergely Svigruha wrote:


 
From: Friso van Vollenhoven <f.van.vollenho...@gmail.com>
Date: Thu, 4 Oct 2012 10:37:50 +0200
Local: Thurs, Oct 4 2012 4:37 am
Subject: Re: [Neo4j] Re: batch import performance breakdown after 30M edges

The problem is that not all of your data fit into memory with the
old setting. This brings us to the second question: it depends on whether all
of your data still fits in memory or not. If it does, the running time of
insertion should increase roughly linearly with the number of nodes and
edges. If not, you will hit disk (a lot) and be orders of magnitude slower.

On a Linux box with 32GB RAM, creating an 80GB DB easily takes us 12 to 16
hours (because of paging in and out). This is only 30M nodes and 680M
relationships, but is heavy on properties (on both nodes and edges). This
is clearly disk IO (seek) bound. It's fast in the beginning, but then after
a while it starts paging and slows down.

We are, as a side project, working on a distributed way to create a Neo4j
database using Hadoop MapReduce. We are slowly getting somewhere, but it's
far from complete. We will shout something to the list when it becomes
slightly useful.

Friso

From: Michael Hunger <michael.hun...@neotechnology.com>
Date: Thu, 4 Oct 2012 10:59:33 +0200
Local: Thurs, Oct 4 2012 4:59 am
Subject: Re: [Neo4j] Re: batch import performance breakdown after 30M edges

How many properties do you have on your nodes and relationships?

And how do you identify the nodes to connect?

Please note that your node-map will be limited with respect to memory space; perhaps you'd rather use an external service like Redis or a more memory-efficient collection (like the Trove collections or perhaps even an int-array).

You should try to use as much memory as possible for the memory-mapped files, e.g. 20G in total in your case, so that there is still memory available for the OS, the OS filesystem caches and the JVM heap (which should be around 4-8G).
see also:

http://docs.neo4j.org/chunked/milestone/configuration-io-examples.htm...
http://docs.neo4j.org/chunked/milestone/batchinsert.html

I think it also makes sense to pre-sort your edge rows by start-id, end-id so that you hit similar memory-mapped windows during the import.
It also makes sense to clean up the CSV files once and not on every import.
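
A minimal sketch of that one-off preparation step, assuming the edge file fits in memory and has the start ID in the first column and the end ID in the second. The class and method names (EdgePreSorter, cleanAndSort, toCsv) are made up for illustration and are not part of Neo4j; for files too large for the heap an external sort would be needed instead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper: strips quotes once and sorts edges by (start, end). */
final class EdgePreSorter {

    /** Parses rows like "\"5\",\"2\"" into {start, end} pairs, sorted by start ID, then end ID. */
    static List<long[]> cleanAndSort(List<String> rawRows) {
        List<long[]> edges = new ArrayList<long[]>(rawRows.size());
        for (String row : rawRows) {
            String[] cols = row.replace("\"", "").split(",");
            edges.add(new long[] {
                Long.parseLong(cols[0].trim()),
                Long.parseLong(cols[1].trim())
            });
        }
        Collections.sort(edges, new Comparator<long[]>() {
            public int compare(long[] a, long[] b) {
                int byStart = Long.compare(a[0], b[0]);
                return byStart != 0 ? byStart : Long.compare(a[1], b[1]);
            }
        });
        return edges;
    }

    /** Renders the sorted edges as quote-free CSV rows, ready to be written out once. */
    static List<String> toCsv(List<long[]> edges) {
        List<String> rows = new ArrayList<String>(edges.size());
        for (long[] e : edges) {
            rows.add(e[0] + "," + e[1]);
        }
        return rows;
    }
}
```

The importer can then read the cleaned file without calling replace() on every field of every row.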

In general the batch-importer should be able to sustain the 1M nodes & edges per second.

If you have an SSD that helps a lot.

HTH

Michael

On 04.10.2012 at 06:08, Gergely Svigruha wrote:

From: Gergely Svigruha <sgerg...@gmail.com>
Date: Thu, 4 Oct 2012 02:01:26 -0700 (PDT)
Local: Thurs, Oct 4 2012 5:01 am
Subject: Re: [Neo4j] Re: batch import performance breakdown after 30M edges

Cool, thanks for the explanation. So when Neo4j claims to handle 30B nodes
and edges is it actually possible to build this enormous graph as of today,
or is it more like a theoretical upper bound? Or Neo4j could handle 30B
nodes if someone had the patience to wait until the graph is ready:)

I'm curious because I'm working on a project where we are going to have to
take relatively small traversals in a huge graph. I assume once the graph
is ready it won't take too much time to visit some neighbours of a
particular node using Neo4j no matter how huge the graph is, so the
challenge is to build the graph DB. Or do I miss something?

On Thursday, October 4, 2012 at 15:37:52 UTC+7, Friso van Vollenhoven wrote:

From: Gergely Svigruha <sgerg...@gmail.com>
Date: Thu, 4 Oct 2012 02:20:51 -0700 (PDT)
Local: Thurs, Oct 4 2012 5:20 am
Subject: Re: [Neo4j] Re: batch import performance breakdown after 30M edges

The nodes will have names, unfortunately long Strings; the relationships
will probably have some dates and numbers (long). The node names will
probably have to be
indexed. I'm pretty sure that in the long run the graph is not going to fit
in memory.

That pre-sort seems to be a good idea. When I use Neo4j, how does it handle
the memory cache? Is there any way to make it more efficient, not just for
import but also for traversals? For example, when I load one node, is there
any way to improve the probability that the neighbours of the node are also
going to be cached? Or does it cache every node when it's first referenced?

The 1M edges / sec was also my experience until it had to start swapping.

On Thursday, October 4, 2012 at 15:59:41 UTC+7, Michael Hunger wrote:

From: Friso van Vollenhoven <f.van.vollenho...@gmail.com>
Date: Thu, 4 Oct 2012 11:58:59 +0200
Local: Thurs, Oct 4 2012 5:58 am
Subject: Re: [Neo4j] Re: batch import performance breakdown after 30M edges

Hi Michael,

We have two different databases, both representing financial networks. One
type has each individual transaction that ever happened (during the time
period being imported, usually 6 months) as edge properties using two
arrays of longs, one for timestamps and one for amount of money (amount in
cents, and yes, we need a long for that because it goes out of int range in
some cases).

I keep the mapping between domain ID ==> Neo4j Node ID in memory during the
import. Because my domain IDs are ints, I can use a Java array ( final
long[30000000] ) for this, which is as memory-efficient as it gets in the
JVM (no HashMap, so the object overhead is about 20 bytes and longs are 8
bytes so I get good alignment on x64 for free). I can get away with a 2.5G
heap for this.
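
A minimal sketch of such an array-backed mapping, assuming the domain IDs are non-negative ints below a known upper bound. The class name DenseIdMap and its methods are invented for illustration; they are not part of Neo4j.

```java
import java.util.Arrays;

/** Hypothetical map from dense int domain IDs to long node IDs, backed by one primitive array. */
final class DenseIdMap {
    private static final long MISSING = -1L; // marks slots that were never assigned
    private final long[] nodeIds;

    DenseIdMap(int capacity) {
        nodeIds = new long[capacity];
        Arrays.fill(nodeIds, MISSING);
    }

    void put(int domainId, long nodeId) {
        nodeIds[domainId] = nodeId;
    }

    boolean contains(int domainId) {
        return domainId >= 0 && domainId < nodeIds.length && nodeIds[domainId] != MISSING;
    }

    /** Returns the node ID registered for domainId, or throws if there is none. */
    long get(int domainId) {
        if (!contains(domainId)) {
            throw new IllegalArgumentException("Missing domain ID: " + domainId);
        }
        return nodeIds[domainId];
    }
}
```

With 30,000,000 slots of 8 bytes each, the array itself takes about 240 MB, with no per-entry objects.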

Memory maps are maxed out, as you suggest, following roughly the size
distribution of the different files (nodes vs. edges). I don't map that
much of the properties files, but I think there shouldn't be that many
seeks there (just writing sequentially, as you only add properties).
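
One way to split a memory budget "following roughly the size distribution of the different files" could look like the sketch below. The class MappedMemoryPlan, its inputs and the proportional policy are assumptions made for illustration; only the neostore.*.mapped_memory key pattern is taken from the configs posted in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical planner: shares a mapped-memory budget in proportion to expected store sizes. */
final class MappedMemoryPlan {

    /**
     * @param expectedStoreMb expected size in MB per store, keyed by e.g. "nodestore.db"
     * @param budgetMb        total MB available for memory mapping
     * @return config entries such as neostore.nodestore.db.mapped_memory=50M
     */
    static Map<String, String> plan(Map<String, Long> expectedStoreMb, long budgetMb) {
        long totalMb = 0;
        for (long mb : expectedStoreMb.values()) {
            totalMb += mb;
        }
        Map<String, String> config = new LinkedHashMap<String, String>();
        for (Map.Entry<String, Long> e : expectedStoreMb.entrySet()) {
            long share = totalMb == 0 ? 0 : budgetMb * e.getValue() / totalMb;
            // Never map more than the store is expected to need.
            long mb = Math.min(share, e.getValue());
            config.put("neostore." + e.getKey() + ".mapped_memory", mb + "M");
        }
        return config;
    }
}
```

When the budget covers every store, each store is simply mapped in full; otherwise each gets its proportional share.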

Edge file is sorted but the other way around (end id, start id). I guess
this has roughly the same effect. Is this correct? Most of our very dense
nodes have a large in-degree and small out-degree. The financial network
has quite a few nodes with large degrees - millions - that connect
everything all over the place (tax collectors, utilities companies, large
telco's, etc.), which make it harder to get locality advantages all the
time.

I am not sure what you mean by 'cleanup the csv file'. Can you explain? We
read our csv over the network, so it doesn't pollute the local FS caches.

I easily get the 1M nodes / edges per second, as long as there is no paging
happening. My disk is 7200RPM SATA, so nothing fancy there. We typically
abuse one of our worker Hadoop nodes for doing Neo4j batch imports, so also
no RAID (even though the boxes have 12 of these disks).

One thing I haven't looked into, but perhaps you can explain is what the
role of caches is during batch insertion (with a write only work load).
Would it add anything? Or just compete for memory with the memory mapping?

I am not blaming Neo4j for being slow or anything. I am just assuming that
our box is not big enough for the work load. If I am doing something
totally wrong and it can be a lot faster, that'd be great of course.

Thanks,
Friso

On Thu, Oct 4, 2012 at 10:59 AM, Michael Hunger <

From: Lasse Westh-Nielsen <lasse.westh-niel...@neopersistence.com>
Date: Thu, 4 Oct 2012 11:42:29 +0100
Local: Thurs, Oct 4 2012 6:42 am
Subject: Re: [Neo4j] Re: batch import performance breakdown after 30M edges

On Thu, Oct 4, 2012 at 10:01 AM, Gergely Svigruha <sgerg...@gmail.com> wrote:
> Cool, thanks for the explanation. So when Neo4j claims to handle 30B nodes
> and edges is it actually possible to build this enormous graph as of today,
> or is it more like a theoretical upper bound? Or Neo4j could handle 30B
> nodes if someone had the patience to wait until the graph is ready:)

I believe that is just the size of the identifiers we currently use
(35 bits). If we allocate more bits, we can get even bigger DBs. So
yes, the problem would be actually filling the DB in the first place
:)
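
As a quick check of that figure, the sketch below computes how many distinct values a 35-bit identifier can take. The 35-bit width is taken from the message above; the class name IdSpace is made up, and whether every one of those IDs is usable in practice is not something this sketch verifies.

```java
/** Hypothetical helper for identifier-space arithmetic. */
final class IdSpace {

    /** Number of distinct values an identifier with the given bit width can take. */
    static long capacity(int bits) {
        return 1L << bits;
    }

    public static void main(String[] args) {
        // 2^35 = 34359738368, i.e. roughly 34 billion
        System.out.println(capacity(35));
    }
}
```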

> I'm curious because I'm working on a project where we are going to have to
> take relatively small traversals in a huge graph. I assume once the graph is
> ready it won't take too much time to visit some neighbours of a particular
> node using Neo4j no matter how huge the graph is, so the challenge is to
> build the graph DB. Or do I miss something?

Nope, you are spot on, graph-local queries are a particular sweet spot
for Neo4j.

 
From: Marko Kevac <ma...@kevac.org>
Date: Mon, 15 Oct 2012 00:55:28 -0700 (PDT)
Local: Mon, Oct 15 2012 3:55 am
Subject: Re: batch import performance breakdown after 30M edges

I have the same problem with 10 million nodes and 2 billion relationships.
It looks like this:

........................................................................... .........................
19633 ms for 10000000
........................................................................... .........................
20871 ms for 10000000
........................................................................... .........................
22767 ms for 10000000
........................................................................... .........................
23296 ms for 10000000
........................................................................... .........................
23286 ms for 10000000
........................................................................... .........................
23988 ms for 10000000
........................................................................... .........................
25374 ms for 10000000
........................................................................... .........................
1197765 ms for 10000000
........................................................................... .........................
8839674 ms for 10000000
........................................................................... .........................
15733633 ms for 10000000
........................................................................... .........................
17917691 ms for 10000000

Performance degradation is so drastic that batch importing is unusable.
What can I do?

I am using https://github.com/jexp/batch-import

iotop shows that java process is doing only approx 1Mb/sec writes. CPU is
almost always 0%. Memory used (RSS) is 22 Gb.
My server has 128Gb of RAM.

$ cat batch.properties
dump_configuration=true
cache_type=none
use_memory_mapped_buffers=true
neostore.propertystore.db.index.keys.mapped_memory=5G
neostore.propertystore.db.index.mapped_memory=5G
neostore.nodestore.db.mapped_memory=100G
neostore.relationshipstore.db.mapped_memory=80G
neostore.propertystore.db.mapped_memory=5G
neostore.propertystore.db.strings.mapped_memory=5G
#node_auto_indexing=true
#node_keys_indexable=Name

And I am using 40Gb heap (-Xmx40G).

From: Paul Lam <paul....@forward.co.uk>
Date: Mon, 15 Oct 2012 03:20:50 -0700 (PDT)
Local: Mon, Oct 15 2012 6:20 am
Subject: Re: batch import performance breakdown after 30M edges

Not that this is a solution, but I noticed that your nodestore and
relationshipstore memory sizes do not seem to be optimal and are set to be
more than the heap
size. http://docs.neo4j.org/chunked/stable/configuration-io-examples.html#c...

From: Peter Neubauer <peter.neuba...@neotechnology.com>
Date: Tue, 16 Oct 2012 15:28:47 +0200
Local: Tues, Oct 16 2012 9:28 am
Subject: Re: [Neo4j] Re: batch import performance breakdown after 30M edges
Gergely,
did you sort out the configuration? If not, let me ping you off list
for some support?

Cheers,

/peter neubauer

G:  neubauer.peter
S:  peter.neubauer
P:  +46 704 106975
L:   http://www.linkedin.com/in/neubauer
T:   @peterneubauer

Neo4j 1.8 GA - http://www.dzone.com/links/neo4j_18_release_fluent_graph_literacy.html

From: Michael Hunger <michael.hun...@neotechnology.com>
Date: Mon, 22 Oct 2012 02:16:07 +0200
Local: Sun, Oct 21 2012 8:16 pm
Subject: Re: [Neo4j] Re: batch import performance breakdown after 30M edges

Marko,

it tries to map/unmap relationship store file segments to memory for your relationships.

How many properties do you store on your relationships, and of which types?

Can you list the current size of your store files after the last import?

They will end up at about 90 MB for nodes (9 bytes per record), 66 GB for relationships (33 bytes per record), and xx times 38 for properties (probably / 4).
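
The arithmetic behind those node and relationship figures can be sketched as follows, using the record sizes quoted above (9 bytes per node, 33 bytes per relationship) and the 10 million nodes and 2 billion relationships mentioned earlier in the thread. The class name StoreSizeEstimate is made up; the property store is left out because its record count is not given.

```java
/** Hypothetical estimator for store file sizes from record counts. */
final class StoreSizeEstimate {
    static final long NODE_RECORD_BYTES = 9;  // record size as quoted in the thread
    static final long REL_RECORD_BYTES = 33;  // record size as quoted in the thread

    static long nodeStoreBytes(long nodes) {
        return nodes * NODE_RECORD_BYTES;
    }

    static long relStoreBytes(long rels) {
        return rels * REL_RECORD_BYTES;
    }

    public static void main(String[] args) {
        long nodes = 10_000_000L;
        long rels = 2_000_000_000L;
        System.out.println("node store bytes: " + nodeStoreBytes(nodes));
        System.out.println("relationship store bytes: " + relStoreBytes(rels));
    }
}
```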

Can you try to pre-sort the edges by startnode-endnode?

Your MMIO config doesn't work because it declares too much memory (with the 100G for the nodes). I suggest adapting it to use about 100G in total, distributed as follows:
Please note that I changed the nodestore from gigabytes to megabytes!

> dump_configuration=true
> cache_type=none
> use_memory_mapped_buffers=true
> neostore.propertystore.db.index.keys.mapped_memory=1G
> neostore.propertystore.db.index.mapped_memory=1G
> neostore.nodestore.db.mapped_memory=100M
> neostore.relationshipstore.db.mapped_memory=60G
> neostore.propertystore.db.mapped_memory=30G
> neostore.propertystore.db.strings.mapped_memory=10G

And running the JVM with 10G heap should be enough. You might also add -XX:NewSize=2G to have a larger young generation heap.

On 15.10.2012 at 09:55, Marko Kevac wrote:

From: Michael Hunger <michael.hun...@neotechnology.com>
Date: Mon, 22 Oct 2012 02:26:17 +0200
Local: Sun, Oct 21 2012 8:26 pm
Subject: Re: [Neo4j] Re: batch import performance breakdown after 30M edges

Usually you would use a server with more RAM and SSD disks for that.

Make sure your MMIO settings are adapted to your nodestore (1G), property-store (1-2G) and relationship-store (20G).

Usually importing nodes is fast, for importing rels it makes sense to pre-sort them by startnode-endnode so that the importer doesn't have to swap in/out rel-store-file-segments that often (this is what is expensive).

If it doesn't have to swap the segments you end up between 500k and 1M rels per second.

For your larger import you should probably also change your node-map to a GNU Trove int-to-int map or alternatively to an int-array so that it consumes less memory.

HTH

Michael

P.S. we should probably start offering import services for large neo4j datastores :)

On 04.10.2012 at 06:08, Gergely Svigruha wrote:


 
From: Michael Hunger <michael.hun...@neotechnology.com>
Date: Mon, 22 Oct 2012 02:35:06 +0200
Local: Sun, Oct 21 2012 8:35 pm
Subject: Re: [Neo4j] Re: batch import performance breakdown after 30M edges

Hi Friso,

# You might even get away with an int-array (which is good enough for 2.4bn entries).
# You should probably try to map your node-store fully.
# Good question about the ordering. I think it might be OK too; the main objective there is to reduce the random swap in/out.
# It might be sensible to keep the dense nodes until the end?
# By CSV file cleanup I mean the removal of quotes etc., which should rather be done once in the CSV file (together with the sorting) so there is less string operation overhead.
# On my Mac with a well-configured batch-inserter I have seen write speeds on an SSD of up to 170M/s.
# Switch caches off (cache_type=none).
# How long does it take to read your CSV file (w/o creating Neo4j stuff) over the network (is it gzipped)? Perhaps just put it on another disk. Good point about the FS caches, that's something to try out. It would probably also be interesting to read & prepare the input data on another thread and then have one dedicated thread just for writing Neo4j data.
# Try to increase the new size, e.g. -XX:NewSize=2G, to have less tenured-space GC.
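
The idea of parsing on one thread and writing on another can be sketched with a bounded queue, as below. This is only an illustration under assumptions: TwoThreadImport and EdgeSink are hypothetical names, the sink stands in for whatever actually calls the batch inserter, and malformed rows are skipped in the same spirit as the error counter in the original importer.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical two-thread pipeline: one thread parses rows, the caller's thread writes them. */
final class TwoThreadImport {
    private static final long[] END = new long[0]; // sentinel marking the end of input

    /** Hypothetical sink; a real import would create the relationship here. */
    interface EdgeSink {
        void write(long start, long end);
    }

    static void run(final Iterable<String> rows, EdgeSink sink) {
        final BlockingQueue<long[]> queue = new ArrayBlockingQueue<long[]>(10000);
        Thread parser = new Thread(new Runnable() {
            public void run() {
                try {
                    for (String row : rows) {
                        String[] cols = row.split(",");
                        if (cols.length < 2) {
                            continue; // skip short rows
                        }
                        try {
                            long start = Long.parseLong(cols[0].trim());
                            long end = Long.parseLong(cols[1].trim());
                            queue.put(new long[] { start, end });
                        } catch (NumberFormatException e) {
                            // skip rows whose IDs are not numeric
                        }
                    }
                    queue.put(END);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        parser.start();
        try {
            // The calling thread is the single dedicated writer.
            for (long[] edge = queue.take(); edge != END; edge = queue.take()) {
                sink.write(edge[0], edge[1]);
            }
            parser.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Import interrupted", e);
        }
    }
}
```

The bounded queue keeps the parser from running far ahead of the writer, so memory use stays predictable.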

HTH

Michael

On 04.10.2012 at 11:58, Friso van Vollenhoven wrote:

From: Friso van Vollenhoven <f.van.vollenho...@gmail.com>
Date: Mon, 22 Oct 2012 07:44:32 +0200
Local: Mon, Oct 22 2012 1:44 am
Subject: Re: [Neo4j] Re: batch import performance breakdown after 30M edges

Hi Michael,

Thanks a lot for the answer.

I did a trial run with the node store mapped fully. It improved a bit.

The CSV comes over the network and must come off another disk, as we take
the machine we use for the import out of the Hadoop cluster
responsibilities (we have to, to make sure we can use all the RAM). We
don't do compression currently, but could. That's a good idea, though
(it's a config switch in Hadoop, so easy to implement). We also thought
about the multi-threaded approach, but didn't implement it yet. Right now
the importer just reads the CSV over the wire in 100MB increments (buffer
size). Glad to hear I am doing the right thing by switching off caches in
Neo4j.

I checked the GC pressure on the importer (using jstat) and it's not a lot.
There are no old gen collections happening during import.

Meanwhile, Kris tells me he has a working Hadoop-based import job that
creates the DB files in a distributed fashion (in about an hour, but there
is probably some room for improvement there). It creates different parts of
the files across the cluster and then you just concatenate those at the end
of the job. This could be a nice starting point for that service of yours...

(I also heard a rumor that Kris is working on a blog post about this
approach. Stay tuned.)

Cheers,
Friso

BCC: Kris

On Mon, Oct 22, 2012 at 2:35 AM, Michael Hunger <
