I use Neo4j Community 1.8 on a Linux 3.4.6-2.10-desktop machine (4 cores, 32G RAM); my JDK version is 1.7.0_04. I start the JVM with a 4G maximum heap (-Xmx4096m). I inserted a graph with 1M vertices and 70M edges, stored in two CSV files. I've observed that performance breaks down after inserting 30M edges. For the first 30M edges it takes 2 seconds on average to insert 1M edges; after that it takes 40-60 seconds per 1M edges. What could be the cause, and how can I improve the performance? Thanks!
...when I set "neostore.relationshipstore.db.mapped_memory=4000M" it remains fast (the load finishes in 5 minutes), so this solves it...
By the way, do you have an estimate of how long it takes (hours, days, weeks) to load a considerably larger graph (100M nodes, 10B edges) on a Linux server with 32G RAM, a 4-core I9 processor and HDD disks?
Greg
On Wednesday, October 3, 2012 at 19:23:42 UTC+7, Gergely Svigruha wrote:
> I use neo4j community 1.8 on a Linux 3.4.6-2.10-desktop machine (4core 32G RAM), my JDK version is 1.7.0_04. I start the JVM with max 4G RAM (-Xmx4096m). I inserted a graph with 1M vertices and 70M edges, stored in two CSV files. I've observed that after inserting 30M edges the performance breaks down. For the first 30M edges it takes avg 2 seconds to insert 1M edge, after that it takes 40-60 second per 1M edges. What can be the cause of it and how can I improve this performance?
> Thanks!
The problem is that not all of your data fit into memory with the
old setting. This brings us to the second question. It depends on whether
all of your data still fits in memory. If it does, the running time of
insertion should increase roughly linearly with the number of nodes and
edges. If not, you will hit disk (a lot) and be orders of magnitude slower.
On a Linux box with 32GB RAM, creating an 80GB DB easily takes us 12 to 16
hours (because of paging in and out). This is only 30M nodes and 680M
relationships, but is heavy on properties (on both nodes and edges). This
is clearly disk IO (seek) bound. It's fast in the beginning, but then after
a while it starts paging and slows down.
We are, as a side project, working on a distributed way to create a Neo4j
database using Hadoop MapReduce. We are slowly getting somewhere, but it's
far from complete. We will shout something to the list when it becomes
slightly useful.
On Thu, Oct 4, 2012 at 6:08 AM, Gergely Svigruha <sgerg...@gmail.com> wrote:
> ...when i set "neostore.relationshipstore.db.mapped_memory=4000M" it
> remains fast (finishes the loading in 5 mins) so this solves it...
> Btw do you have any estimation how much time does it take (hours, days,
> weeks) to load a considerably huger graph (100M nodes, 10B edges) on a
> Linux server with 32G RAM, a 4 core I9 processor and HDD disks?
> Greg
How many properties do you have on your nodes and relationships?
And how do you identify the nodes to connect?
Please note that your node-map will be limited with respect to memory space; perhaps you'd rather use an external service like redis or a more memory-efficient collection (such as the trove-collections or perhaps even an int-array).
You should try to use as much memory as possible for the memory-mapped files, e.g. 20G in total in your case, so that there is still memory available for the OS, the OS filesystem caches and the JVM heap (which should be around 4-8G).
see also:
I think it also makes sense to pre-sort your edge rows by start-id, end-id so that you hit similar memory-mapped windows during the import.
It also makes sense to do the cleanup of the csv files once and not on every import.
In general the batch-importer should be able to sustain 1M nodes & edges per second.
If you have an SSD that helps a lot.
HTH
Michael
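To illustrate the pre-sorting suggestion above, here is a minimal Java sketch (Java 8 or later). It assumes each edge row starts with a numeric start id and end id separated by commas; the class and method names are hypothetical and are not part of Neo4j or the batch importer. It sorts in memory and re-parses ids on every comparison, so for files larger than the heap an external sort would be the better fit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: orders edge rows of the form "startId,endId[,properties...]".
class EdgeSorter {

    // Parses the first two columns of a row as numeric node ids.
    static long[] ids(String row) {
        String[] cols = row.split(",", 3);
        return new long[] { Long.parseLong(cols[0].trim()), Long.parseLong(cols[1].trim()) };
    }

    // Returns a new list ordered by start id, then end id; the input list is not modified.
    static List<String> sortByStartThenEnd(List<String> rows) {
        List<String> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator
                .comparingLong((String r) -> ids(r)[0])
                .thenComparingLong(r -> ids(r)[1]));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("7,2,2012-10-03", "3,9,2012-10-04", "3,1,2012-10-01");
        for (String row : sortByStartThenEnd(rows)) {
            System.out.println(row);
        }
    }
}
```

Sorting once and writing the result back to disk also fits the advice to clean up the csv files once rather than on every import.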
Cool, thanks for the explanation. So when Neo4j claims to handle 30B nodes and edges, is it actually possible to build such an enormous graph today, or is it more of a theoretical upper bound? Or could Neo4j handle 30B nodes if someone had the patience to wait until the graph is ready :)
I'm curious because I'm working on a project where we are going to have to do relatively small traversals in a huge graph. I assume that once the graph is ready it won't take too much time to visit some neighbours of a particular node using Neo4j, no matter how huge the graph is, so the challenge is to build the graph DB. Or am I missing something?
On Thursday, October 4, 2012 at 15:37:52 UTC+7, Friso van Vollenhoven wrote:
> The problem is that not all of your data was fitting into memory with the old setting. This brings us to the second question. It depends, whether all of your data still fits in memory or not. If it does, the running time of insertion should increase somewhat linearly with the number of nodes and edges. If not, you will hit disk (a lot) and be orders of magnitude slower.
> On a Linux box with 32GB RAM, creating a 80GB DB takes us easily 12 to 16 hours (because of paging in and out). This is only 30M nodes and 680M relationships, but is heavy on properties (on both nodes and edges). This is clearly disk IO (seek) bound. It's fast in the beginning, but then after a while starts paging and slows down.
> We are, as a side project, working on a distributed way to create a Neo4j database using Hadoop MapReduce. We are slowly getting somewhere, but it's far from complete. We will shout something to the list when it becomes slightly useful.
> Friso
The nodes will have names, unfortunately long Strings; the relationships will probably have some dates and numbers (long). The node names will probably have to be indexed. I'm pretty sure that in the long run the graph is not going to fit in memory.
That pre-sort seems to be a good idea. When I use Neo4j, how does it handle the memory cache? Is there any way to make it more efficient, not just for import but also for traversals? For example, when I load one node, is there any way to improve the probability that the neighbours of the node are also going to be cached? Or does it cache every node when it's first referenced?
The 1M edges / sec was also my experience until it started swapping.
On Thursday, October 4, 2012 at 15:59:41 UTC+7, Michael Hunger wrote:
> How many properties do you have on your nodes and relationships?
> And how do you identify the nodes to connect?
> Please note that your node-map will be limited wrt to memory space, perhaps you'd rather want to use an external service like redis or a more memory efficient collection (like the trove-collections or perhaps even a int-array).
> You should try to use as much memory as possible for the memory mapping files, e.g. 20G in total in your case so that is still memory available for OS, OS-filesystem-caches and JVM heap (which should be around 4-8G).
> see also:
> I think it also makes sense to pre-sort your edge-rows by start-id, end-id so that you hit similar memory-mapped windows during the import.
> It makes also sense to do the cleanup of the csv files once and not on every import.
> In general the batch-importer should be able to sustain the 1M nodes & edges per second.
> If you have an SSD that helps a lot.
> HTH
> Michael
We have two different databases, both representing financial networks. One
type has each individual transaction that ever happened (during the time
period being imported, usually 6 months) as edge properties using two
arrays of longs, one for timestamps and one for amount of money (amount in
cents, and yes, we need a long for that because it goes out of int range in
some cases).
I keep the mapping from domain ID to Neo4j node ID in memory during the
import. Because my domain IDs are ints, I can use a Java array (a
long[30000000]) for this, which is as memory-efficient as it gets in the
JVM (no HashMap, so the object overhead is only about 20 bytes, and longs
are 8 bytes, so I get good alignment on x64 for free). I can get away with
a 2.5G heap for this.
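A minimal sketch of the array-based mapping described above, assuming domain IDs are dense, non-negative ints. The class name NodeIdMap and its methods are made up for illustration; this is not code from the actual import.

```java
import java.util.Arrays;

// Hypothetical sketch: maps int domain ids to long node ids with one primitive array.
class NodeIdMap {
    private static final long UNSET = -1L;
    private final long[] nodeIds;

    // maxDomainId is the largest domain id that will ever be stored.
    NodeIdMap(int maxDomainId) {
        nodeIds = new long[maxDomainId + 1];
        Arrays.fill(nodeIds, UNSET);
    }

    void put(int domainId, long nodeId) {
        nodeIds[domainId] = nodeId;
    }

    long get(int domainId) {
        long nodeId = nodeIds[domainId];
        if (nodeId == UNSET) {
            throw new IllegalStateException("No node stored for domain id " + domainId);
        }
        return nodeId;
    }

    // 8 bytes per slot; the fixed array header is ignored.
    long approxBytes() {
        return 8L * nodeIds.length;
    }
}
```

With 30,000,000 slots this comes to about 240MB of longs, with no per-entry object overhead.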
Memory maps are maxed out, as you suggest, following roughly the size
distribution of the different files (nodes vs. edges). I don't map that
much of the properties files, but I think there shouldn't be that many
seeks there (just writing sequentially, as you only add properties).
The edge file is sorted, but the other way around (end id, start id). I
guess this has roughly the same effect. Is this correct? Most of our very
dense nodes have a large in-degree and small out-degree. The financial
network has quite a few nodes with large degrees - millions - that connect
everything all over the place (tax collectors, utility companies, large
telcos, etc.), which makes it harder to get locality advantages all the
time.
I am not sure what you mean by 'cleanup the csv file'. Can you explain? We
read our csv over the network, so it doesn't pollute the local FS caches.
I easily get the 1M nodes / edges per second, as long as there is no paging
happening. My disk is 7200RPM SATA, so nothing fancy there. We typically
abuse one of our worker Hadoop nodes for doing Neo4j batch imports, so also
no RAID (even though the boxes have 12 of these disks).
One thing I haven't looked into, but perhaps you can explain is what the
role of caches is during batch insertion (with a write only work load).
Would it add anything? Or just compete for memory with the memory mapping?
I am not blaming Neo4j for being slow or anything. I am just assuming that
our box is not big enough for the work load. If I am doing something
totally wrong and it can be a lot faster, that'd be great of course.
On Thu, Oct 4, 2012 at 10:01 AM, Gergely Svigruha <sgerg...@gmail.com> wrote:
> Cool, thanks for the explanation. So when Neo4j claims to handle 30B nodes
> and edges is it actually possible to build this enormous graph as of today,
> or is it more like a theoretical upper bound? Or Neo4j could handle 30B
> nodes if someone had the patience to wait until the graph is ready:)
I believe that is just the size of the identifiers we currently use
(35 bits). If we allocate more bits, we can get even bigger DBs. So
yes, the problem would be actually filling the DB in the first place
:)
> I'm curious because I'm working on a project where we are going to have to
> take relatively small traversals in a huge graph. I assume once the graph is
> ready it won't take too much time to visit some neighbours of a particular
> node using Neo4j no matter how huge the graph is, so the challenge is to
> build the graph DB. Or do I miss something?
Nope, you are spot on, graph-local queries are a particular sweet spot
for Neo4j.
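As a quick check of the identifier size mentioned above: 35 bits give 2^35 distinct identifiers, which is in the same region as the 30B figure. The helper below is only illustrative arithmetic with a made-up class name, not Neo4j code.

```java
// Hypothetical helper: number of distinct identifiers for a given bit width.
class IdSpace {
    static long maxIds(int bits) {
        return 1L << bits;
    }

    public static void main(String[] args) {
        System.out.println(maxIds(35)); // prints 34359738368
    }
}
```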
I have the same problem with 10 million nodes and 2 billion relationships. It looks like this:
........................................................................... ......................... 19633 ms for 10000000
........................................................................... ......................... 20871 ms for 10000000
........................................................................... ......................... 22767 ms for 10000000
........................................................................... ......................... 23296 ms for 10000000
........................................................................... ......................... 23286 ms for 10000000
........................................................................... ......................... 23988 ms for 10000000
........................................................................... ......................... 25374 ms for 10000000
........................................................................... ......................... 1197765 ms for 10000000
........................................................................... ......................... 8839674 ms for 10000000
........................................................................... ......................... 15733633 ms for 10000000
........................................................................... ......................... 17917691 ms for 10000000
Performance degradation is so drastic that batch importing is unusable. What can I do?
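To put the log above into perspective, the timings can be converted to records per second. A small sketch; the class name is hypothetical, and the two inputs are the first and last lines of the log.

```java
// Hypothetical helper: converts "N ms for M records" into records per second.
class Throughput {
    static double perSecond(long records, long millis) {
        return records * 1000.0 / millis;
    }

    public static void main(String[] args) {
        System.out.printf("first batch: %.0f records/s%n", perSecond(10_000_000L, 19_633L));
        System.out.printf("last batch:  %.0f records/s%n", perSecond(10_000_000L, 17_917_691L));
    }
}
```

That is a drop from roughly 509,000 records per second in the first batch to roughly 560 per second in the last one.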
On Monday, October 15, 2012 8:55:28 AM UTC+1, Marko Kevac wrote:
> I have the same problem with 10 million nodes and 2 billion relationships. It looks like this:
> ........................................................................... ......................... 19633 ms for 10000000
> ........................................................................... ......................... 20871 ms for 10000000
> ........................................................................... ......................... 22767 ms for 10000000
> ........................................................................... ......................... 23296 ms for 10000000
> ........................................................................... ......................... 23286 ms for 10000000
> ........................................................................... ......................... 23988 ms for 10000000
> ........................................................................... ......................... 25374 ms for 10000000
> ........................................................................... ......................... 1197765 ms for 10000000
> ........................................................................... ......................... 8839674 ms for 10000000
> ........................................................................... ......................... 15733633 ms for 10000000
> ........................................................................... ......................... 17917691 ms for 10000000
> Performance degradation is so drastic that batch importing is unusable. What can I do?
> iotop shows that java process is doing only approx 1Mb/sec writes. CPU is almost always 0%. Memory used (RSS) is 22 Gb.
> My server has 128Gb of RAM.
On Mon, Oct 15, 2012 at 12:20 PM, Paul Lam <paul....@forward.co.uk> wrote:
> Not that this is a solution, but I noticed that your nodestore and
> relationshipstore memory sizes do not seem to be optimal and are set to
> more than the heap size.
> http://docs.neo4j.org/chunked/stable/configuration-io-examples.html#c...
It tries to map/unmap relationship-store file segments to memory for your relationships.
How many properties do you store on your relationships, and of which types?
Can you list the current sizes of your store files after the last import?
They will end up at about 90M bytes for nodes (9 bytes per record), 66GB for relationships (33 bytes per record) and xx times 38 for properties (probably / 4).
Can you try to pre-sort the edges by startnode-endnode?
Your MMIO config doesn't work; it declares too much memory (with the 100G for the nodes). I think you should adapt it to use about 100G in total, distributed as follows:
Please note that I changed the nodestore from gigabytes to megabytes!
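The size estimates above follow from record count times record size. A small sketch using the per-record sizes quoted in this message (9 bytes per node, 33 bytes per relationship); the class and method names are hypothetical.

```java
// Hypothetical helper: estimates store file sizes from record counts.
class StoreSizeEstimate {
    // Record sizes as quoted in this thread.
    static final long NODE_RECORD_BYTES = 9;
    static final long RELATIONSHIP_RECORD_BYTES = 33;

    static long nodeStoreBytes(long nodes) {
        return nodes * NODE_RECORD_BYTES;
    }

    static long relationshipStoreBytes(long relationships) {
        return relationships * RELATIONSHIP_RECORD_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(nodeStoreBytes(10_000_000L));            // prints 90000000
        System.out.println(relationshipStoreBytes(2_000_000_000L)); // prints 66000000000
    }
}
```

The same arithmetic for the 70M-edge graph from the start of the thread gives about 2.3GB of relationship records, which is consistent with the 4000M mapping being enough there.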
Usually you would use a server with more RAM and SSD disks for that.
Make sure your mmio settings are adapted to your nodestore (1G), property-store (1-2G) and relationship-store (20G).
Usually importing nodes is fast; for importing rels it makes sense to pre-sort them by startnode-endnode so that the importer doesn't have to swap rel-store file segments in and out that often (this is what is expensive).
If it doesn't have to swap the segments you end up between 500k and 1M rels per second.
For your larger import you should probably also change your node-map to a GNU Trove int-to-int map, or alternatively to an int-array, so that it consumes less memory.
HTH
Michael
P.S. we should probably start offering import services for large neo4j datastores :)
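As a rough illustration of the memory-mapping sizes mentioned above, the settings could be collected like this. The setting names follow the "neostore.*.db.mapped_memory" pattern used earlier in the thread; the concrete values (including 2000M for the property store, taken from the 1-2G range) and the class name are assumptions for this sketch, not a recommendation made in the thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects memory-mapping settings as key/value pairs.
class ImportConfig {
    static Map<String, String> mappedMemory() {
        Map<String, String> config = new LinkedHashMap<>();
        config.put("neostore.nodestore.db.mapped_memory", "1000M");          // about 1G
        config.put("neostore.propertystore.db.mapped_memory", "2000M");      // assumed upper end of 1-2G
        config.put("neostore.relationshipstore.db.mapped_memory", "20000M"); // about 20G
        return config;
    }

    public static void main(String[] args) {
        // Print in properties-file form, one key=value per line.
        for (Map.Entry<String, String> e : mappedMemory().entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```

The printed key=value lines can be placed wherever your importer takes its configuration from.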
# You might even get away with an int-array (which is good enough for roughly 2.1bn entries)
# You should probably try to map your node-store fully.
# Good question about the ordering. I think it might be OK too; the main objective there is to reduce the random swapping in and out.
# It might be sensible to keep the dense nodes until the end?
# By csv file cleanup I mean the removal of quotes etc., which should rather be done once in the csv file (together with the sorting) so there is less string-operation overhead
# On my Mac with a well-configured batch-inserter I have seen write speeds on an SSD of up to 170M/s
# Switch caches off (cache_type=none)
# How long does it take to read your csv file (without creating Neo4j data) over the network (is it gzipped)? Perhaps just put it on another disk. Good point about the fs-caches, that's something to try out. It would probably also be interesting to read and prepare the input data on another thread and then have one dedicated thread just for writing the Neo4j data
# Try to increase the new size, e.g. -XX:NewSize=2G, to have less tenured-space GC
HTH
Michael
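The idea of preparing input on one thread and writing on another can be sketched with a bounded queue. This is a generic producer/consumer outline in plain Java (8 or later); the class name, the quote-stripping step and the list standing in for the store write are all placeholders, and nothing here calls Neo4j.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: one thread cleans rows, the calling thread only "writes" them.
class PipelinedImport {
    // Unique sentinel object, compared by identity, that marks the end of the input.
    private static final String END = new String("END");

    static List<String> run(List<String> rawRows) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

        Thread reader = new Thread(() -> {
            try {
                for (String raw : rawRows) {
                    // Row clean-up happens here, off the writer thread.
                    queue.put(raw.replace("\"", "").trim());
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.start();

        List<String> written = new ArrayList<>();
        try {
            for (String row = queue.take(); row != END; row = queue.take()) {
                written.add(row); // placeholder for the actual store write
            }
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("import interrupted", e);
        }
        return written;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("\"1\",\"2\"", " 3,4 ")));
    }
}
```

The bounded queue keeps the reader from running far ahead of the writer, so the prepared rows do not pile up on the heap.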
On 04.10.2012 at 11:58, Friso van Vollenhoven wrote:
> We have two different databases, both representing financial networks. One type has each individual transaction that ever happened (during the time period being imported, usually 6 months) as edge properties using two arrays of longs, one for timestamps and one for amount of money (amount in cents, and yes, we need a long for that because it goes out of int range in some cases).
> I keep the mapping between domain ID ==> Neo4j Node ID in memory during the import. Because my domain IDs are ints, I can use a Java array ( final long[30000000] ) for this, which is as memory-efficient as it gets in the JVM (no HashMap, so the object overhead is about 20 bytes and longs are 8 bytes so I get good alignment on x64 for free). I can get away with a 2.5G heap for this.
> Memory maps are maxed out, as you suggest, following roughly the size distribution of the different files (nodes vs. edges). I don't map that much of the properties files, but I think there shouldn't be that much seeks there (just writing sequentially, as you only add properties).
> Edge file is sorted but the other way around (end id, start id). I guess this sorts roughly the same effect. Is this correct? Most of our very dense nodes have a large in-degree and small out-degree. The financial network has quite a few nodes with large degrees - millions - that connect everything all over the place (tax collectors, utilities companies, large telco's, etc.), which make it harder to get locality advantages all the time.
> I am not sure what you mean by 'cleanup the csv file'. Can you explain? We read our csv over the network, so it doesn't pollute the local FS caches.
> I easily get the 1M nodes / edges per second, as long as there is no paging happening. My disk is 7200RPM SATA, so nothing fancy there. We typically abuse one of our worker Hadoop nodes for doing Neo4j batch imports, so also no RAID (even though the boxes have 12 of these disks).
> One thing I haven't looked into, but perhaps you can explain is what the role of caches is during batch insertion (with a write only work load). Would it add anything? Or just compete for memory with the memory mapping?
> I am not blaming Neo4j for being slow or anything. I am just assuming that our box is not big enough for the work load. If I am doing something totally wrong and it can be a lot faster, that'd be great of course.
> Thanks,
> Friso
> On Thu, Oct 4, 2012 at 10:59 AM, Michael Hunger <michael.hun...@neotechnology.com> wrote:
> How many properties do you have on your nodes and relationships?
> And how do you identify the nodes to connect?
> Please note that your node-map will be limited with respect to memory space, perhaps you'd rather want to use an external service like redis or a more memory-efficient collection (like the trove-collections or perhaps even an int-array).
> You should try to use as much memory as possible for the memory-mapped files, e.g. 20G in total in your case, so that there is still memory available for OS, OS-filesystem-caches and JVM heap (which should be around 4-8G).
> see also:
> I think it also makes sense to pre-sort your edge-rows by start-id, end-id so that you hit similar memory-mapped windows during the import.
> It also makes sense to do the cleanup of the csv files once and not on every import.
> In general the batch-importer should be able to sustain the 1M nodes & edges per second.
> If you have an SSD that helps a lot.
> HTH
> Michael
> Am 04.10.2012 um 06:08 schrieb Gergely Svigruha:
>> ...when i set "neostore.relationshipstore.db.mapped_memory=4000M" it remains fast (finishes the loading in 5 mins) so this solves it...
>> Btw do you have any estimation how much time does it take (hours, days, weeks) to load a considerably huger graph (100M nodes, 10B edges) on a Linux server with 32G RAM, a 4 core I9 processor and HDD disks?
>> Greg
>> 2012. október 3., szerda 19:23:42 UTC+7 időpontban Gergely Svigruha a következőt írta:
>> Hi,
>> I use neo4j community 1.8 on a Linux 3.4.6-2.10-desktop machine (4core 32G RAM), my JDK version is 1.7.0_04. I start the JVM with max 4G RAM (-Xmx4096m). I inserted a graph with 1M vertices and 70M edges, stored in two CSV files. I've observed that after inserting 30M edges the performance breaks down. For the first 30M edges it takes avg 2 seconds to insert 1M edge, after that it takes 40-60 second per 1M edges. What can be the cause of it and how can I improve this performance?
>> Thanks!
I did a trial run with the node store mapped fully. Improved a bit.
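
To make "mapped fully" concrete, here is a rough sizing sketch. It assumes fixed record sizes of 9 bytes per node and 33 bytes per relationship, which is my recollection of the 1.x store format and should be checked against the actual store files. The class and method names are made up for illustration; only the two mapped_memory setting names are real Neo4j settings.

```java
public class StoreSizing {
    // ASSUMPTION: fixed record sizes (bytes) of the Neo4j 1.x store format.
    static final int NODE_RECORD_BYTES = 9;
    static final int REL_RECORD_BYTES = 33;

    // Bytes needed to hold every record of one store file.
    static long storeBytes(long records, int recordBytes) {
        return records * recordBytes;
    }

    // Megabytes (10^6 bytes), rounded up so the mapping is never too small.
    static long toMegabytes(long bytes) {
        return (bytes + 999_999L) / 1_000_000L;
    }

    public static void main(String[] args) {
        long nodes = 30_000_000L;   // node count mentioned in this thread
        long rels = 680_000_000L;   // relationship count mentioned in this thread

        System.out.println("neostore.nodestore.db.mapped_memory="
                + toMegabytes(storeBytes(nodes, NODE_RECORD_BYTES)) + "M");
        System.out.println("neostore.relationshipstore.db.mapped_memory="
                + toMegabytes(storeBytes(rels, REL_RECORD_BYTES)) + "M");
    }
}
```

Under these assumptions the node store needs about 270M, while the relationship store alone needs about 22440M, before any property stores. For Gergely's graph, 70M relationships come to roughly 2310M, which fits inside the 4000M mapping that fixed his import.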
The CSV comes over the network and must come off another disk, as we take
the machine we use for the import out of the Hadoop cluster
responsibilities (we have to, to make sure we can use all the RAM). We
don't do compression currently, but could. That's a good idea, though
(it's a config switch in Hadoop, so easy to implement). We also thought
about the multi-threaded approach, but didn't implement it yet. Right now
the importer just reads the csv over the wire in 100MB increments (buffer
size). Glad to hear I am doing the right thing with switching off caches in
neo.
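
For reference, the multi-threaded approach (one thread reading and parsing, one dedicated thread writing) could look roughly like the sketch below. It needs Java 8+ for the lambdas. `PipelinedImport`, `run` and the `edgeWriter` callback are hypothetical names; the callback stands in for whatever call actually creates the relationship in the batch inserter, and the "startId,endId" row layout is assumed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

public class PipelinedImport {
    // Sentinel telling the writer that no more edges will arrive.
    private static final long[] END = new long[0];

    /**
     * Parses "startId,endId" rows on a helper thread and hands each parsed
     * edge to edgeWriter on the calling thread. Returns the number of edges written.
     */
    public static long run(Iterable<String> lines, Consumer<long[]> edgeWriter)
            throws InterruptedException {
        // Bounded queue: the parser blocks when the writer falls behind.
        BlockingQueue<long[]> queue = new ArrayBlockingQueue<>(10_000);

        Thread parser = new Thread(() -> {
            try {
                for (String line : lines) {
                    String[] cols = line.split(",");
                    queue.put(new long[] {
                        Long.parseLong(cols[0].trim()),
                        Long.parseLong(cols[1].trim())
                    });
                }
            } catch (InterruptedException e) {
                // Stop parsing early; the END marker below still releases the writer.
            } finally {
                try { queue.put(END); } catch (InterruptedException ignored) { }
            }
        }, "csv-parser");
        parser.start();

        long written = 0;
        // Single writer: only this thread touches the (hypothetical) store writer.
        for (long[] edge = queue.take(); edge != END; edge = queue.take()) {
            edgeWriter.accept(edge);
            written++;
        }
        parser.join();
        return written;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> sample = Arrays.asList("1,2", "2,3", "3,1");
        long n = run(sample, edge -> System.out.println(edge[0] + " -> " + edge[1]));
        System.out.println(n + " edges written");
    }
}
```

The bounded queue keeps the parser from running far ahead of the writer, so heap use stays flat. Whether this helps at all depends on parsing being a real share of the import time, which is what Michael's question about the plain read time of the csv is meant to find out.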
I checked the GC pressure on the importer (using jstat) and it's not a lot.
There are no old gen collects happening during import.
Meanwhile, Kris tells me he has a working Hadoop-based import job that
creates the DB files in a distributed fashion (in about an hour, but there
is probably some room for improvement there). It creates different parts of
the files across the cluster and then you just concatenate those at the end
of the job. This could be a nice starting point for that service of yours...
(I also heard a rumor that Kris is working on a blog post about this
approach. Stay tuned.)
Michael Hunger <michael.hun...@neotechnology.com> wrote:
> Hi Friso,
> # You might even get away with an int-array (which is good enough for
> node IDs up to about 2.1bn, i.e. 2^31 - 1)
> # You should probably try to map your node-store fully.
> # Good question about the ordering. I think it might be OK too; the main
> objective there is to reduce the random swap in/out.
> # It might be sensible to keep the dense nodes up to the end?
> # with csv file cleanup I mean the removal of quotes etc. which should
> rather be done once in the csv file (together with the sorting) so there is
> less string operation overhead
> # On my mac with a well configured batch-inserter I have seen write speeds
> on an SSD of up to 170M/s
> # switch caches off (cache_type=none)
> # how long does it take to read your csv file (w/o creating neo4j stuff)
> over the network (is it gzipped)? perhaps just put it on another disk.
> Good point about the fs-caches, that's something to try out. It would
> probably also be interesting to read & prepare the input data on another
> thread and then have one dedicated thread for just writing neo data
> # try to increase the new size e.g. -XX:NewSize=2G to have less tenured
> space GC
> HTH
> Michael
> Am 04.10.2012 um 11:58 schrieb Friso van Vollenhoven:
> Hi Michael,
> We have two different databases, both representing financial networks. One
> type has each individual transaction that ever happened (during the time
> period being imported, usually 6 months) as edge properties using two
> arrays of longs, one for timestamps and one for amount of money (amount in
> cents, and yes, we need a long for that because it goes out of int range in
> some cases).