First of all, ignoring for a second the problems that I'm going to describe, I must express my warmest kudos to those who created and contributed to Neo4J - it rocks. Both relatively - I compared it to OrientDB and Hypergraph, but also on the absolute scale - the API, the documentation, the performance, Cypher, the tools - simply brilliant. Thanks for creating such a useful and capable platform.
Now, unfortunately, on to problems: I've got a few datasets in one DB with total of 33M nodes, 46M relationships. The resulting DB size is 5GB on the file system and I'm wondering why is it so big? The initial dataset (XML) is 1GB - lots of redundant data, the actual "data" are at least half the size. In terms of how these data are stored, every node has a single property, some nodes (I'd say less than 10%) have 2 properties, and less than 1% have a bit more - all short strings.
Out of 5GB:
- 288MB neostore.nodestore.db
- 1500MB neostore.propertystore.db
- 1463MB neostore.relationshipstore.db
- 1890MB is Lucene index
I'm concerned that such a big DB on disk requires significant amount of memory for caching - it won't fit into physical memory so there will be lots of IO when queried live.
1. As a general request, I think it would be good to look at improving the way the data are stored - if possible of course. For example, being able to store numbers of different sizes (1 bit to 8 bytes), dates, 32-bit IDs, have secondary indexes for repetitive strings would be nice.
2. I'm trying to understand whether there is anything I can do about the way I construct the graph in order to reduce its size.
For example, my property lengths vary, but on average they are about 12 characters. Times 33M, that's roughly 400MB. How does it become 1500MB? How does Neo store properties? Interestingly, looking at the property store file, I can't see the actual property values inside it; it looks more like a map table. Are these references into Lucene? So would the way to optimise this be to reduce the number of properties? Is there a way to tell Lucene that I have lots of repetitive values (a column-based store with prefix encoding would have saved lots of space)?
For relationships, I can see that it's roughly 32 bytes per relationship - that's 4 longs. If node IDs are longs (is it possible to have ints?) then it's 2 nodes, plus another ID for the name, plus flags - is that correct? So there's more or less no way to optimise, unless I reduce the number of relationships. It would be nice to have 32-bit IDs in the future - not all datasets exceed the 32-bit range.
3. Lucene index also seems to have lots of duplicates - I have lots of equal property values that the nodes are indexed by and also that they have as a property, so I can see repetitive words in the index. Is there something like secondary index - give these words an ID and then use that ID instead of the words? I could get away with less than 16bit for these IDs. Or a way to define "buckets" so that I can just append nodes into them without even specifying the values - all I need is to be able to iterate over the nodes in the same bucket?
Are there ways to fine-tune Lucene indexes without breaking Neo4J?
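The "give these words an ID and then use that ID instead of the words" idea from point 3 is essentially dictionary encoding done on the application side. Below is a minimal sketch in plain Java, assuming the mapping is kept outside the database; it is not a Neo4j or Lucene feature, and the class `StringDictionary` and its method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: maps each distinct string to a small integer ID. */
final class StringDictionary {
    private final Map<String, Integer> idsByValue = new HashMap<>();
    private final List<String> valuesById = new ArrayList<>();

    /** Returns the existing ID for the value, or assigns the next free one. */
    int idFor(String value) {
        Integer id = idsByValue.get(value);
        if (id == null) {
            id = valuesById.size();      // IDs are handed out densely: 0, 1, 2, ...
            idsByValue.put(value, id);
            valuesById.add(value);
        }
        return id;
    }

    /** Translates an ID back into the original string. */
    String valueOf(int id) {
        return valuesById.get(id);
    }

    /** Number of distinct strings seen so far. */
    int size() {
        return valuesById.size();
    }

    public static void main(String[] args) {
        StringDictionary dict = new StringDictionary();
        // Made-up example values; repeated strings map to the same ID.
        for (String s : new String[] {"highway", "river", "highway"}) {
            System.out.println(s + " -> " + dict.idFor(s));
        }
    }
}
```

With fewer than 65,536 distinct values, as mentioned in the post, each ID would fit into a Java short, so the node could carry a small number instead of a repeated string.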
Did you delete a lot of nodes/properties/rels when building up the dataset? If so, there might be freed IDs in your stores that could be compacted/reused.
In general, node records use 9 bytes per node and relationship records 33 bytes per rel (which fits pretty directly with your store sizes and # of nodes/rels).
Properties are stored in a packed way in 38-byte blocks (at least one block per node/rel with properties), which try to inline numbers, arrays and strings as much as possible.
So here as well, the block size aligns pretty well with your disk usage: roughly 38 bytes × # of nodes.
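The record sizes quoted above can be checked against the reported file sizes with some quick arithmetic. The sketch below is only a back-of-the-envelope estimate: it assumes exactly one property block per node, no properties on relationships, and that the "MB" figures in the original post are MiB. The class and method names are made up for illustration.

```java
/** Back-of-the-envelope store size estimate from fixed record sizes. */
final class StoreSizeEstimate {
    // Record sizes as quoted in the reply above.
    static final long NODE_RECORD_BYTES = 9;
    static final long REL_RECORD_BYTES = 33;
    static final long PROPERTY_BLOCK_BYTES = 38;

    /** Total bytes for a number of fixed-size records. */
    static long bytes(long records, long recordSize) {
        return records * recordSize;
    }

    /** Converts bytes to whole MiB, rounding down. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        long nodes = 33_000_000L;   // node count from the original post
        long rels = 46_000_000L;    // relationship count from the original post

        System.out.println("node store:     " + toMiB(bytes(nodes, NODE_RECORD_BYTES)) + " MiB");
        System.out.println("rel store:      " + toMiB(bytes(rels, REL_RECORD_BYTES)) + " MiB");
        // Assumption: one property block per node; real stores need more
        // when nodes have several properties or longer strings.
        System.out.println("property store: " + toMiB(bytes(nodes, PROPERTY_BLOCK_BYTES)) + " MiB");
    }
}
```

Under these assumptions the node and relationship estimates land close to the reported 288MB and 1463MB, while the property estimate is a lower bound, since some nodes carry more than one property.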
Thanks Michael, interesting articles. I did not delete anything during creation - it's been freshly created from scratch using BatchInserter.
From what I understand now, relationships are expensive, and so are properties - need to reduce the number of them if possible. Also, you do have compact storage for some types and for some strings, so I'll try to exploit that.
On Sunday, July 29, 2012 7:34:42 AM UTC+10, Michael Hunger wrote:
> Denis,
> Did you delete a lot of nodes/properties/rels when building up the dataset? If so, there might be freed IDs in your stores that could be compacted/reused.
> Other than that, there are some blog posts describing the internal structure of neo4j records.
> In general, node records use 9 bytes per node and relationship records 33 bytes per rel (which fits pretty directly with your store sizes and # of nodes/rels).
> Properties are stored in a packed way in 38-byte blocks (at least one block per node/rel with properties), which try to inline numbers, arrays and strings as much as possible.
> So here as well, the block size aligns pretty well with your disk usage: roughly 38 bytes × # of nodes.
> HTH
> Michael
> On 28.07.2012 at 13:31, Denis Mikhalkin wrote:
> Hello,
Actually neither relationships nor properties are really expensive.
But it would be interesting to have more options for configuring default block sizes. E.g. if you know that you only have one property that fits into 8 bytes, the property-store record could be much smaller. The same applies if you know that you never have relationships with properties. But this is not the general case, rather a custom optimization.
Did you already run into issues with the store size? What's more interesting is to get as many of the accessed nodes and rels as possible into the 2nd-level caches. If that's an issue for you, then try to pre-load them by iterating over GlobalGraphOperations.at(gdb).getAllNodes() and GlobalGraphOperations.at(gdb).getAllRelationships() (or the appropriate Cypher query).
Is it still the case that Neo4j doesn't free space after data items are removed? I've got an old (production) database which has grown from 300 MB to 5 GB since 2010. :-(
> Actually neither relationships nor properties are really expensive.
> But it would be interesting to have more options for configuring default block sizes. E.g. if you know that you only have one property that fits into 8 bytes, the property-store record could be much smaller. The same applies if you know that you never have relationships with properties. But this is not the general case, rather a custom optimization. Did you already run into issues with the store size? What's more interesting is to get as many of the accessed nodes and rels as possible into the 2nd-level caches. If that's an issue for you, then try to pre-load them by iterating over GlobalGraphOperations.at(gdb).getAllNodes() and GlobalGraphOperations.at(gdb).getAllRelationships() (or the appropriate Cypher query).
> Can you raise a GitHub issue about this?
> Cheers
> Michael
> On 29.07.2012 at 08:44, Denis Mikhalkin wrote:
Node and rel IDs are reused only after a restart. Property blocks are reused directly.
Which store files do the 5GB live in?
It is pretty simple to copy the store into a new one, merging and compacting (nodes), rels and properties. I wrote a tool for this that uses the batch inserter and keeps the node IDs.
Michael
See this one, which also allows filtering out properties and rel-types that are no longer used.
// Parses an optional comma-separated command-line argument into a lower-cased set.
private static Set<String> splitOptionIfExists(String[] args, final int index) {
    if (args.length <= index) return emptySet();
    return new HashSet<String>(asList(args[index].toLowerCase().split(",")));
}

// Copies the source store into a fresh target store, skipping the given rel-types and properties.
private static void copyStore(String sourceDir, String targetDir, Set<String> ignoreRelTypes, Set<String> ignoreProperties) throws Exception {
    final File target = new File(targetDir);
    final File source = new File(sourceDir);
    // Refuse to overwrite an existing target or to read from a missing source.
    if (target.exists()) throw new IllegalArgumentException("Target Directory already exists " + target);
    if (!source.exists()) throw new IllegalArgumentException("Source Database does not exist " + source);
    // Write through the batch inserter; read the source as a regular embedded database.
    BatchInserter targetDb = new BatchInserterImpl(target.getAbsolutePath(), config());
    GraphDatabaseService sourceDb = new EmbeddedGraphDatabase(sourceDir, config());
    logs = new PrintWriter(new FileWriter(new File(target, "store-copy.log")));
> Is it still that Neo4j doesn't free space after removing data items? I've got an old (production) database which has grown from 300 MB to 5 GB since 2010. :-(
> On 29.07.2012 at 11:05, Michael Hunger wrote:
Looking at http://3.bp.blogspot.com/__Sn-iXmVbEI/TLDLADnUwbI/AAAAAAAAADU/WoqsZHQ... (perhaps outdated, but I hope still relevant), I'd say there are many other options, including inlining of properties/relationships, nodes/relationships without properties, replacing empty IDs with an "absent" bit flag, taking into account adjacency of relationships, a column-based property store, and "packing" IDs and numbers. I'm sure other people will have more suggestions. I'll raise a GitHub issue for this as requested.
I don't know whether I have a particular issue with store size, but I do see slow performance that flattens out after a number of similar queries, which suggests disk caching (I have not verified this), so I was thinking a smaller DB would certainly be faster for random queries. I've reduced the number of properties that I use and that shaved off 400MB, so I think I'll revisit the graph structure later, once my queries are stable, to remove unnecessary nodes/rels/props. It would be nice to have some help from Neo4J on this - something like a "cold spots" report (or even a "mark unused" operation) that would highlight the parts of the structure (props, rels, nodes, indices) that are never going to be touched by a set of queries.
The option of pre-caching all nodes/relationships would probably not work in the long run: my queries are spatial and temporal, so they have a certain locality, and without enough memory for the whole DB I hope it'll get cached naturally based on that locality. I'd rather have the full index cached, plus some property columns, as I need to perform range "where" queries.
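The hope that query locality keeps the hot part of the graph in memory can be illustrated with a tiny least-recently-used cache. This is a generic sketch built on java.util.LinkedHashMap, not Neo4j's actual cache implementation; the class name `LruCache` and the capacity are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical fixed-capacity cache that evicts the least recently used entry. */
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        // The third argument 'true' orders entries by access instead of insertion.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    /** Called by LinkedHashMap after each insert; true drops the eldest entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruCache<Long, String> cache = new LruCache<>(2);
        cache.put(1L, "node 1");
        cache.put(2L, "node 2");
        cache.get(1L);               // touching node 1 makes node 2 the eldest
        cache.put(3L, "node 3");     // exceeds capacity, so node 2 is evicted
        System.out.println(cache.keySet());
    }
}
```

Records that are touched repeatedly by nearby queries stay in such a cache, while records outside the current region are evicted first.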
Denis,
yes, some of what you suggest is already in place, such as the inlining
of small properties. We are also going to merge in better handling of
dense nodes after 1.8 GA, and from there please feel free to experiment
with more optimizations. Comments and spikes are much appreciated!
<denismikhal...@gmail.com> wrote:
> Looking at
> http://3.bp.blogspot.com/__Sn-iXmVbEI/TLDLADnUwbI/AAAAAAAAADU/WoqsZHQ...
> (perhaps outdated but I hope still relevant) I'd say there are many other
> options including inlining of properties/relationships, nodes/relationships
> without properties, replacing empty IDs with an "absent" bit flag, taking
> into account adjacency of relationships, column-based property store,
> "packing" IDs and numbers, I'm sure other people will have more suggestions.
> I'll raise a github issue for this as requested.
> I don't know whether I have a particular issue with store size, but I do see
> some slow performance which flattens after a number of similar queries which
> suggests disk caching (have not verified though) so I was thinking smaller
> DB size would certainly be faster for random queries. I've reduced the
> number of properties that I use and that shaved off 400MB, so I think I'll
> revisit the graph structure later once my queries are stable to remove
> unnecessary nodes/rels/props. Would be nice to have some help from Neo4J on
> this - something like "cold spots" report (or even a "mark unused"
> operation) which would highlight the parts of the structure (props, rels,
> nodes, indices) which are never ever going to be touched by a set of
> queries.
> The option of pre-caching of all nodes/relationships would probably not work
> in the long run as my queries are spatial and for time so they have certain
> locality, and with not enough memory for the whole DB I hope it'll get
> cached naturally based on that locality. I'd rather have the full index
> cached, and some property columns as I need to perform range "where".
> Thanks.
> Denis
> On Sunday, July 29, 2012 7:05:49 PM UTC+10, Michael Hunger wrote:
>> Actually neither relationships nor properties are really expensive.
>> But it would be interesting to have more options for configuring default
>> block sizes. E.g. if you know that you have only one property that fits into
>> 8 bytes then the property-store-record could be much smaller. Same if you
>> know that you never have relationships with properties. But this is not a
>> general case, rather a custom optimization. Did you already run into issues
>> with the store-size? What's more interesting is to get as many of the
>> accessed nodes and rels into the 2nd level caches. If that's an issue for
>> you then try to pre-load them by iterating over
>> GlobalGraphOperations.at(gdb).getAllNodes() and
>> GlobalGraphOperations.at(gdb).getAllRelationships() (or the appropriate
>> Cypher query).
>> Can you raise a GitHub issue about this?
>> Cheers
>> Michael
>> On 29.07.2012 at 08:44, Denis Mikhalkin wrote:
>> Thanks Michael, interesting articles. I did not delete anything during
>> creation - it's been freshly created from scratch using BatchInserter.
>> From what I understand now, relationships are expensive, and so are
>> properties - need to reduce the number of them if possible. Also, you do
>> have compact storage for some types and for some strings, so I'll try to
>> exploit that.
>> Thanks.
>> Denis
>> On Sunday, July 29, 2012 7:34:42 AM UTC+10, Michael Hunger wrote:
>>> Denis,
>>> did you delete a lot of nodes/properties/rels when building up the
>>> dataset? If so then there might be freed IDs in your stores that could be
>>> compacted/reused.
>>> Other than that there are some blog posts describing the internal
>>> structure of neo4j records.
>>> In general node records use 9 bytes per node and relationship records 33
>>> bytes per rel (which fits pretty directly with your store sizes and # of
>>> nodes/rels).
>>> Properties are stored in a packed way in 38-byte blocks (at least
>>> one block per node/rel w/ properties) which try to inline numbers, arrays
>>> and strings as much as possible.
>>> So here as well the block size aligns pretty well with your disk size:
>>> 38 bytes x # of nodes.
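The record sizes quoted above can be checked against the reported store
files with a little arithmetic. The following is only a rough sketch: the
9, 33 and 38 byte figures are taken from this thread rather than read from
the Neo4j source, the class and method names are made up for illustration,
and it assumes the reported "MB" are really MiB. The property figure is a
lower bound, since it counts exactly one block per node.

```java
import java.util.Locale;

/** Back-of-the-envelope store sizes from the record sizes quoted in this thread. */
class StoreSizeEstimate {
    // Record sizes in bytes as stated in the thread (assumed, not verified against Neo4j).
    static final int NODE_RECORD_BYTES = 9;
    static final int RELATIONSHIP_RECORD_BYTES = 33;
    static final int PROPERTY_BLOCK_BYTES = 38;

    /** Size of a fixed-record store: one record per entity. */
    static long estimateBytes(long recordCount, int recordBytes) {
        return recordCount * recordBytes;
    }

    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long nodes = 33_000_000L;          // node count from the original post
        long relationships = 46_000_000L;  // relationship count from the original post

        long nodeStore = estimateBytes(nodes, NODE_RECORD_BYTES);
        long relStore = estimateBytes(relationships, RELATIONSHIP_RECORD_BYTES);
        // Lower bound only: assumes a single property block per node.
        long propStore = estimateBytes(nodes, PROPERTY_BLOCK_BYTES);

        System.out.printf(Locale.ROOT, "node store:         ~%.0f MiB (reported 288)%n", toMiB(nodeStore));
        System.out.printf(Locale.ROOT, "relationship store: ~%.0f MiB (reported 1463)%n", toMiB(relStore));
        System.out.printf(Locale.ROOT, "property store:     >= %.0f MiB (reported 1500)%n", toMiB(propStore));
    }
}
```

With the rounded counts from the original post this gives roughly 283 MiB
for nodes and 1448 MiB for relationships, close to the reported file sizes,
and at least about 1196 MiB for properties; the rest of the 1500MB would
then be nodes that need more than one block.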
>>> HTH
>>> Michael
>>> On 28.07.2012 at 13:31, Denis Mikhalkin wrote:
>>> Hello,
>>> First of all, ignoring for a second the problems that I'm going to
>>> describe, I must express my warmest kudos to those who created and
>>> contributed to Neo4J - it rocks. Both relatively - I compared it to OrientDB
>>> and Hypergraph, but also on the absolute scale - the API, the documentation,
>>> the performance, Cypher, the tools - simply brilliant. Thanks for creating
>>> such a useful and capable platform.
>>> Now, unfortunately, on to problems: I've got a few datasets in one DB
>>> with total of 33M nodes, 46M relationships. The resulting DB size is 5GB on
>>> the file system and I'm wondering why is it so big? The initial dataset
>>> (XML) is 1GB - lots of redundant data, the actual "data" are at least half
>>> the size. In terms of how these data are stored, every node has a single
>>> property, some nodes (I'd say less than 10%) have 2 properties, and less
>>> than 1% have a bit more - all short strings.
>>> Out of 5GB:
>>> - 288MB neostore.nodestore.db
>>> - 1500MB neostore.propertystore.db
>>> - 1463MB neostore.relationshipstore.db
>>> - 1890MB is Lucene index
>>> I'm concerned that such a big DB on disk requires significant amount of
>>> memory for caching - it won't fit into physical memory so there will be lots
>>> of IO when queried live.
>>> 1. As a general request, I think it would be good to look at improving
>>> the way the data are stored - if possible of course. For example, being able
>>> to store numbers of different sizes (1 bit to 8 bytes), dates, 32-bit IDs,
>>> have secondary indexes for repetitive strings would be nice.
>>> 2. I'm trying to understand if there is anything I can do with the way
>>> how I construct the graph in order to reduce its size
>>> For example, my property lengths vary but on average they are about 12
>>> characters. Times 33M - roughly 400MB. How does it become 1500MB? How does
>>> Neo store properties? Interestingly, by looking at the property store file,
>>> I can't see the actual property values inside of it, it looks more like a
>>> map table. Are these references into Lucene? So the way to optimise this
>>> would be the reduce the number of properties? Is there a way to tell Lucene
>>> that I have lots of repetitive values (a column-based store with prefix
>>> encoding would have saved lots of space)?
>>> For relationships, I can see that it's roughly 32 bytes per relationship
>>> - that's 4 longs. If node IDs are longs (is it possible to have ints?) then
>>> it's 2 nodes, plus another ID for name, plus flags - is that correct? So
>>> it's kind of no way to optimise, unless I reduce the number of
>>> relationships. Would be nice to have 32-bit IDs for future - not all
>>> datasets exceed 32-bit range.
>>> 3. Lucene index also seems to have lots of duplicates - I have lots of
>>> equal property values that the nodes are indexed by and also that they have
>>> as a property, so I can see repetitive words in the index. Is there
>>> something like secondary index - give these words an ID and then use that ID
>>> instead of the words? I could get away with less than 16bit for these IDs.
>>> Or a way to define "buckets" so that I can just append nodes into them
>>> without even specifying the values - all I need is to be able to iterate
>>> over the nodes in the same bucket?
>>> Are there ways to fine-tune Lucene indexes without breaking Neo4J?
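The "secondary index" and "buckets" idea in point 3 can be illustrated
outside of Neo4j and Lucene. The sketch below is plain in-memory Java and
not an existing Neo4j or Lucene feature: the class ValueDictionary and its
methods are hypothetical names. It assigns each distinct string a 16-bit
ID the first time it is seen and keeps one bucket of node IDs per value,
so repeated values are stored once and a bucket can simply be iterated.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical dictionary encoding: repeated strings become small IDs with one bucket each. */
class ValueDictionary {
    private final Map<String, Short> idsByValue = new HashMap<>();
    private final List<String> valuesById = new ArrayList<>();
    private final List<List<Long>> buckets = new ArrayList<>();

    /** Returns the ID for a value, assigning the next free ID on first use. */
    short idFor(String value) {
        Short existing = idsByValue.get(value);
        if (existing != null) {
            return existing;
        }
        if (valuesById.size() > Short.MAX_VALUE) {
            throw new IllegalStateException("more distinct values than fit into a 16-bit ID");
        }
        short id = (short) valuesById.size();
        idsByValue.put(value, id);
        valuesById.add(value);
        buckets.add(new ArrayList<Long>());
        return id;
    }

    /** Appends a node ID to the bucket of the given value. */
    void add(String value, long nodeId) {
        buckets.get(idFor(value)).add(nodeId);
    }

    /** Node IDs stored under a value, or an empty list if the value was never added. */
    List<Long> nodesIn(String value) {
        Short id = idsByValue.get(value);
        if (id == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(buckets.get(id));
    }

    public static void main(String[] args) {
        ValueDictionary dictionary = new ValueDictionary();
        dictionary.add("highway", 1L);
        dictionary.add("residential", 2L);
        dictionary.add("highway", 3L);
        // Iterate over all nodes that share the value "highway".
        for (long nodeId : dictionary.nodesIn("highway")) {
            System.out.println(nodeId);
        }
    }
}
```

Whether something like this could replace the Lucene index depends on the
queries: it only covers exact lookup and iteration per value, not range or
full-text search.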