Physical DB size
 
Denis Mikhalkin  
From: Denis Mikhalkin <denismikhal...@gmail.com>
Date: Sat, 28 Jul 2012 04:31:04 -0700 (PDT)
Local: Sat, Jul 28 2012 7:31 am
Subject: Physical DB size

Hello,

First of all, ignoring for a second the problems that I'm going to
describe, I must express my warmest kudos to those who created and
contributed to Neo4J - it rocks. Both relatively - I compared it to
OrientDB and Hypergraph, but also on the absolute scale - the API, the
documentation, the performance, Cypher, the tools - simply brilliant.
Thanks for creating such a useful and capable platform.

Now, unfortunately, on to the problems: I've got a few datasets in one DB with
a total of 33M nodes and 46M relationships. The resulting DB size is 5GB on the
file system and I'm wondering why it is so big. The initial dataset (XML)
is 1GB - lots of redundant data, the actual "data" are at least half the
size. In terms of how these data are stored, every node has a single
property, some nodes (I'd say less than 10%) have 2 properties, and less
than 1% have a bit more - all short strings.

Out of 5GB:
- 288MB neostore.nodestore.db
- 1500MB neostore.propertystore.db
- 1463MB neostore.relationshipstore.db
- 1890MB is Lucene index

I'm concerned that such a big DB on disk requires a significant amount of
memory for caching - it won't fit into physical memory so there will be
lots of IO when queried live.

1. As a general request, I think it would be good to look at improving the
way the data are stored - if possible of course. For example, being able to
store numbers of different sizes (1 bit to 8 bytes), dates, 32-bit IDs,
have secondary indexes for repetitive strings would be nice.

2. I'm trying to understand whether there is anything I can do in the way I
construct the graph to reduce its size.

For example, my property lengths vary but on average they are about 12
characters. Times 33M - roughly 400MB. How does it become 1500MB? How does
Neo store properties? Interestingly, by looking at the property store file,
I can't see the actual property values inside of it, it looks more like a
map table. Are these references into Lucene? So the way to optimise this
would be to reduce the number of properties? Is there a way to tell Lucene
that I have lots of repetitive values (a column-based store with prefix
encoding would have saved lots of space)?

For relationships, I can see that it's roughly 32 bytes per relationship -
that's 4 longs. If node IDs are longs (is it possible to have ints?) then
it's 2 nodes, plus another ID for name, plus flags - is that correct? So
there's really no way to optimise that, unless I reduce the number of
relationships. It would be nice to have 32-bit IDs in the future - not all
datasets exceed the 32-bit range.

3. Lucene index also seems to have lots of duplicates - I have lots of
equal property values that the nodes are indexed by and also that they have
as a property, so I can see repetitive words in the index. Is there
something like a secondary index - give these words an ID and then use that
ID instead of the words? I could get away with less than 16bit for these
IDs. Or a way to define "buckets" so that I can just append nodes into them
without even specifying the values - all I need is to be able to iterate
over the nodes in the same bucket?
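
The "give these words an ID" idea above is essentially dictionary encoding. A minimal sketch in plain Java follows, assuming the set of distinct values is small enough to fit a 16-bit ID. The class `ValueDictionary` and its methods are hypothetical names for illustration; they are not part of Neo4j or Lucene, and the thread does not say whether either offers such a feature.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: maps repetitive string values to compact 16-bit IDs. */
final class ValueDictionary {
    private final Map<String, Short> idsByValue = new HashMap<>();
    private final List<String> valuesById = new ArrayList<>();

    /** Returns the existing ID for a value, or assigns the next free one. */
    short idFor(String value) {
        Short existing = idsByValue.get(value);
        if (existing != null) {
            return existing;
        }
        // IDs run from 0 to Short.MAX_VALUE, so at most 32768 distinct values fit.
        if (valuesById.size() > Short.MAX_VALUE) {
            throw new IllegalStateException("too many distinct values for a 16-bit ID");
        }
        short id = (short) valuesById.size();
        idsByValue.put(value, id);
        valuesById.add(value);
        return id;
    }

    /** Looks up the original string for a previously assigned ID. */
    String valueFor(short id) {
        return valuesById.get(id);
    }

    /** Number of distinct values seen so far. */
    int size() {
        return valuesById.size();
    }
}
```

With something like this, the application would store or index the `short` returned by `idFor(...)` instead of the repeated string and keep the dictionary itself elsewhere. Whether that actually shrinks the store or the index would need to be measured.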

Are there ways to fine-tune Lucene indexes without breaking Neo4J?

Would appreciate any suggestions.

Thanks.

Denis


 
Michael Hunger  
From: Michael Hunger <michael.hun...@neotechnology.com>
Date: Sat, 28 Jul 2012 23:34:42 +0200
Local: Sat, Jul 28 2012 5:34 pm
Subject: Re: [Neo4j] Physical DB size

Denis,

Did you delete a lot of nodes/properties/rels when building up the dataset? If so, there might be freed IDs in your stores that could be compacted/reused.

Other than that there are some blog posts describing the internal structure of neo4j records.
http://digitalstain.blogspot.de/2011/11/rooting-out-redundancy-new-ne...
http://digitalstain.blogspot.de/2010/10/neo4j-internals-file-storage....
http://journal.thobe.org/2011/02/better-support-for-short-strings-in....

In general, node records use 9 bytes per node and relationship records 33 bytes per rel (which fits pretty directly with your store sizes and number of nodes/rels).

Properties are stored in a packed way in 38-byte blocks (at least one block per node/rel with properties), which try to inline numbers, arrays and strings as much as possible.
So here as well the block size aligns pretty well with your disk size: dividing the property store size by 38 bytes gives roughly your number of nodes.
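
As a rough sanity check of these figures, the arithmetic can be sketched in Java. The record sizes (9, 33 and 38 bytes) and the counts (33M nodes, 46M relationships) are taken from this thread. The class and method names are made up for illustration, and the assumption of exactly one property block per node is a simplification.

```java
import java.util.Locale;

/** Back-of-the-envelope store size estimate from fixed record sizes. */
final class StoreSizeEstimate {
    // Record sizes as quoted in this thread.
    static final int NODE_RECORD_BYTES = 9;
    static final int REL_RECORD_BYTES = 33;
    static final int PROPERTY_BLOCK_BYTES = 38;

    /** Size in bytes of a store holding recordCount fixed-size records. */
    static long estimateBytes(long recordCount, int bytesPerRecord) {
        return recordCount * bytesPerRecord;
    }

    /** Converts bytes to mebibytes (1 MiB = 1024 * 1024 bytes). */
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long nodes = 33_000_000L; // rounded count from the original post
        long rels = 46_000_000L;  // rounded count from the original post

        System.out.printf(Locale.ROOT, "nodestore         ~%.0f MiB%n",
                toMiB(estimateBytes(nodes, NODE_RECORD_BYTES)));
        System.out.printf(Locale.ROOT, "relationshipstore ~%.0f MiB%n",
                toMiB(estimateBytes(rels, REL_RECORD_BYTES)));
        // Assumes one property block per node; nodes with more or longer
        // properties would need additional blocks.
        System.out.printf(Locale.ROOT, "propertystore     ~%.0f MiB%n",
                toMiB(estimateBytes(nodes, PROPERTY_BLOCK_BYTES)));
    }
}
```

With the rounded counts this comes out at around 283 MiB, 1448 MiB and 1196 MiB, close to the reported 288MB and 1463MB for nodes and relationships. The property estimate falls below the reported 1500MB, which would be consistent with some nodes needing more than one block, though that is only a guess.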

HTH

Michael

On 28.07.2012 at 13:31, Denis Mikhalkin wrote:


 
Denis Mikhalkin  
From: Denis Mikhalkin <denismikhal...@gmail.com>
Date: Sat, 28 Jul 2012 23:44:54 -0700 (PDT)
Local: Sun, Jul 29 2012 2:44 am
Subject: Re: [Neo4j] Physical DB size

Thanks Michael, interesting articles. I did not delete anything during
creation - it's been freshly created from scratch using BatchInserter.

From what I understand now, relationships are expensive, and so are
properties - need to reduce the number of them if possible. Also, you do
have compact storage for some types and for some strings, so I'll try to
exploit that.

Thanks.

Denis


 
Michael Hunger  
From: Michael Hunger <michael.hun...@neotechnology.com>
Date: Sun, 29 Jul 2012 11:05:49 +0200
Local: Sun, Jul 29 2012 5:05 am
Subject: Re: [Neo4j] Physical DB size

Actually neither relationships nor properties are really expensive.

But it would be interesting to have more options for configuring default block sizes. E.g. if you know that you have only one property that fits into 8 bytes, then the property-store record could be much smaller. The same applies if you know that you never have relationships with properties. But this is not a general case, rather a custom optimization.

Did you already run into issues with the store size? What's more interesting is to get as many of the accessed nodes and rels as possible into the 2nd-level caches. If that's an issue for you, then try to pre-load them by iterating over GlobalGraphOperations.at(gdb).getAllNodes() and GlobalGraphOperations.at(gdb).getAllRelationships() (or the appropriate Cypher query).

Can you raise a GitHub issue about this?

Cheers

Michael

On 29.07.2012 at 08:44, Denis Mikhalkin wrote:


 
Axel Morgner  
From: Axel Morgner <a...@morgner.de>
Date: Sun, 29 Jul 2012 11:38:39 +0200
Local: Sun, Jul 29 2012 5:38 am
Subject: Re: [Neo4j] Physical DB size

Is it still the case that Neo4j doesn't free space after removing data items?
I've got an old (production) database which has grown from 300 MB to 5
GB since 2010. :-(

See this conversation from 2010:
http://lists.neo4j.org/pipermail/user/2010-November/005478.html
(please note that the code there is fairly outdated and doesn't copy
relationship properties)

On 29.07.2012 at 11:05, Michael Hunger wrote:

--

Axel Morgner · a...@morgner.de · @amorgner

c/o Morgner UG · Hanauer Landstr. 291a · 60314 Frankfurt · Germany
phone: +49 151 40522060 · skype: axel.morgner

http://structr.org
http://www.meetup.com/graphdb-frankfurt
https://splink.de


 
Michael Hunger  
From: Michael Hunger <michael.hun...@neotechnology.com>
Date: Sun, 29 Jul 2012 16:13:43 +0200
Local: Sun, Jul 29 2012 10:13 am
Subject: Re: [Neo4j] Physical DB size

Node and rel IDs are reused only after a restart. Property blocks are reused directly.

Where do the 5GB live, i.e. in which store files?

It is pretty simple to copy the store into a new one, merging and compacting (nodes), rels and properties. I wrote one that used the batch-inserter for this, keeping node-id's.

Michael

See this one, which also allows filtering out properties and rel-types that are no longer used.

package org.neo4j.tool;

import org.apache.commons.io.FileUtils;
import org.neo4j.graphdb.*;
import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.kernel.EmbeddedGraphDatabase;
import org.neo4j.kernel.impl.batchinsert.BatchInserter;
import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;
import org.neo4j.kernel.impl.nioneo.store.InvalidRecordException;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.*;

import static java.util.Arrays.asList;
import static java.util.Collections.emptySet;

public class StoreCopy {

    private static PrintWriter logs;

    @SuppressWarnings("unchecked")
    public static Map<String,String> config() {
        return (Map)MapUtil.map(
                "neostore.nodestore.db.mapped_memory", "100M",
                "neostore.relationshipstore.db.mapped_memory", "500M",
                "neostore.propertystore.db.mapped_memory", "300M",
                "neostore.propertystore.db.strings.mapped_memory", "1G",
                "neostore.propertystore.db.arrays.mapped_memory", "300M",
                "neostore.propertystore.db.index.keys.mapped_memory", "100M",
                "neostore.propertystore.db.index.mapped_memory", "100M",
                "cache_type", "weak"
        );
    }
    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: StoreCopy source target [rel,types,to,ignore] [properties,to,ignore]");
            return;
        }
        String sourceDir=args[0];
        String targetDir=args[1];
        Set<String> ignoreRelTypes= splitOptionIfExists(args, 2);
        Set<String> ignoreProperties= splitOptionIfExists(args,3);
        System.out.printf("Copying from %s to %s ignoring rel-types %s ignoring properties %s %n", sourceDir, targetDir, ignoreRelTypes, ignoreProperties);
        copyStore(sourceDir,targetDir,ignoreRelTypes,ignoreProperties);
    }

    private static Set<String> splitOptionIfExists(String[] args, final int index) {
        if (args.length <= index) return emptySet();
        return new HashSet<String>(asList(args[index].toLowerCase().split(",")));
    }

    private static void copyStore(String sourceDir, String targetDir, Set<String> ignoreRelTypes, Set<String> ignoreProperties) throws Exception {
        final File target = new File(targetDir);
        final File source = new File(sourceDir);
        if (target.exists()) throw new IllegalArgumentException("Target Directory already exists "+target);
        if (!source.exists()) throw new IllegalArgumentException("Source Database does not exist "+source);

        BatchInserter targetDb = new BatchInserterImpl(target.getAbsolutePath(),config());
        GraphDatabaseService sourceDb = new EmbeddedGraphDatabase(sourceDir, config());
        logs=new PrintWriter(new FileWriter(new File(target,"store-copy.log")));

        copyNodes(sourceDb, targetDb, ignoreProperties);
        copyRelationships(sourceDb, targetDb, ignoreRelTypes,ignoreProperties);

        targetDb.shutdown();
        sourceDb.shutdown();
        logs.close();
        copyIndex(source, target);
    }

    private static void copyIndex(File source, File target) throws IOException {
        final File indexFile = new File(source, "index.db");
        if (indexFile.exists()) {
            FileUtils.copyFile(indexFile, new File(target, "index.db"));
        }
        final File indexDir = new File(source, "index");
        if (indexDir.exists()) {
            FileUtils.copyDirectory(indexDir, new File(target, "index"));
        }
    }

    private static void copyRelationships(GraphDatabaseService sourceDb, BatchInserter targetDb, Set<String> ignoreRelTypes, Set<String> ignoreProperties) {
        long time = System.currentTimeMillis();
        int count=0;
        for (Node node : sourceDb.getAllNodes()) {
            for (Relationship rel : getOutgoingRelationships(node)) {
                if (ignoreRelTypes.contains(rel.getType().name().toLowerCase())) continue;
                createRelationship(targetDb, rel, ignoreProperties);
                count ++;
                if (count % 1000 == 0) System.out.print(".");
                if (count % 100000 == 0) System.out.println(" " + count);
            }
        }
        System.out.println("\n copying of " + count+ " relationships took "+(System.currentTimeMillis()-time)+" ms.");
    }

    private static void createRelationship(BatchInserter targetDb, Relationship rel, Set<String> ignoreProperties) {
        long startNodeId=rel.getStartNode().getId();
        long endNodeId=rel.getEndNode().getId();
        final RelationshipType type = rel.getType();
        try {
            targetDb.createRelationship(startNodeId,endNodeId , type, getProperties(rel, ignoreProperties));
        } catch (InvalidRecordException ire) {
            addLog(rel,"create Relationship: "+startNodeId+"-[:"+type+"]"+"->"+endNodeId,ire.getMessage());
        }
    }

    private static Iterable<Relationship> getOutgoingRelationships(Node node) {
        try {
            return node.getRelationships(Direction.OUTGOING);
        } catch(InvalidRecordException ire) {
            addLog(node,"outgoingRelationships",ire.getMessage());
            return Collections.emptyList();
        }
    }

    private static void copyNodes(GraphDatabaseService sourceDb, BatchInserter targetDb, Set<String> ignoreProperties) {
        final Node refNode = sourceDb.getReferenceNode();
        long time = System.currentTimeMillis();
        int count=0;
        for (Node node : sourceDb.getAllNodes()) {
            if (node.equals(refNode)) {
                targetDb.setNodeProperties(targetDb.getReferenceNode(),getProperties(node,ignoreProperties));
            } else {
                targetDb.createNode(node.getId(), getProperties(node, ignoreProperties));
            }
            count++;
            if (count % 1000 == 0) System.out.print(".");
            if (count % 100000 == 0) {
                logs.flush();
                System.out.println(" " + count);
            }
        }
        System.out.println("\n copying of " + count+ " nodes took "+(System.currentTimeMillis()-time)+" ms.");
    }

    private static Map<String, Object> getProperties(PropertyContainer pc, Set<String> ignoreProperties) {
        Map<String,Object> result=new HashMap<String, Object>();
        for (String property : getPropertyKeys(pc)) {
            if (ignoreProperties.contains(property.toLowerCase())) continue;
            try {
                result.put(property,pc.getProperty(property));
            } catch(InvalidRecordException ire) {
                addLog(pc, property, ire.getMessage());
            }
        }
        return result;
    }

    private static Iterable<String> getPropertyKeys(PropertyContainer pc) {
        try {
            return pc.getPropertyKeys();
        } catch(InvalidRecordException ire) {
            addLog(pc,"propertyKeys",ire.getMessage());
            return Collections.emptyList();
        }
    }

    private static void addLog(PropertyContainer pc, String property, String message) {
        logs.append(String.format("%s.%s %s%n",pc,property,message));
    }

}

On 29.07.2012 at 11:38, Axel Morgner wrote:

...

Denis Mikhalkin  
From: Denis Mikhalkin <denismikhal...@gmail.com>
Date: Tue, 31 Jul 2012 05:42:24 -0700 (PDT)
Local: Tues, Jul 31 2012 8:42 am
Subject: Re: [Neo4j] Physical DB size

Looking at
http://3.bp.blogspot.com/__Sn-iXmVbEI/TLDLADnUwbI/AAAAAAAAADU/WoqsZHQ...
(perhaps outdated but I hope still relevant), I'd say there are many other
options, including inlining of properties/relationships, nodes/relationships
without properties, replacing empty IDs with an "absent" bit flag, taking
into account adjacency of relationships, a column-based property store, and
"packing" IDs and numbers. I'm sure other people will have more suggestions.
I'll raise a GitHub issue for this as requested.

I don't know whether I have a particular issue with store size, but I do
see some slow performance which flattens after a number of similar queries
which suggests disk caching (have not verified though) so I was thinking
smaller DB size would certainly be faster for random queries. I've reduced
the number of properties that I use and that shaved off 400MB, so I think
I'll revisit the graph structure later once my queries are stable to remove
unnecessary nodes/rels/props. Would be nice to have some help from Neo4J on
this - something like "cold spots" report (or even a "mark unused"
operation) which would highlight the parts of the structure (props, rels,
nodes, indices) which are never ever going to be touched by a set of
queries.

The option of pre-caching all nodes/relationships would probably not
work in the long run: my queries are spatial and temporal, so they have
a certain locality, and with not enough memory for the whole DB I hope it'll
get cached naturally based on that locality. I'd rather have the full index
cached, plus some property columns, as I need to perform range "where" queries.

Thanks.

Denis


 
Peter Neubauer  
From: Peter Neubauer <peter.neuba...@neotechnology.com>
Date: Wed, 1 Aug 2012 14:24:08 +0200
Local: Wed, Aug 1 2012 8:24 am
Subject: Re: [Neo4j] Physical DB size
Denis,
yes, some of what you suggest is already in, like the inlining of small
properties. We are also going to merge in better handling of dense nodes
after 1.8 GA, and from there I think you can feel free to
experiment with more optimizations. Comments and spikes are much
appreciated!

Cheers,

/peter neubauer

G:  neubauer.peter
S:  peter.neubauer
P:  +46 704 106975
L:   http://www.linkedin.com/in/neubauer
T:   @peterneubauer

Wanna learn something new? Come to @graphconnect.

On Tue, Jul 31, 2012 at 2:42 PM, Denis Mikhalkin


 