What do propertystore.db and propertystore.db.strings store exactly?


Zongheng Yang

Jul 16, 2015, 2:01:54 PM
to ne...@googlegroups.com
Hi all,

Quick question: what do propertystore.db and propertystore.db.strings store, respectively?

My CSV headers look like these:

edges -- :START_ID, :END_ID, :TYPE, timestamp:LONG, attr
nodes -- :ID, name0, ..., name39

And propertystore.db totals 10GB, propertystore.db.strings totals 17GB.  I did a quick calculation: assuming those two files store serialized JVM Strings, all the node properties should total 6GB in memory and all the edge properties should total 17GB in memory.  The first number doesn't match the size of propertystore.db, so I am a bit confused.

Thanks in advance,
Zongheng

Chris Vest

Jul 17, 2015, 8:56:11 AM
to ne...@googlegroups.com
The propertystore.db file also has metadata about which entities a property belongs to, what the property names are, what type the value of a property has, and where to find the property values in cases where those are stored in other files such as propertystore.db.strings.

--
Chris Vest
System Engineer, Neo Technology
[ skype: mr.chrisvest, twitter: chvest ]


--
You received this message because you are subscribed to the Google Groups "Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email to neo4j+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Zongheng Yang

Jul 17, 2015, 11:20:08 AM
to ne...@googlegroups.com
Thanks!  Some follow-ups.  My current understanding is:

(1) propertystore.db:  metadata (as you mentioned) + possibly inlined short strings/fields [otherwise, pointers]
(2) propertystore.db.strings:  long string properties

Does this sound right?  Also, node properties & relationship properties are interleaved and stored together in these files, right?

Lastly -- is everything in (1) and (2) stored as serialized JVM objects in raw bytes, *or* just as UTF-8 characters?  It could make a difference: if Neo4j needs to create a new String object out of the bytes read from these files, then the memory footprint could be larger than the on-disk file size due to object overhead.

Cheers,
Zongheng


Chris Vest

Jul 18, 2015, 6:05:15 AM
to ne...@googlegroups.com
> Does this sound right?  Also, node properties & relationship properties are interleaved and stored together in these files, right?

Yes and yes.

> Lastly -- is everything in (1) and (2) deserialized JVM objects in raw bytes *or* just UTF-8 characters?  It could make a difference, since if neo4j needs to create a new String object out of the bytes read from these files, then the memory footprint could be larger than the on-disk file size due to object overhead.

We use a dozen different encodings depending on the contents of the given string. It’s not like compression, but it does reduce space usage in many cases. The embedded API deals in String objects, so we have to serialise and deserialise to support that. If you set cache_type=none, then the overhead of the String objects should be low as there would be a lot fewer of them.
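For reference, a minimal sketch of that setting as it would appear in a 2.2-era neo4j.properties file (the file path and accepted values may differ across versions; check your version's configuration reference):

```
# conf/neo4j.properties (Neo4j 2.2.x, assumed location)
# Disables the object cache, trading heap footprint for
# per-read deserialisation cost.
cache_type=none
```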


--
Chris Vest
System Engineer, Neo Technology
[ skype: mr.chrisvest, twitter: chvest ]


Zongheng Yang

Jul 18, 2015, 3:20:43 PM
to ne...@googlegroups.com
If those are serialized String objects then I'm seeing the following mismatch between measurement and calculation:

The graph I'm using has 4.9 million nodes, each of which has 40 string properties (each 16 characters long).  It has 70 million directed edges, each of which has 1 string property of 140 characters.

Assuming JVM String objects incur a 2x overhead, the total in-memory size of these properties is: (40*16*4.9*10^6 + 70*10^6*140) * 2 / 2^30 ≈ 24 GB.  This roughly matches the on-disk footprint:

10G     neostore.propertystore.db
17G     neostore.propertystore.db.strings

So I think this matches Chris's explanation well (these two store files are serialized String objects).
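As a sanity check, that back-of-envelope arithmetic can be reproduced directly (a sketch; the 2x factor is the assumed JVM String overhead from this thread, not a measured constant):

```python
# Back-of-envelope property-size estimate for the graph described above.
nodes = 4_900_000          # 4.9 million nodes
node_props = 40            # string properties per node
node_chars = 16            # characters per node property
edges = 70_000_000         # 70 million directed edges
edge_chars = 140           # characters in the single edge property

raw_chars = nodes * node_props * node_chars + edges * edge_chars
overhead = 2               # assumed ~2x overhead on top of raw characters
total_gb = raw_chars * overhead / 2**30
print(f"{total_gb:.1f} GB")  # roughly 24 GB
```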

However, after this warmup [1] to load the whole graph including node & relationship properties, the JVM heap memory usage is: Max 68.6 GB, Allocated 67.6 GB, Used 55.9 GB. 

Where does this mismatch (56GB vs. < 30GB) come from?  What's wrong in my calculation & understanding?  It cannot be the other stores (node / relationship) as `du -shc *store.db*` returns 29GB total on-disk, 27GB of which are the properties.


Any help would be appreciated!

Zongheng

Mattias Persson

Jul 21, 2015, 5:03:58 AM
to ne...@googlegroups.com
To clarify, it's not serialized String objects.  Neo4j stores the character data either compacted, by using a smaller charset than UTF-8/ASCII so that fewer bits per character are required, or, if the string is "long" by Neo4j's measures, as plain characters in the neostore.propertystore.db.strings store.  The long strings may have much bigger overhead, since the character data is quantized into 60/120-byte records.  That's probably the inflation you're seeing.
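A rough way to picture that quantization overhead (a sketch, not Neo4j's actual record layout; the 60/120-byte record sizes come from the message above, and 2 bytes per character is an assumed UTF-16-style encoding):

```python
import math

def quantized_size(n_chars, bytes_per_char=2, record_size=120):
    """Bytes consumed when character data is rounded up to whole records."""
    payload = n_chars * bytes_per_char
    records = math.ceil(payload / record_size)
    return records * record_size

# A 140-character string: 280 payload bytes -> 3 records -> 360 bytes,
# i.e. ~29% of the stored bytes are rounding waste.
print(quantized_size(140))                   # 360
print(quantized_size(140, record_size=60))   # 300
```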

Zongheng Yang

Jul 21, 2015, 4:27:40 PM
to ne...@googlegroups.com
Thanks, Mattias.  If I understand you correctly, my previous *2 calculation roughly matching the two store files on disk is not due to JVM String overhead; rather, it is due to Neo4j's quantization overhead.

There's one loose end I wish to tie up: why does ~27GB of string characters -- which already contain the wasted quantized bytes -- become ~56GB on the JVM heap?  My only hypothesis is that the wasted quantization bytes somehow also incur a *2 overhead, just as the useful bytes do when turned into String objects.  Is this right?  If so, why do the useless bytes incur this overhead?

Thanks,
Zongheng

Mattias Persson

Jul 22, 2015, 3:25:34 AM
to Neo4j Development
On Tue, Jul 21, 2015 at 10:27 PM, Zongheng Yang <zongh...@gmail.com> wrote:
> Thanks, Mattias.  If I understand you correctly, the result of my previous *2 calculation roughly matching the two store files on disk is not due to JVM String overhead, rather it is due to Neo4j's quantization overhead.

> There's one loose end that I wish to tighten: why does ~27GB of string characters -- which already contain wasted quantized bytes -- become ~56GB on JVM heap?  My only hypothesis is that these wasted bytes used in quantization somehow also roughly incur a *2 overhead, just as the useful bytes do when turned into String objects.  Is this right? If so, why do the useless bytes incur this overhead?

Yes, when loaded (in 2.2) such properties will be kept as String objects, that's correct... so that's probably responsible for the 2x overhead.  Sorry, I thought you were talking about storage overhead.  Anyway, that's how 2.2 works.  2.3 will have reduced heap overhead since it will have no "object" caching like that.
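To make the String-object cost concrete, here is a rough heap estimator (a sketch under assumed per-object costs: the 16-byte object header, 16-byte array header, and 8-byte alignment are typical HotSpot figures, not measurements from this deployment, and Neo4j's actual cache bookkeeping is not modeled):

```python
def string_heap_bytes(n_chars,
                      obj_header=16,   # assumed String object header + fields
                      ref=8,           # reference to the backing char[]
                      arr_header=16,   # assumed char[] header
                      align=8):
    """Rough heap cost of one java.lang.String (UTF-16, 2 bytes/char)."""
    def pad(n):
        return -(-n // align) * align  # round up to the alignment boundary
    string_obj = pad(obj_header + ref)
    char_array = pad(arr_header + 2 * n_chars)
    return string_obj + char_array

short = string_heap_bytes(16)    # node property values
long_ = string_heap_bytes(140)   # edge property values
total = 4_900_000 * 40 * short + 70_000_000 * long_
print(f"{total / 2**30:.1f} GB")  # roughly 34 GB
```

Under these assumptions the String objects alone land in the mid-30s of GB, well above the ~24 GB of raw UTF-16 characters; per-property and per-entity cache structures on top of that could plausibly account for the rest of the ~56 GB observed.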




--
Mattias Persson
Neo4j Hacker at Neo Technology

Zongheng Yang

Jul 27, 2015, 1:53:57 AM
to Neo4j Development
Got it, thanks for clearing things up!  I guess setting cache_type to 'none' does the trick for now (albeit incurring a higher query latency). 