OrientDB Storage Overhead


Eric24

May 27, 2016, 10:06:00 AM
to OrientDB
I have a few questions about storage overhead in ODB. If this is in the documentation somewhere, I've not been able to find it.
  1. What is the overhead (in bytes) to store any document (V or E), regardless of any property data (and not including indices, if any)?
  2. If non-mandatory properties are defined in the schema but not created/stored, is there any per-document overhead for those properties?
  3. For a given property type (BYTE, SHORT, INTEGER, STRING, etc.) what is the overhead for each (i.e. a SHORT is two bytes, but how many actual bytes are written to storage for each SHORT in a given document)?
  4. Are there any best-practices to minimize storage overhead (obviously, using the smallest property type for the job is key, but beyond that)?
I'm asking because I have an application that needs to store a very large number (billions) of relatively small documents (i.e. they have only three or four SHORT or INTEGER values), so storage overhead becomes very important in planning for system scalability.

My current database architecture uses lightweight links between the "root/owner" vertexes (of which there are only hundreds of thousands) and the "data" vertexes (of which there will be billions over time). I had also considered doing this using embedded documents, except that "data" vertexes will sometimes (often) need to link to other "data" vertexes, not just from the "root" to the "data", so using lightweight edges seemed like a better approach. If anyone has any insight or comments on this, I'd love to hear them.

--Eric

scott molinari

May 28, 2016, 2:13:55 AM
to orient-...@googlegroups.com
I can't help much, but I do remember reading that the records are padded with space. You can find that info here (towards the bottom). 


I know this kind of "pre-allocation" technique is necessary to allow for a flexible schema, i.e. adding properties to records later on or updating records with more data than was there before. As I understand it, if a record occupied exactly its own size on disk, then adding data to it (making the record larger) would force the database to move the record on disk instead of updating it in place. You can imagine that if you then update a lot of records this way, you'd end up with a huge mess fast and the database would slow down considerably. To avoid that, the database pre-allocates space per record. ODB has the setting RECORD_GROW_FACTOR. In MongoDB, they recommend and set as default what they call "powersOfTwo": each record's allocation is rounded up to the next power of two, so a document can take up to about twice its size on disk. This is what is explained in the example in the docs.

As I take it from the docs, the settings for record size can be changed through configuration. If you know your record size will never change, you could drop the values to "1". However, I could imagine that if you do that and then increase the data size even a little in a good number of records, it will not sit well with the database. Though I am no expert on that.

I'd also like to know the overhead values of the data types otherwise. Would be great basic knowledge of the database. If one of the nice gents from Orient would lay it out here, I'd be even glad to add it to the documentation. It would be a great addition to this table: http://orientdb.com/docs/latest/Types.html

Scott
 

Eric24

May 28, 2016, 11:56:10 AM
to OrientDB
Thanks Scott. The RECORD* params definitely address the discrepancy between the data written and my observed disk space growth (I left them at their defaults, which the docs say is 1.2). So, setting these to 1 would essentially remove any fluff from the record. That's potentially good, and since it's settable on a per-cluster basis, it's something that can easily be left to the database architect, based on their knowledge of how a particular class/cluster will be used (i.e. zero/few updates vs. lots of frequent updates, as well as the kind of updates).

In my particular case, at the time the record is initially written, the "whole" record will be known, but there will be two possible update scenarios: 1) data updated "in place" with no size change (i.e. updating the value of an INTEGER); and 2) adding a lightweight edge. I'll assume that scenario #1 does not fragment the record, since its storage size doesn't change (is that right?), but what happens in the case of adding an edge (or several)? In that case, I assume that each edge will increase the size of the record, but by how much?

What might be ideal is a way to specify on a per-record (or per-cluster) basis a specific number of padding bytes, when this is known in advance. Here again, for many database applications, adding padding is probably a good idea (although MongoDB's 2X recommendation seems pretty wasteful), but for applications that are storing billions of records, that overhead adds up quickly (disk space may be "cheap", but anything multiplied by 1B is still a lot).
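
As a rough illustration of how padding scales with record count, here is a minimal sketch. It assumes each record is allocated its own size multiplied by a grow factor; the 40-byte record size and the function name `padding_bytes` are made up for the example and are not measured ODB values.

```python
def padding_bytes(record_size: int, grow_factor: float, record_count: int) -> int:
    """Total bytes spent on padding, ASSUMING each record is allocated
    record_size * grow_factor bytes (illustrative model, not ODB internals)."""
    allocated = round(record_size * grow_factor)
    return (allocated - record_size) * record_count


RECORDS = 1_000_000_000  # one billion records
RECORD_SIZE = 40         # hypothetical record size in bytes

for factor in (1.0, 1.2, 2.0):
    extra = padding_bytes(RECORD_SIZE, factor, RECORDS)
    print(f"factor {factor}: {extra / 1e9:.0f} GB of padding")
```

Under these assumptions, a factor of 1.2 costs 8 GB of padding per billion records and a factor of 2 costs 40 GB, on top of the 40 GB of actual record data.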

Any thoughts on how COMPRESSION may help (or hurt) this? I assume it would be very efficient at removing fluff from a record, but I've also seen comments that would suggest that COMPRESSION isn't very efficient. My guess is that the "padding" is applied after the compression, since the point of the padding is to leave some free physical space in the disk storage (i.e. compressing after padding would result in most updated records taking up more physical space, which defeats the purpose). Can anyone from Orient explain this in more detail--specifically how COMPRESSION relates to the physical disk storage and padding?

--Eric

Eric24

May 28, 2016, 12:17:57 PM
to OrientDB
As a follow-up, I found this very interesting article: http://carloprad.blogspot.it/2014/03/orientdb-on-zfs-performance-analysis.html

The concept (as it relates to disk space usage, which wasn't the primary focus of this analysis) is to essentially move the compression to the file system (ZFS in this case). It also seems to come to the conclusion that ODB's built-in COMPRESSION setting is not very useful. But the file system compression approach may be the best overall solution. In fact, I could envision "aggressive" padding settings (maybe the 2X or more) to leave bigger "virtual holes" at the ODB storage engine level (to prevent record splitting), while leaving the efficient use of physical disk storage to the file system.

--Eric



Eric24

May 28, 2016, 12:36:48 PM
to orient-...@googlegroups.com
And a specific question for Orient: How does the Class OVERSIZE parameter relate to the Cluster RECORD* parameters? Which takes precedence or overrides the other? Or are both factors used together? Also, I'm assuming that OVERSIZE is also a "factor" (i.e. 2 = allocate 2X of the original written record size, or 100% padding)?

--Eric

scott molinari

May 28, 2016, 1:35:33 PM
to OrientDB
All questions I'd like to know the answer to too.

Scott

Luca Garulli

May 30, 2016, 10:32:28 AM
to OrientDB
Hi guys,
Oversize is a per-class setting, but it is computed per record. So if you do this:

INSERT INTO Employee set name = 'Luca'

If the record is, for example, 100 bytes and oversize is 2, OrientDB will store 200 bytes, 100 of which are padding. On any further update where the new size is <= 200 bytes, the record is just updated in place; otherwise it is stored in a new location (with space to reuse).
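
The arithmetic in this example can be sketched as follows. This is only a model of the rule as described here, not OrientDB code: the function names are hypothetical, and treating a factor below 1 as "no padding" is an assumption.

```python
def allocated_size(record_size: int, oversize: float) -> int:
    """Bytes reserved for a record under the described rule:
    record size multiplied by the oversize factor.
    ASSUMPTION: a factor below 1 is treated as 'no padding'."""
    return max(record_size, int(record_size * oversize))


def fits_in_place(new_size: int, allocated: int) -> bool:
    """True if an updated record still fits in its reserved space."""
    return new_size <= allocated


allocated = allocated_size(100, 2)    # 100-byte record, oversize 2
print(allocated)                      # 200
print(fits_in_place(180, allocated))  # True: updated in place
print(fits_in_place(201, allocated))  # False: stored in a new location
```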

In the future we could change the underlying storage, so this oversize technique could end up being ignored. I suggest you test different settings to see whether oversize benefits your use case or not.



Best Regards,

Luca Garulli
Founder & CEO


--

---
You received this message because you are subscribed to the Google Groups "OrientDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to orient-databa...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Eric24

May 30, 2016, 12:06:15 PM
to OrientDB
Thanks Luca, but the specific question around OVERSIZE is how it relates to the cluster parameter RECORD_GROW_FACTOR. They seem to effectively do the same thing and appear to be usable interchangeably. Does one take precedence over the other? Are both factors applied (i.e. does the actual padding = OVERSIZE * RECORD_GROW_FACTOR)? Can you expand on this?

Either way it seems like there is some value in letting ODB include a fair amount of padding and let the file system (like ZFS) manage the compression. Any comments on that strategy?

Also, can you shed any light on the questions about overhead by property type as well as overhead in the record itself, regardless of what data (if any) is stored in the record?

--Eric

Eric24

Jun 3, 2016, 9:00:00 AM
to OrientDB
Luca (or someone from ODB)--can you provide some additional details on this? If it's in the documentation, I can't find it, and I think these are important things to know:
  1. How does OVERSIZE relate to the cluster parameter RECORD_GROW_FACTOR?
  2. What is actually stored on disk when a new record is written (per-record and per-property)?
  3. What overhead is incurred by storing a dynamic non-schema-defined property (i.e. how is the name of the property stored)?
  4. Does it incur any per-record overhead to define a non-mandatory property in the schema if that property has not been assigned a value?


Andrey Lomakin

Jun 3, 2016, 9:25:41 AM
to OrientDB
Hi
>How does OVERSIZE relate to the cluster parameter RECORD_GROW_FACTOR?
This cluster parameter is deprecated and not used.
>What is actually stored on disk when a new record is written (per-record and per-property)?
We save the record when a property inside the record changes.
>What overhead is incurred by storing a dynamic non-schema-defined property (i.e. how is the name of the property stored)?
A record is more compact with schema-defined properties than with properties not defined in the schema, because we store property IDs instead of field names.
>Does it incur any per-record overhead to define a non-mandatory property in the schema if that property has not been assigned a value?
Do you mean whether we add additional information to the record if a field is absent from the record itself but defined in the schema? No, we do not add any overhead.

--
Best regards,
Andrey Lomakin, R&D lead. 
OrientDB Ltd

twitter: @Andrey_Lomakin 

Eric24

Jun 3, 2016, 9:49:54 AM
to OrientDB
Thanks Andrey! Let me clarify a few of your answers...

So RECORD_GROW_FACTOR is deprecated? I assume that applies to RECORD_OVERFLOW_GROW_FACTOR too?
(I sure wish the documentation could be kept up-to-date--ODB is a very complex system, which is what makes it so appealing, but lack of complete documentation is so frustrating!)

Regarding "What is actually stored on disk when a new record is written (per-record and per-property)?", what I'm asking about is overhead that's actually written to the disk when you save a record, per-record (i.e. a record with a single INTEGER property takes up more disk space than the 4 bytes of the INTEGER--what else is there as "overhead"?) and per-property (i.e. does every property itself take up only the bytes needed to store its actual data? Probably not. So what else is written, by property type, as "overhead"? For example, you say you use IDs of schema-defined properties--so how many bytes is a property ID? Also, do some property types have more overhead than just the ID?)

--Eric

Eric24

Jun 3, 2016, 10:07:09 AM
to OrientDB
PS - Also, it appears that OVERSIZE == 0 by default (per: select expand(classes) from metadata:schema)? Is it a "factor" (i.e. base-record-bytes * OVERSIZE) or a number of additional padding bytes to be added?

Eric Lenington

Jun 6, 2016, 12:20:14 AM
to OrientDB
Another sub-question on this topic: I assume that if you don't add the property at all, then it adds zero bytes to the record--correct? But if you add a fixed-size property (i.e. SHORT, INTEGER, LONG, etc.) to a record with its value explicitly set to NULL, does that take up the same amount of space as storing a value? I'm asking because of what happens next: If you later update that NULL property and set it to a non-NULL value, does that change the total size of the record or does it update "in place"?



scott molinari

Jun 6, 2016, 2:11:18 AM
to OrientDB
Didn't Andrey answer that question with the last answer? 

>Do you mean whether we add additional information to the record if a field is absent from the record itself but defined in the schema? No, we do not add any overhead.

Scott   

Eric Lenington

Jun 6, 2016, 8:57:59 AM
to OrientDB
The first part, yes (I'm just stating my understanding that properties defined in the schema that are not added to a record incur no overhead). The second part is new--whether there is a difference between not adding a property and setting it to NULL. In MSSQL, for example, there is no difference, but my experimentation with ODB suggests that there is. So the question is whether an INTEGER (for example) property set to NULL takes up the same storage space as the same property set to 12345.


scott molinari

Jun 6, 2016, 9:34:26 AM
to OrientDB
Ahh! Gotcha. I am also very interested in the answer.

Scott

Andrey Lomakin

Jun 8, 2016, 2:24:12 AM
to OrientDB
Hi,
>The second part is new--whether there is a difference between not adding a property and setting it to NULL
Yes, there is a difference. In the latter case, the name of the property or its ID is added to the record.


Eric Lenington

Jun 8, 2016, 9:47:04 AM
to OrientDB
OK. So the space used is just the ID itself? How many bytes is the ID?




Andrey Lomakin

Jun 8, 2016, 9:54:15 AM
to OrientDB
Depends on ID value, typically 1-2 bytes.
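
One way an ID's stored size can depend on its value is a variable-length integer encoding. The sketch below uses a generic base-128 varint (7 payload bits per byte, high bit as a continuation flag). Whether OrientDB encodes property IDs exactly this way is an assumption; the thread only states that the size depends on the ID value.

```python
def varint_encode(value: int) -> bytes:
    """Encode a non-negative integer as a generic base-128 varint.
    Illustrative only; NOT confirmed to be OrientDB's exact format."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F          # take the low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte, high bit clear
            return bytes(out)


for property_id in (5, 127, 128, 16383):
    print(property_id, len(varint_encode(property_id)), "byte(s)")
# 5 and 127 take 1 byte; 128 and 16383 take 2 bytes
```

With this kind of encoding, small IDs fit in a single byte and larger ones spill into a second byte, which would be consistent with the "typically 1-2 bytes" figure.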