Storage of field names?


Martin Häusler

Jul 8, 2022, 7:46:10 AM
to wiredtiger-users
Hello everyone,

I'm curious about the storage of document nodes. Every document (within a collection) can theoretically have a totally different schema. The straightforward way to store a document node is to convert it into BSON, treat the byte array as an atomic value, and insert it into the B-Tree with the document ID as the key. However, this seems rather wasteful, because in realistic use cases we end up repeating the same property names over and over for every document. In SQL storage engines, this problem does not exist because columns have fixed sizes and cells can be addressed via simple index offsets.

How (if at all) does WiredTiger address this issue? Does it rely on block-level compression to get rid of the property name duplicates? Or is the overhead so small that it doesn't affect storage footprint and performance in practice?

Thank you!

Martin

Keith Smith

Jul 8, 2022, 8:33:14 AM
to wiredtig...@googlegroups.com
Hi Martin,

You've got it exactly right. At the WiredTiger level, a MongoDB collection is stored as a BTree indexed by the document ID and with the BSON documents as record values. WiredTiger reduces the overhead of repeated property names by compressing blocks of data when it writes them to storage. I believe MongoDB is using skippy compression by default, although there is also support for zstd and possibly others.
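
To make the effect of block-level compression on repeated property names concrete, here is a rough sketch. It is not how WiredTiger or MongoDB implement this. The assumptions: `encode_bson_strings` is a hypothetical helper that only handles flat documents with string values, the field names and values are made up, a simple concatenation of documents stands in for a block, and `zlib` from the Python standard library stands in for snappy (which is not in the standard library), so the actual ratios will differ.

```python
import struct
import zlib


def encode_bson_strings(doc):
    """Encode a flat dict of str -> str as a BSON document.

    Hypothetical helper for illustration; only BSON string elements
    (type 0x02) are supported.
    """
    body = b""
    for name, value in doc.items():
        raw = value.encode("utf-8") + b"\x00"  # string payload plus NUL
        body += (
            b"\x02"                          # element type: UTF-8 string
            + name.encode("utf-8") + b"\x00"  # field name as a C string
            + struct.pack("<i", len(raw))     # int32 length, includes the NUL
            + raw
        )
    # The int32 total size counts itself and the trailing NUL terminator.
    return struct.pack("<i", 4 + len(body) + 1) + body + b"\x00"


# Made-up documents: identical field names, different values.
docs = [
    {
        "customer_first_name": f"n{i}",
        "customer_last_name": f"s{i}",
        "shipping_address_city": f"c{i}",
    }
    for i in range(1000)
]

# Stand-in for a block of records: the encoded documents back to back.
block = b"".join(encode_bson_strings(d) for d in docs)

# Bytes spent purely on field names (each name plus its NUL terminator).
name_bytes = sum(len(k.encode("utf-8")) + 1 for d in docs for k in d)

# zlib is used here only as a stand-in for snappy.
compressed = zlib.compress(block)

print("uncompressed bytes:", len(block))
print("field-name bytes:  ", name_bytes)
print("compressed bytes:  ", len(compressed))
```

In this toy data set the field names account for more than half of the uncompressed bytes, and a general-purpose compressor removes most of that redundancy because the same names recur throughout the block.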

I don't know if we've done a precise comparison of the net storage space used by compressed BSON compared to a traditional relational schema. I do know that when MongoDB started using WiredTiger about seven years ago, one of the big selling points was the reduced data size due to compression.

Thank you for your interest and questions. I hope this explanation helps.

Keith 


Keith Smith

Jul 8, 2022, 9:07:46 AM
to wiredtig...@googlegroups.com
> I believe MongoDB is using skippy compression by default, although there is also support for zstd and possibly others.

Oops. That should have said "snappy compression." Part of my brain must have been thinking of the Skippy paper from SIGMOD 2008 :-)

Keith

Martin Häusler

Jul 8, 2022, 10:14:34 AM
to wiredtiger-users
Thank you for the quick reply Keith, that fully answers my question :)