A couple suggesitons for BSON

327 views
Skip to first unread message

alvaro

unread,
Mar 17, 2012, 1:34:18 PM3/17/12
to bs...@googlegroups.com
I have some remarks about the BSON format's arrays:

Arrays in BSON are stored as documents (objects) whose properties are integer numbers written as text. I think that adds unnecessary informations, as arrays have consecutive indices so that they don't need to be stored. Also, indices have variable length ("1" is shorter than "230") which makes random access to elements hard even for fixed size values.

Packing array elements together without their indices would save space, simplify encoding/decoding and allow faster access to individual elements.

Without indices array elements would still have a tag for their type, but very often array elements are all the same type (usually numerical data). An additional improvement would be to specify a new array type "Homogenous Array" for these arrays.

In homogenous arrays the element type would be stored only once at the beginning.

This could be something like this: (new types to keep backward compatibility)

"\x13" e_name array  // contiguous array (no indices)

"\x14" e_name element_type untyped_array // homogenous contiguous array

where array is like a "document" but in which elements don't have an e_name (because they are known to be "0", "1", "2", etc.)

and untyped_array is like an array but in which elements don't even have a type tag, because all are equal and that type is placed at the beginning (element_type).

This should save space and time.

Any thoughts?

Claudio Bisegni

unread,
Mar 17, 2012, 6:18:18 PM3/17/12
to bs...@googlegroups.com
I agree I had post, some hour ago, the same proposal for array.

Bernd Späth

unread,
Mar 18, 2012, 3:14:15 PM3/18/12
to BSON
Sorry to disagree on this.

The main point of TLV-style encoding ist neither compactness nor
speed, but its simplicity.

In my opinion none of the advantegs to expect would give rise to
sacrifice the simplicity property of the whole format.

Regarding access-times you shouldn't confuse the type BSON-array with
C-style arrays.
The major increase in Speed O(1) instead of O(n) ist due to the fact,
that given _fixed_ element sizes and tight packing the n-th element
will be accessable by simple pointer arithmetic using element-size *
index as an offset to the first array element.

Unfortunately the assumption of fixed size elements doesn't hold true
for the BSON-array type, but only for a small subset of possible
element-types:

* null, true, false : none of this would seem to make any sense as
the type
of a homogenous array as every single value would have to be the
same value.

* Floating point (0x01), UTC datetime (0x09), Timestamp (0x11), 64-
bit integer (0x12)
all used a fixed size of 64-bit - so agreed on this.

* 32-bit integer (0x10) and ObjectId (0x07) use fixed sizes of 32-
bit and 12 bytes
agreed on this

* any other of the other encodable types don't share a fixed length,
so an increase in
speed would still be impossible.

As it turns out on the one hand just in a very small subset any gain
in speed would be to expect, while on the other supporting faster
access times for fixed size-elements if you are really in need to is
already possible given the Binary/Generic subtype (0x00) of the Binary
data type (0x05).

E.g: use the first byte to encode type-information of your homogenous
array, and insert values with bson_little_endian32 or
bson_little_endian64 depending on the given type as mentioned above
(i.e.: 32 for 32-bit ineger values, 64 for floating point, UTC
datetime, Timestammp and 64-bit integer).


Talking about size again this issue is more or less outside the scope
of TLV-encoding schemes. You might want to take a look at ASN.1,
especially in its PER encoding if you are going for a space-efficient
encoding.

Just to give an example of what I am talking about:

Even if you allowed true an false to be shared in a homegounous array
- you would have to define a new BSON-type for this(!) you would still
be faced with the fact of wasting 8 times the space required compared
to the same data stored in a bitfield.

Claudio Bisegni

unread,
Mar 18, 2012, 3:28:41 PM3/18/12
to bs...@googlegroups.com
partially agree,
for heterogeneous array i agree, but for  homogenous array not much.
Yes it's possible to use subtype for encode the type but this is not BSON compliant. BSON is great also because it bring information about data structure. If i encode into the binary type some structured data with the only subtype i can't' reconstruct it.
Thinking to a new type of array, for homogenous data,  i could store the structure and then all the data. shire some other little information need to be encoded for reorganize offset of the key internally to the vector data. 

Alvaro Segura

unread,
Mar 18, 2012, 5:06:03 PM3/18/12
to bs...@googlegroups.com
Let me explain this in another way, by example.

I'm thinking about applications containing numerical data. E.g:

- a list of countries or regions, each including in a property a large
array of coordinates defining its boundaries.
- 3D models stored as documents with structure, which include large
arrays of vertices, normals, face indices, etc. Thousands of elements.
- statistical data of any kind


but which needs to be interpretable by Mongo queries, updates, etc. (so
binary does not help).


For example, these are the first two points in the boundary of Algeria
(the complete contour would be much longer):

[771230.894373, 4422896.962001, 804804.852796, 4451159.130080]

in current BSON this would be stored including an "index value" for each
: so, conceptually as if it was this:

{"0": 771230.894373, "1": 4422896.962001, "2": 804804.852796, "3":
4451159.130080}

(for a complete detailed contour indices would grow much more: ...
"3542": 131633.628071, "3543": 4371332.547494 ...)

That redundancy is what kind of bothers me: Storing indices for
non-sparse arrays. And as text.

The new array type I propose would omit the redundant 'e_name' of each
element (but would keep an int32 size at the beginning).

Following the example, each array element also identifies its type. So
we have something comparable to:

{"0": double 771230.894373, "1": double 4422896.962001, "2": double
804804.852796, "3": double 4451159.130080}

In this case, all those types could be moved to the beginning as they
are all "double". Thus saving significant space. This is the special
"homogenous array" type I further propose.

BTW, true and false both have type ID \x08 (boolean), or am I missing
something?. A hypothetical homogenous "array of booleans" would have
element type \x08 and its content would be a sequence of \x00 and \0x01
(for false or true elements).


About random access through pointer arithmetic: yes, obviously I'm
talking about elements of the same size like those you mention. But this
advantage is small, I know.

And well, I agree, the format specification simplicity would probably
suffer with these changes. Just wanted to discuss these issues.


Best regards

El 18/03/2012 20:14, Bernd Sp�th escribi�:

Bernd Späth

unread,
Mar 18, 2012, 8:54:40 PM3/18/12
to BSON
Thanks for the example. I think I got the point now. Maybe you should
have included it in your first post - sometimes idiots like me need
them.

I merely saw the BSON as format for information interchange rather
than a means of database storage.

Wasting at least half of the amount of valuable data storage used up
by the "homogenous" fixed-size arrays you describe doesn't seem to be
a too clever idea.

Even worse the idea of unnecessarily having to toss around megabytes
of data just to query and update single element prooves the suggested
binary encoding scheme to be a non-solution to your problem.

Given the requirements of your example I completely agree with your
observation, that using the BSON-encoding as binary storage format
seems to be more or less impractical.

Nevertheless the whole argument seems to concern using the BSON-format
as internal data representation format. Are you sure they really use
it this way?

Even if they did, I'm not quite sure if the BSON-spec would be the
right place to fix.

Effectively changing the format of information interchange won't
change any of the data structures used internally.
Nothing would prevent any application from transforming an extended
encoding format into the old subset and contiunue to store data as
before.
On the other hand even sticking to the classic encoding-format any
application could transform the data into a somewhat more optimal
internal data format.


As it seems you even got me twice: Though two different JSON-types
true and false both encode to the same BSON-type. Noting missed by
you, as the fault was all mine.

So no type-invention would be necessary here.
Unfortunately it still wouldn't save any of the factor 8 overhead
mentioned in my erroneous
statement above.

best regards
Reply all
Reply to author
Forward
0 new messages