Comments from a new implementor


Jay McCarthy

Mar 19, 2010, 5:52:28 AM
to BSON
I don't anticipate BSON will change, but here are some comments after
implementing it yesterday:

http://jay-mccarthy.blogspot.com/2010/03/bson.html

Michael Dirolf

Mar 19, 2010, 9:53:56 AM
to bs...@googlegroups.com
Jay,
I think that's a great post - the original design for BSON is courtesy
of Dwight/Eliot so maybe they have some more thoughts, but I think
some of the points you bring up are valid. One point about prepending
document size is that it makes the format easy to traverse (including
embedded docs, etc.). This is one of the core goals for the format
(even if it can be a bit annoying to implement in some languages).
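
The length prefix means a whole document can be slurped off a stream in two reads, with no byte-by-byte scanning. A minimal Python sketch of that traversal (illustrative only, not taken from any driver):

```python
import struct
from io import BytesIO

def read_document(stream):
    # The leading int32 (little-endian) is the total document size,
    # including the 4 length bytes themselves and the trailing null.
    header = stream.read(4)
    (length,) = struct.unpack("<i", header)
    body = header + stream.read(length - 4)
    assert body[-1:] == b"\x00"  # document terminator
    return body

# b"\x05\x00\x00\x00\x00" is the encoding of the empty document {}
doc = read_document(BytesIO(b"\x05\x00\x00\x00\x00"))
```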

As for factoring - I think I agree, not sure what the reasoning is
behind the current ordering. There are definitely some things that are
more verbose than necessary as well, agreed.

If a BSON v2 ever comes out I'd expect to see some of these gripes
addressed, but I wouldn't expect to see that switch anytime soon ;).

Thanks again for sharing your thoughts,
Mike


Jay McCarthy

Mar 19, 2010, 11:18:30 AM
to BSON

On Mar 19, 7:53 am, Michael Dirolf <m...@10gen.com> wrote:
> Jay,
> I think that's a great post - the original design for BSON is courtesy
> of Dwight/Eliot so maybe they have some more thoughts, but I think
> some of the points you bring up are valid. One point about prepending
> document size is that it makes the format easy to traverse (including
> embedded docs, etc.). This is one of the core goals for the format
> (even if it can be a bit annoying to implement in some languages).

I have no objections to it. I think it would be nice to make it more
DB friendly. Does MongoDB internally reorder the elements to make them
more searchable?

> As for factoring - I think I agree, not sure what the reasoning is
> behind the current ordering. There are definitely some things that are
> more verbose than necessary as well, agreed.

Regarding the ordering, one side effect of the current way is that
\x00 is essentially a type tag that means "No more elements". If the
order was as I mentioned, then the parser loop would look instead for
an empty string as an element name. That means, of course, that the
order I proposed would disallow such element names. I'm not sure if
that's an intended feature of the current spec.

> If a BSON v2 ever comes out I'd expect to see some of these gripes
> addressed, but I wouldn't expect to see that switch anytime soon ;).

I don't expect it either! :P

Jay

Michael Dirolf

Mar 19, 2010, 11:44:44 AM
to bs...@googlegroups.com
On Fri, Mar 19, 2010 at 11:18 AM, Jay McCarthy <jay.mc...@gmail.com> wrote:
> On Mar 19, 7:53 am, Michael Dirolf <m...@10gen.com> wrote:
>> Jay,
>> I think that's a great post - the original design for BSON is courtesy
>> of Dwight/Eliot so maybe they have some more thoughts, but I think
>> some of the points you bring up are valid. One point about prepending
>> document size is that it makes the format easy to traverse (including
>> embedded docs, etc.). This is one of the core goals for the format
>> (even if it can be a bit annoying to implement in some languages).
>
> I have no objections to it. I think it would be nice to make it more
> DB friendly. Does MongoDB internally reorder the elements to make them
> more searchable?

No, the server doesn't do any reordering or anything like that. Adding
ordering or some other mechanism to make documents easier to search
would be an interesting thing to experiment with. I think one issue
there would be that it would take BSON a little bit further away from
the JSON spec (not necessarily a deal-breaker though IMO). It would
also create some trickiness because currently we enforce that command
"verbs" come first in documents (for the same reason - so we don't
have to search every key). Could certainly work around that as well
though.

>> As for factoring - I think I agree, not sure what the reasoning is
>> behind the current ordering. There are definitely some things that are
>> more verbose than necessary as well, agreed.
>
> Regarding the ordering, one side effect of the current way is that
> \x00 is essentially a type tag that means "No more elements". If the
> order was as I mentioned, then the parser loop would look instead for
> an empty string as an element name. That means, of course, that the
> order I proposed would disallow such element names. I'm not sure if
> that's an intended feature of the current spec.

I think allowing the empty key is important - but that last null byte
is already a bit superfluous since we have the document length prefix.
So reordering wouldn't necessarily break anything as far as I can
tell.

Dwight Merriman

Mar 19, 2010, 4:26:00 PM
to bs...@googlegroups.com
the original intent with BSON was that the keys are in a well-defined order.  in mongodb, we generally keep the keys in the order they arrive in.  some drivers map bson straight to orderless key/value dictionaries.  the JSON RFC says something grey about this, something like "SHOULD" maintain order.

karl

Mar 24, 2010, 4:15:38 PM
to BSON
You don't have to walk the data structure twice to generate the
length. In our implementation for Norm (a MongoDB driver in C#) we put
a placeholder and then seek back to the location to write the final
length once we have it. The result was cleaner and considerably faster
code.

I just extracted the bson stuff to its own library, you can check it
out at:
http://github.com/kseg/Metsys.Bson/blob/master/Metsys.Bson/Serializer.cs

(pay particular attention to the NewDocument and EndDocument
implementations).
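
The placeholder trick karl describes is easy to show in miniature. A hedged Python sketch (function names invented here, not Norm's actual API):

```python
import struct
from io import BytesIO

def write_document(buf, write_elements):
    # Reserve 4 bytes for the length, write the body, then seek back
    # and patch in the real size; no second pass over the data.
    start = buf.tell()
    buf.write(b"\x00\x00\x00\x00")             # length placeholder
    write_elements(buf)                        # caller writes the elements
    buf.write(b"\x00")                         # document terminator
    end = buf.tell()
    buf.seek(start)
    buf.write(struct.pack("<i", end - start))  # patch the real length
    buf.seek(end)

buf = BytesIO()
write_document(buf, lambda b: None)  # empty document {}
```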

James Newton-King

Mar 24, 2010, 9:40:41 PM
to BSON
The two main issues I've encountered with BSON that impact the end
user experience are:

- No way to specify a date time's time zone. If the user converts a
date time with a time zone to BSON it is easy enough to convert it to
UTC and get the ticks, but converting back again you have no
knowledge of what the time zone originally was, so the user always
gets UTC
- No way to specify in BSON whether the root "document" is an object
or an array


~ James
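
The time-zone loss is easy to demonstrate without any BSON library: storing only UTC milliseconds (which is all the BSON date type carries) drops the original offset. A Python sketch:

```python
from datetime import datetime, timezone, timedelta

# A date-time with an explicit -05:00 offset (hypothetical user input).
tz = timezone(timedelta(hours=-5))
original = datetime(2010, 3, 24, 12, 0, tzinfo=tz)

# BSON stores only milliseconds since the epoch, in UTC.
millis = int(original.timestamp() * 1000)

# Round-tripping gives the same instant, but the -05:00 offset is gone.
roundtripped = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
```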

Dwight Merriman

Mar 25, 2010, 12:17:21 PM
to bs...@googlegroups.com
very good point regarding the root document

i think there should be a rev of bson at some point; we should all participate in deciding what changes, and we should be very careful before making any change



Joe

May 12, 2010, 5:27:02 PM
to BSON
I agree with this.
The document size is really annoying. It makes it impossible to
open a stream and blast out a document without having to seek

Dwight Merriman

May 12, 2010, 5:41:16 PM
to bs...@googlegroups.com
Not sure i understand - i mean, i could see why someone wants to stream giant objects (which it doesn't do right now), but putting the size up front if the object does fit in ram is normally not too hard?  you just leave space for it and when done with the buffer fill it in.  That is what the C++ BSONObjBuilder class does.

Mathias Stearn

May 12, 2010, 5:53:26 PM
to bs...@googlegroups.com
It is a trade-off of some efficiency at serialization/write time for a
lot of efficiency at deserialization/read time. If the size wasn't
included it would be very inefficient to slurp a single object off the
network for example.


Joe

May 13, 2010, 1:00:40 PM
to BSON
It's inefficient only for languages where you need to allocate memory:
C#, Java, Python, F#, Ruby, etc. It's inefficient and cumbersome to
have the size upfront.

I'm also finding it annoying to have to write typeID, name, value.
name, typeID, value would be much nicer as well.

Mathias Stearn

May 13, 2010, 1:33:56 PM
to bs...@googlegroups.com
On Thu, May 13, 2010 at 1:00 PM, Joe <lord...@gmail.com> wrote:
> It's inefficient only for languages where you need to allocate memory.
> C#, Java, Python, F#, Ruby etc. Its inefficient and cumbersome have
> the size upfront.

It's not just about allocating memory. In basically all languages, when
you read from the network you need to specify a size to read. If you
don't know the size up-front, you need to do byte-by-byte scanning for
a terminator, which is slow.

Also, all languages need to allocate memory; some just make it
implicit. Just because your language handles this for you doesn't make
it more efficient.

> I'm also finding it annoying have to write typeID name value. name
> typeid value would be much nicer as well.

I felt the same way at first. While it makes the serialization
interface for Map<String, Value> a bit more annoying, it does make it
work better as an in-memory data structure. Additionally, it has the
added benefit of making the null terminator a valid element with a
type of 0 (EOO).

In general, most of the "oddities" of BSON can be explained by its use
as an in-memory data structure, one of the unique features of BSON vs.
other formats. Of course there are also some legacy decisions that
can't be changed because of backwards compatibility (Binary subtype 2
is the best example of this).

Joe

May 13, 2010, 4:49:15 PM
to BSON
> Its not just about allocating memory. In basically all languages when
> you read from the network you need to specify a size to read. If you
> don't know the size up-front, you need to do byte-by-byte scanning for
> a terminator which is slow.

You wouldn't need to read byte by byte. You only need to read one byte
before each type, which you must be doing already to know what each
element type is.
For example { "Hello" : "world" } is
2Hello6world0

ReadByte (type byte)
ReadString "Hello"
ReadInt32 (string size)
Read(string size)
ReadByte (end of doc)

so pseudo code'ish

while ((b = ReadByte()) != 0)
{
    switch (b)
    {
        ...
        // read in chunks
    }
}

Throw in a buffered stream, which is already reading large blocks from
the network, and everything is probably in memory anyway, so the
occasional ReadByte won't hurt anything.
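
Joe's loop, fleshed out a little in Python for the { "Hello" : "world" } example (a partial sketch that handles only the string type; names are invented):

```python
import struct

def scan_elements(doc):
    # Walk elements by type byte alone, never consulting the length
    # prefix; a type byte of 0x00 doubles as the end-of-document marker.
    i = 4                                   # skip the int32 length prefix
    elements = []
    while doc[i] != 0:                      # ReadByte: 0 means end of doc
        t = doc[i]; i += 1
        end = doc.index(b"\x00", i)         # null-terminated element name
        name = doc[i:end].decode("utf-8"); i = end + 1
        if t == 0x02:                       # string: int32 size, then bytes
            (size,) = struct.unpack_from("<i", doc, i); i += 4
            value = doc[i:i + size - 1].decode("utf-8")  # drop trailing null
            i += size
        else:
            raise NotImplementedError(hex(t))
        elements.append((name, value))
    return elements

# { "Hello" : "world" } encoded by hand, total length 22 (0x16):
doc = (b"\x16\x00\x00\x00"                  # document length
       b"\x02" b"Hello\x00"                 # type 0x02 (string), element name
       b"\x06\x00\x00\x00" b"world\x00"     # string size 6, value
       b"\x00")                             # end of document
```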




Mathias Stearn

May 13, 2010, 5:05:09 PM
to bs...@googlegroups.com
You are assuming that you want to parse the whole object and all
nested objects as they come off the wire. In some cases it's easier to
just grab a BSON object and send it somewhere else to be parsed
on-demand. It's definitely useful to be able to completely skip over a
whole nested object or array, which you couldn't do if objects didn't
have a size prefix.
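
That skip is a single jump thanks to the size prefix. A partial Python sketch (only a few types handled, for illustration):

```python
import struct

def skip_value(doc, offset, type_byte):
    # Skip one value without parsing it.  Embedded documents and arrays
    # (types 0x03 / 0x04) carry their own int32 size, so the whole
    # subtree is skipped in one hop.
    if type_byte in (0x03, 0x04):           # embedded document / array
        (size,) = struct.unpack_from("<i", doc, offset)
        return offset + size
    if type_byte == 0x01:                   # double: fixed 8 bytes
        return offset + 8
    if type_byte == 0x10:                   # int32: fixed 4 bytes
        return offset + 4
    raise NotImplementedError(hex(type_byte))

# Skipping over an embedded empty document (5 bytes) lands right after it.
after = skip_value(b"\x05\x00\x00\x00\x00", 0, 0x03)
```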

Michael Ashton

May 14, 2010, 1:42:47 PM
to BSON


On May 13, 2:05 pm, Mathias Stearn <math...@10gen.com> wrote:
> You are assuming that you want to parse the whole object and all
> nested objects as they come off the wire. In some cases its easier to
> just grab a bson object and send it somewhere else to be parsed
> on-demand. Its definitely useful to be able to completely skip over a
> whole nested object or array, which you couldn't do if objects didn't
> have a size prefix.

Heartily concur, for the in-memory case. Imagine reading BSON from a
slow serial flash memory, or even just an on-disk file -- it could be
really prohibitive to scan each and every byte looking for a
terminator, but you could use offsets and lengths to quickly seek
around the file.

I also agree that writing lengths and such up front can be very
inconvenient for generating streams. Maybe there could be an option:
for example, a zero length would indicate that the block was
terminated by some magic 4-byte sequence, which would be escaped if it
appeared in the file. This could be converted as necessary.

But if BSON is meant more for memory (random access) than for
streaming (iteration), it should maybe stick with lengths and offsets.

--mpa
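
One way the magic-terminator option could be realized is classic byte-stuffing, shown here SLIP-style with a 1-byte terminator rather than the 4-byte sequence Michael suggests (purely illustrative; not part of any BSON spec):

```python
# Hypothetical framing: END terminates a document on the wire, ESC
# escapes any literal END/ESC bytes inside the payload.
END, ESC, ESC_END, ESC_ESC = b"\xc0", b"\xdb", b"\xdc", b"\xdd"

def frame(payload: bytes) -> bytes:
    # Escape ESC first, then END, so the inserted escape bytes stay intact.
    stuffed = payload.replace(ESC, ESC + ESC_ESC).replace(END, ESC + ESC_END)
    return stuffed + END

def unframe(framed: bytes) -> bytes:
    body = framed[:-1]                      # drop the END terminator
    return body.replace(ESC + ESC_END, END).replace(ESC + ESC_ESC, ESC)

# Round-trips even when the payload contains the magic bytes themselves.
roundtrip = unframe(frame(b"\x05\x00\x00\x00\x00"))
```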