[BSON] Well-defined Key Order

205 views
Skip to first unread message

Juan

unread,
Nov 28, 2010, 10:41:12 AM11/28/10
to BSON
Hello!

First off, BSON is awesome! Thanks for defining the spec!! now, it's
late, so straight to the meat:

I believe very strongly that the spec should add the requirement that
Document keys be ordered. This has been mentioned previously in other
posts, but I figured it deserves it's own topic. So be it!

The main problem is that without ensuring the key order, BSON
documents can have multiple representations. As was mentioned two days
ago by Martin Kou, hmac (and other crypto functions) need this in
order to properly sign documents. In addition, distributed
applications cannot use hashes to compare whether documents should be
sent.

Back in March, there was discussion on the topic, and Dwight Merriman
stated that the original intent with BSON was to keep the keys in a
well-defined order. He mentioned that in mongodb, the keys are kept in
the order they arrive in. I imagine this is good for growing documents
in memory or disk: leave some buffer space and you simply append new
values to the end. At the same time, appending to nested documents
would shift the whole thing, so the savings probably aren't amazing
for larger docs. It's also not great for implementations in various
languages where straight translation to a dictionary looses the
order.

Perhaps keeping keys sorted may be the way to go. Though this sucks
for keeping things in memory, it probably would help searching large
docs. I know the BSON limit is 4Mb, but that's plenty of space for
nested meta data. I also saw somewhere there are plans to increase it
to 16Mb?

Whether keys are ordered by insertion or actually sorted, i think
there MUST be some order. So I wanted to spark off discussion about
it, and see if we can get a solution that works well.

Thoughts?

Cheers!
Juan

Scott Hernandez

unread,
Nov 28, 2010, 11:07:14 AM11/28/10
to bs...@googlegroups.com
This seems like more of a question for MongoDB and less for the BSON spec.

In some applications the key order might be significant and defining
that BSON always have a pre-defined order would be harmful.

As for the usage in mongodb I would agree that documents should have
pre-defined order of keys (or always be sorted) before any comparison,
or storage. This means that many of the clients/drivers would not have
to be so careful about the order of the keys sent to the server, and
many problems and much confusion could be avoided. Unfortunately it
will negatively affect performance, and not just when storing the
data.

Dwight Merriman

unread,
Nov 28, 2010, 11:08:45 AM11/28/10
to bs...@googlegroups.com
i'd be ok with making it explicit in BSON that bson is ordered.

i think having hte keys always sorted would work less well imo.  sometimes, not always, people expect JSON to maintain key order : we wouldn't be able to do that anymore.

Juan Batiz-Benet

unread,
Nov 28, 2010, 12:05:21 PM11/28/10
to bs...@googlegroups.com
On Nov 28, 2010, at 8:07 AM, Scott Hernandez wrote:

> This seems like more of a question for MongoDB and less for the BSON spec.

Not really. I am using BSON in various applications that do not use MongoDB. Trading objects in distributed systems benefits significantly from using hashes to stay in synch, rather than sending whole objects for comparison.

I'll give you one example: I'm working on a framework for parallelizing algorithms. We were using JSON, but since we care about performance, we may move to BSON. However the data exchange hinges on using sha hashes to avoid sending large datasets over multiple times.

> In some applications the key order might be significant and defining
> that BSON always have a pre-defined order would be harmful.

Could you give a concrete case of this?

> As for the usage in mongodb I would agree that documents should have
> pre-defined order of keys (or always be sorted) before any comparison,
> or storage. This means that many of the clients/drivers would not have
> to be so careful about the order of the keys sent to the server, and
> many problems and much confusion could be avoided. Unfortunately it
> will negatively affect performance, and not just when storing the
> data.

I think when dealing with sending queries to a server, the performance hit for ordering the query should be on the client side. I agree that it is a pain and confusing for users to have to specify queries or objects in a specific order. So what if the user specifies some query or object using a dictionary, not paying attention about the order, then the client code orders the keys according to the pre-established order (when dumping the dictionary into BSON), and delivers the (correctly ordered) BSON doc to the server. Thus, the server would see no performance hit from giving users the flexibility to specify keys in any order (at the dictionary level).

Juan Batiz-Benet

unread,
Nov 28, 2010, 12:26:17 PM11/28/10
to bs...@googlegroups.com
On Nov 28, 2010, at 8:08 AM, Dwight Merriman wrote:

i'd be ok with making it explicit in BSON that bson is ordered.

awesome!


i think having hte keys always sorted would work less well imo.  sometimes, not always, people expect JSON to maintain key order : we wouldn't be able to do that anymore. 

Is this case significantly more prevalent than the benefit of being able to search documents quickly? personally, i'd trade tweaking my api for searching significantly faster. 

What about other approaches?

Would specifying two modes be confusing? As in, letting the document pick between a JSON style insertion-based order, and a sorted order.

Or perhaps sorted documents should just be a different element type. That way, those that want to build applications where all documents are sorted, can have all their BSON docs contain a single nested sorted document. We lose two bytes, but gain a lot of expressive power, and the default document order can be JSON style order.

Juan

Scott Hernandez

unread,
Nov 28, 2010, 1:03:13 PM11/28/10
to bs...@googlegroups.com
On Sun, Nov 28, 2010 at 9:05 AM, Juan Batiz-Benet
<jbe...@cs.stanford.edu> wrote:
> On Nov 28, 2010, at 8:07 AM, Scott Hernandez wrote:
>
>> This seems like more of a question for MongoDB and less for the BSON spec.
>
> Not really. I am using BSON in various applications that do not use MongoDB. Trading objects in distributed systems benefits significantly from using hashes to stay in synch, rather than sending whole objects for comparison.
>
> I'll give you one example: I'm working on a framework for parallelizing algorithms. We were using JSON, but since we care about performance, we may move to BSON. However the data exchange hinges on using sha hashes to avoid sending large datasets over multiple times.

And how does your question apply to these applications? How should the
spec. change? Are you just asking for order-preserving processing for
implementations that touch the documents as they transform them?

Nothing in the spec defines comparison operations or ordering.

>> In some applications the key order might be significant and defining
>> that BSON always have a pre-defined order would be harmful.
>
> Could you give a concrete case of this?

Sure, take a document where metadata is in the front of the document
and there are embedded docs that the metadata applies to. It is much
more efficient to not have keys in some pre-defined order (like having
been sorted) where the metadata may end up at the end, not the front
because of reordering.

Imagine the mp3 format converted to bson, for example.

>> As for the usage in mongodb I would agree that documents should have
>> pre-defined order of keys (or always be sorted) before any comparison,
>> or storage. This means that many of the clients/drivers would not have
>> to be so careful about the order of the keys sent to the server, and
>> many problems and much confusion could be avoided. Unfortunately it
>> will negatively affect performance, and not just when storing the
>> data.
>
> I think when dealing with sending queries to a server, the performance hit for ordering the query should be on the client side. I agree that it is a pain and confusing for users to have to specify queries or objects in a specific order. So what if the user specifies some query or object using a dictionary, not paying attention about the order, then the client code orders the keys according to the pre-established order (when dumping the dictionary into BSON), and delivers the (correctly ordered) BSON doc to the server. Thus, the server would see no performance hit from giving users the flexibility to specify keys in any order (at the dictionary level).

In order to ensure the ordering you need to re-sort every place where
you don't know for sure that the order is correct, like data coming
from outside of your library/code. You cannot assume that the client
will send the "correctly ordered" BSON doc to the server. This will be
especially true as soon as the spec changes. For that matter, maybe
the doc should embed the spec version for compatibility, and to
address these issues.

At the moment order is preserved in the mongodb implementations
(server + drivers) and that order is important. For example, all
commands require the name of the command to be the first field.

{getLastError:1, fsync:1} is different than {fsync:1, getLastError:1};
the latter doesn't work.

As Dwight said, making that part of the spec makes sense and it clears
up the issue for those that implement the spec; it is also (mostly)
backwards compatible.

Juan Batiz-Benet

unread,
Nov 29, 2010, 6:43:07 AM11/29/10
to bs...@googlegroups.com
thanks for the discussion. This is very useful :)

On Nov 28, 2010, at 10:03 AM, Scott Hernandez wrote:

> And how does your question apply to these applications? How should the
> spec. change? Are you just asking for order-preserving processing for
> implementations that touch the documents as they transform them?

Yes. Ensuring order-preservation would be a lot easier if the order was based on sorting, but, as Dwight mentioned, JSON prefers the order to be insert based.

> Nothing in the spec defines comparison operations or ordering.

Correct, so a host can turn a BSON doc into some other representation, then back into BSON and now the BSON representations are no longer the same, at the byte level. Given BSON's "binary object" nature, it makes most sense to simply compare the bytes directly. Requiring that the object be parsed is not good, imo.

> Sure, take a document where metadata is in the front of the document
> and there are embedded docs that the metadata applies to. It is much
> more efficient to not have keys in some pre-defined order (like having
> been sorted) where the metadata may end up at the end, not the front
> because of reordering.

meta-data keys are inherently different than regular data keys. If they're at the same level (same document), then why not prepend those keys with a symbol to denote they are special? Take for instance, mongodb's query operators, where a $ is prepended.

>
> Imagine the mp3 format converted to bson, for example.

I don't really see your point here. The entire mp3 file itself is a big array of Header/Data segments, which is not a problem as arrays in BSON already preserve order. whether the file should be in a 2nd level proper bson array, or be in the top level doc where the keys are the index (basically just how BSON arrays are currently implemented) is up for debate. For instance, i could see an mp3 BSON format like so:

>> { 'mp3data' : [ [ header, data ], [ header, data ], ... ] }


where data is a binary object, and header is either a document outlining the fields:

>> { 'version' : 1, 'layer' : 1, 'error' : 1, 'bitrate' : 160, 'frequency' : 0, 'pad' : 0, ' }


or (for space efficiency) an array (taking advantage of the well-defined order of the mp3 header fields):

>> [1, 1, 1, 160, 0, 0]


or (for MORE space efficiency) a binary object ... (but then what is the point of converting to BSON in the first place?!):

>> 0xFFFBA040 (including the sync word and with proper field widths)


Your example brings up a very good point: cases were a particular preconceived order is crucial tend to be well-defined formats (e.g. mp3, ip packets). These formats, wich are have a particular binary representation, translated to BSON will always see a space efficiency drop (BSON being schemaless). In addition, since the field order is well-defined and could be encoded either as an array (without field keys, maximizing space efficiency) or a document (maximizing ease of use, e.g. mp3data[0]['version'] = 1). if encoded as a dictionary, applications will be accessing based on keys, so the order of the keys in memory doesn't seem to matter.

> In order to ensure the ordering you need to re-sort every place where
> you don't know for sure that the order is correct, like data coming
> from outside of your library/code. You cannot assume that the client
> will send the "correctly ordered" BSON doc to the server. This will be
> especially true as soon as the spec changes. For that matter, maybe
> the doc should embed the spec version for compatibility, and to
> address these issues.

If order is part of the spec, then you CAN assume it. That's what prompted this request.
You bring up a very important point. Where could it go?

If it became part of the doc definition, then it would also appear in embedded docs, which is redundant. (allowing different bson version docs to be embedded seems like a huge mess in the long run.). Perhaps only the first doc must specify it, and embedded docs inherit when unspecified? this isn't clean...

prepending an additional byte to the whole bson representation would cause a lot of driver rewriting, but then again, a whole new version would anyway...


> At the moment order is preserved in the mongodb implementations
> (server + drivers) and that order is important. For example, all
> commands require the name of the command to be the first field.
>
> {getLastError:1, fsync:1} is different than {fsync:1, getLastError:1};
> the latter doesn't work.

Why not? is mongodb using pre-defined byte offsets? What is the benefit of using docs instead of arrays then? If it is only for readability, then what governs the ordering? semantics? As you mentioned earlier, i think requiring an order is confusing for users.

> As Dwight said, making that part of the spec makes sense and it clears
> up the issue for those that implement the spec; it is also (mostly)
> backwards compatible.

Sure, then drivers should not simply turn docs into the language's native dictionary, as these tend to not preserve insertion order.

imo it's a lot cleaner if the order were to be sorted. With insertion order, if i turn a BSON object into some other format and then turn it back, im screwed. format-to-format specific encoding (as opposed to format - native language dict - format) would need to be in place, which seems less than ideal going forward. Then again, substantial existing code assumes insertion order.

Andrew Rondeau

unread,
Nov 29, 2010, 3:19:39 PM11/29/10
to BSON
Sorting BSON documents based on key won't happen because there are
implementations that depend on key order for processing efficiency.

For example, in languages where BSON documents are mapped to classes
with inheritance hierarchies; keeping type determination values at the
beginning of the document is critical for performance. We've discussed
this quite a bit regarding 10gen's official C# driver's type-mapping.
It puts type-mapping values at the beginning of the document so it can
deserialize as efficiently as possible.

Although my applications do not depend on key order; my understanding
is that the better drivers allow you to manipulate key order; thus
your applications can generate documents with sorted keys if you want
to.

I must admit that I'm interested in guidelines so that, given a
document, it's always serialized and deserialized to-from the same
byte representation. Specifically, some kind of guidelines so that the
same semantic document will always serialize to the same bytes no
matter what language or driver.

Scott Hernandez

unread,
Nov 29, 2010, 3:27:07 PM11/29/10
to bs...@googlegroups.com
On Mon, Nov 29, 2010 at 12:19 PM, Andrew Rondeau
<andrew....@gmail.com> wrote:
> Sorting BSON documents based on key won't happen because there are
> implementations that depend on key order for processing efficiency.
>
> For example, in languages where BSON documents are mapped to classes
> with inheritance hierarchies; keeping type determination values at the
> beginning of the document is critical for performance. We've discussed
> this quite a bit regarding 10gen's official C# driver's type-mapping.
> It puts type-mapping values at the beginning of the document so it can
> deserialize as efficiently as possible.
>
> Although my applications do not depend on key order; my understanding
> is that the better drivers allow you to manipulate key order; thus
> your applications can generate documents with sorted keys if you want
> to.
>
> I must admit that I'm interested in guidelines so that, given a
> document, it's always serialized and deserialized to-from the same
> byte representation. Specifically, some kind of guidelines so that the
> same semantic document will always serialize to the same bytes no
> matter what language or driver.

All drivers have support for OrderedMap/Hash/ListOfPairs (where the
insertion order is kept). If they don't then they can't generically
support commands. These are generally called a BSONDocument, or
Document.

Andrew Rondeau

unread,
Nov 30, 2010, 3:10:10 AM11/30/10
to BSON, jbe...@cs.stanford.edu
I've been experimenting with adapting Salmon's Magic Signatures to
BSON. Basically, it uses SHA256 signatures to verify that a document
came from who it came with; and thus needs the same kind of byte-for-
byte serialization/deserialization logic that Juan's hashing needs.

I'm currently calling it Signed BSON.

http://appfeeds.repositoryhosting.com/hg_public/appfeeds/objectcloud-signedbson

Originally, in order to achieve this, I stored the signed document in
a byte array within a parent document. In response to this thread, I
altered my code to merely store a BsonDocument or a strongly-typed
object. After my changes, all of my unit tests pass. My code is C#.

I still haven't tested my implementation for cross-language
compatibility; although assuming that there are no glaring errors in
the drivers, and that the implementations preserve the same document
order that I used in C#, everything will work.

Thus, as my library demonstrates, there's no need to sort
BsonDocuments by key order in order to hash correctly. Everything
should work assuming that implementations in other languages keep
items in documents in the correct order. Of course, if you don't
believe me, I'm perfectly willing to assist someone in porting Signed
BSON to another language that supports SHA256 signatures. ;)

So, I can conclude, and hopefully other readers can conclude, that
hashing and other cryptographic use of BSON merely requires using a
well-written driver and properly using the APIs to use so that key
order is preserved when re-serializing. There's no need to alter the
specification and break all of the programs that rely on specified key
order in order to support hashing and other cryptographic use of BSON.

On Nov 28, 7:41 am, Juan <jbe...@cs.stanford.edu> wrote:

Juan Batiz-Benet

unread,
Nov 30, 2010, 4:10:40 AM11/30/10
to Andrew Rondeau, BSON
On Nov 30, 2010, at 12:10 AM, Andrew Rondeau wrote:

Thus, as my library demonstrates, there's no need to sort
BsonDocuments by key order in order to hash correctly. Everything
should work assuming that implementations in other languages keep
items in documents in the correct order. Of course, if you don't
believe me, I'm perfectly willing to assist someone in porting Signed
BSON to another language that supports SHA256 signatures. ;)

You are assuming that the object only ever lives as a BSON Document object in the particular driver implementation. Converting to another format and back (like to a native language dictionary) yields BSON docs that may not be equivalent. 

In addition, Not all drivers keep a BSON Document. To access the object, they convert the document to a native language dictionary and return it, discarding all sense of order. The reason i noticed this problem was because both drivers for ObjC that I tried transform bson representations straight into dictionaries. 

Sure, you can claim we need better drivers, but this does not solve inter-format exchanges. While JSON lives on, we'll see a lot of BSON to JSON conversion, and though the JSON RFC calls for preserving order, a lot of JSON driver implementations do not. One of the beautiful aspects of JSON and BSON is the ability to express the object as a simple native dictionary, and many, many drivers make use of this. The reason JSON order-careless drivers don't really have a problem with this is that when testing for equivalence between JSON docs, applications do not usually compare bytes, they compare the proper objects. Let order alone, JSON allows whitespace between tokens, so when getting a JSON object from some other source they do not control, most apps i've seen do not trust the byte representation; they parse it. This is inefficient, and for BSON, this is far from ideal.

In the end, this problem doesn't really affect my apps particularly. For my purposes, i've established the convention to sort my native dictionary keys before serializing to BSON. Thus, just like Andrew, I can go back and forth easily, and hash or encrypt as I please. This thread is not asking to break all applications out there. It is merely to highlight a weakness and to attempt to fortify the spec going forward. In particular, I hope BSON better supports distributed systems, where conventions beyond the proper spec (e.g. always insert the key 'type' first) will be confusing hurdles, and will erode the simplicity that BSON offers over protobufs or thrifty.

Cheers! 
Juan

Andrew Rondeau

unread,
Nov 30, 2010, 2:25:47 PM11/30/10
to BSON
I wouldn't break applications because of poorly-written drivers. If a
driver isn't maintaining key order, then it's not a problem with BSON.
It's a problem with the driver. Fix the driver instead of trying to
change the spec. I assume the driver is open-source?

Juan Batiz-Benet

unread,
Nov 30, 2010, 9:31:38 PM11/30/10
to bs...@googlegroups.com
On Nov 30, 2010, at 11:25 AM, Andrew Rondeau wrote:

> I wouldn't break applications because of poorly-written drivers. If a
> driver isn't maintaining key order, then it's not a problem with BSON.
> It's a problem with the driver. Fix the driver instead of trying to
> change the spec. I assume the driver is open-source?

The problem is not the drivers. The ambiguity persists even with ideal drivers; converting between formats yields unequal representations for equivalent documents.

Also:

Andrew Rondeau

unread,
Dec 1, 2010, 3:39:52 PM12/1/10
to BSON
I initially experimented with storing a serialized BSON document as
bytes within an enclosing document that also had the signature because
I had the same concern. It's really the only way that I would know
that another language would be able to verify the signature.

(This is the same problem that XML-Sig and Salmon Magic Signatures
have. Both solve it in different ways, but they technically don't sign
serialized documents without some form of processing on the part of
the signing API.)

However, if you claim that different drivers store the same data
differently, then why would sorting by key solve the problem? Wouldn't
they still store a sorted document differently?

Juan Batiz-Benet

unread,
Dec 1, 2010, 7:44:52 PM12/1/10
to bs...@googlegroups.com

On Dec 1, 2010, at 12:39 PM, Andrew Rondeau wrote:

I initially experimented with storing a serialized BSON document as
bytes within an enclosing document that also had the signature because
I had the same concern. It's really the only way that I would know
that another language would be able to verify the signature.


Yeah, for signing, I agree with your approach. I'm a fan of not intermingling keys about a document within a document (as in, not storing the signature within the actual doc, as the signature is over the entire doc. 

With order well defined, the inner document wouldn't actually need to be bytes though, there would only be one possible byte representation for a given document!

(This is the same problem that XML-Sig and Salmon Magic Signatures
have. Both solve it in different ways, but they technically don't sign
serialized documents without some form of processing on the part of
the signing API.)

XML isn't a byte level format. It allows for all sorts of things (like whitespace and lack of order for tag parameters) that make equivalence checking impossible at the direct format representation. They require the processing you talk about. But BSON, being a binary format, doesn't need to! it's beautiful.

However, if you claim that different drivers store the same data
differently, then why would sorting by key solve the problem? Wouldn't
they still store a sorted document differently?

My point about drivers is not about intermediate storage (native dicts) per se, but in the actual BSON representation they output. Without a well-defined order, they will create unequal representation for equivalent documents.

Dwight Merriman: 
I've been thinking about what it would entail to preserve key order based on insertion. When going from BSON to some other format that does not naturally preserve insertion order (e.g. dictionaries in python or objC), it would be necessary to also carry along the (additional) order information. This would put extra load on the representation in other formats, and complicates transforming them to and from BSON. On the other hand, defining the order to be based on sorted key byte values* implicitly carries with it the necessary order information.

Cheers!

Juan

* not alphabetically, since alphabetic orders have the localization mess attached to them (i.e., sorting in Russian != soring in English), as Andrew pointed out to me earlier. By sorted byte values, i mean sorting the keys taken as pure ascii strings, regardless of encoding. It will not be perfect alphabetical ordering for the less common UTF-8 symbols, but removes any localization idiosyncrasies with sorting strings. 

Andrew Rondeau

unread,
Dec 2, 2010, 3:44:16 PM12/2/10
to BSON
And my point is that your proposed change will break actual live
implementations of BSON; nor is it needed for your use cases.

Andrew Rondeau

unread,
Dec 2, 2010, 4:25:56 PM12/2/10
to BSON
And my point is that your proposed change will break actual live
implementations of BSON; nor is it needed for your use cases.

martinkou

unread,
Dec 3, 2010, 12:41:29 PM12/3/10
to BSON
While I certainly agree you can make key ordering unimportant by
adding application-specific logic to overcome the ordering difference
- hell, for argument's sake, your signature algorithm can also work as
long as you make sure the key ordering is consistent at the moment you
sign and verify the signature, but send a randomly reordered BSON over
the network just for fun!

OTOH, you can't deny that the following logic is much simpler, and
should be much faster:
signature = sign(hash_alg, private_key, consistently_ordered_bson);
// send signature and bson over network
verify_result = verify(hash_alg, public_key,
consistently_ordered_bson, signature);

And the best part is.. I can explain this to a new programmer in 1
minute, and he'll totally understand it! I prefer to minimize the
amount of magic in my code. Magic has a tendency to break (esp. after
your "spell book" has 100kLOC+ of those), and takes time to explain
and maintain.

Now, I'm not advocating changing the BSON spec + driver right now and
breaking backward compatibility just to make signatures work. As it
appears now, this requirement is pretty much limited to a specific use
case. An option in the bson codec to allow the programmer to specify
the key order could be useful here. It can also be considered for
BSONv2. Networking is still an important use case IMHO.

Andrew Rondeau

unread,
Dec 6, 2010, 4:27:17 PM12/6/10
to BSON
On Friday at the MongoDB conference I asked a developer about this
topic. During the response, he brought up an interesting use case:

In MongoDB, you can use a BSON document as the primary key in a
collection. This acts as a technique to avoid adding indexes with
range queries, because you can do a range on the first entry in the
key.

His example went something like, you can use {State, City} as a
primary key, and then do queries like, {State: "CA"} directly on the
primary key. This feature would break if the spec changes.

Anyway, food for thought.
Reply all
Reply to author
Forward
0 new messages