In some applications the key order might be significant and defining
that BSON always have a pre-defined order would be harmful.
As for the usage in mongodb I would agree that documents should have
pre-defined order of keys (or always be sorted) before any comparison,
or storage. This means that many of the clients/drivers would not have
to be so careful about the order of the keys sent to the server, and
many problems and much confusion could be avoided. Unfortunately it
will negatively affect performance, and not just when storing the
data.
> This seems like more of a question for MongoDB and less for the BSON spec.
Not really. I am using BSON in various applications that do not use MongoDB. Trading objects in distributed systems benefits significantly from using hashes to stay in synch, rather than sending whole objects for comparison.
I'll give you one example: I'm working on a framework for parallelizing algorithms. We were using JSON, but since we care about performance, we may move to BSON. However the data exchange hinges on using sha hashes to avoid sending large datasets over multiple times.
> In some applications the key order might be significant and defining
> that BSON always have a pre-defined order would be harmful.
Could you give a concrete case of this?
> As for the usage in mongodb I would agree that documents should have
> pre-defined order of keys (or always be sorted) before any comparison,
> or storage. This means that many of the clients/drivers would not have
> to be so careful about the order of the keys sent to the server, and
> many problems and much confusion could be avoided. Unfortunately it
> will negatively affect performance, and not just when storing the
> data.
I think when dealing with sending queries to a server, the performance hit for ordering the query should be on the client side. I agree that it is a pain and confusing for users to have to specify queries or objects in a specific order. So what if the user specifies some query or object using a dictionary, not paying attention about the order, then the client code orders the keys according to the pre-established order (when dumping the dictionary into BSON), and delivers the (correctly ordered) BSON doc to the server. Thus, the server would see no performance hit from giving users the flexibility to specify keys in any order (at the dictionary level).
i'd be ok with making it explicit in BSON that bson is ordered.
i think having hte keys always sorted would work less well imo. sometimes, not always, people expect JSON to maintain key order : we wouldn't be able to do that anymore.
And how does your question apply to these applications? How should the
spec. change? Are you just asking for order-preserving processing for
implementations that touch the documents as they transform them?
Nothing in the spec defines comparison operations or ordering.
>> In some applications the key order might be significant and defining
>> that BSON always have a pre-defined order would be harmful.
>
> Could you give a concrete case of this?
Sure, take a document where metadata is in the front of the document
and there are embedded docs that the metadata applies to. It is much
more efficient to not have keys in some pre-defined order (like having
been sorted) where the metadata may end up at the end, not the front
because of reordering.
Imagine the mp3 format converted to bson, for example.
>> As for the usage in mongodb I would agree that documents should have
>> pre-defined order of keys (or always be sorted) before any comparison,
>> or storage. This means that many of the clients/drivers would not have
>> to be so careful about the order of the keys sent to the server, and
>> many problems and much confusion could be avoided. Unfortunately it
>> will negatively affect performance, and not just when storing the
>> data.
>
> I think when dealing with sending queries to a server, the performance hit for ordering the query should be on the client side. I agree that it is a pain and confusing for users to have to specify queries or objects in a specific order. So what if the user specifies some query or object using a dictionary, not paying attention about the order, then the client code orders the keys according to the pre-established order (when dumping the dictionary into BSON), and delivers the (correctly ordered) BSON doc to the server. Thus, the server would see no performance hit from giving users the flexibility to specify keys in any order (at the dictionary level).
In order to ensure the ordering you need to re-sort every place where
you don't know for sure that the order is correct, like data coming
from outside of your library/code. You cannot assume that the client
will send the "correctly ordered" BSON doc to the server. This will be
especially true as soon as the spec changes. For that matter, maybe
the doc should embed the spec version for compatibility, and to
address these issues.
At the moment order is preserved in the mongodb implementations
(server + drivers) and that order is important. For example, all
commands require the name of the command to be the first field.
{getLastError:1, fsync:1} is different than {fsync:1, getLastError:1};
the latter doesn't work.
As Dwight said, making that part of the spec makes sense and it clears
up the issue for those that implement the spec; it is also (mostly)
backwards compatible.
On Nov 28, 2010, at 10:03 AM, Scott Hernandez wrote:
> And how does your question apply to these applications? How should the
> spec. change? Are you just asking for order-preserving processing for
> implementations that touch the documents as they transform them?
Yes. Ensuring order-preservation would be a lot easier if the order was based on sorting, but, as Dwight mentioned, JSON prefers the order to be insert based.
> Nothing in the spec defines comparison operations or ordering.
Correct, so a host can turn a BSON doc into some other representation, then back into BSON and now the BSON representations are no longer the same, at the byte level. Given BSON's "binary object" nature, it makes most sense to simply compare the bytes directly. Requiring that the object be parsed is not good, imo.
> Sure, take a document where metadata is in the front of the document
> and there are embedded docs that the metadata applies to. It is much
> more efficient to not have keys in some pre-defined order (like having
> been sorted) where the metadata may end up at the end, not the front
> because of reordering.
meta-data keys are inherently different than regular data keys. If they're at the same level (same document), then why not prepend those keys with a symbol to denote they are special? Take for instance, mongodb's query operators, where a $ is prepended.
>
> Imagine the mp3 format converted to bson, for example.
I don't really see your point here. The entire mp3 file itself is a big array of Header/Data segments, which is not a problem as arrays in BSON already preserve order. whether the file should be in a 2nd level proper bson array, or be in the top level doc where the keys are the index (basically just how BSON arrays are currently implemented) is up for debate. For instance, i could see an mp3 BSON format like so:
>> { 'mp3data' : [ [ header, data ], [ header, data ], ... ] }
where data is a binary object, and header is either a document outlining the fields:
>> { 'version' : 1, 'layer' : 1, 'error' : 1, 'bitrate' : 160, 'frequency' : 0, 'pad' : 0, ' }
or (for space efficiency) an array (taking advantage of the well-defined order of the mp3 header fields):
>> [1, 1, 1, 160, 0, 0]
or (for MORE space efficiency) a binary object ... (but then what is the point of converting to BSON in the first place?!):
>> 0xFFFBA040 (including the sync word and with proper field widths)
Your example brings up a very good point: cases were a particular preconceived order is crucial tend to be well-defined formats (e.g. mp3, ip packets). These formats, wich are have a particular binary representation, translated to BSON will always see a space efficiency drop (BSON being schemaless). In addition, since the field order is well-defined and could be encoded either as an array (without field keys, maximizing space efficiency) or a document (maximizing ease of use, e.g. mp3data[0]['version'] = 1). if encoded as a dictionary, applications will be accessing based on keys, so the order of the keys in memory doesn't seem to matter.
> In order to ensure the ordering you need to re-sort every place where
> you don't know for sure that the order is correct, like data coming
> from outside of your library/code. You cannot assume that the client
> will send the "correctly ordered" BSON doc to the server. This will be
> especially true as soon as the spec changes. For that matter, maybe
> the doc should embed the spec version for compatibility, and to
> address these issues.
If order is part of the spec, then you CAN assume it. That's what prompted this request.
You bring up a very important point. Where could it go?
If it became part of the doc definition, then it would also appear in embedded docs, which is redundant. (allowing different bson version docs to be embedded seems like a huge mess in the long run.). Perhaps only the first doc must specify it, and embedded docs inherit when unspecified? this isn't clean...
prepending an additional byte to the whole bson representation would cause a lot of driver rewriting, but then again, a whole new version would anyway...
> At the moment order is preserved in the mongodb implementations
> (server + drivers) and that order is important. For example, all
> commands require the name of the command to be the first field.
>
> {getLastError:1, fsync:1} is different than {fsync:1, getLastError:1};
> the latter doesn't work.
Why not? is mongodb using pre-defined byte offsets? What is the benefit of using docs instead of arrays then? If it is only for readability, then what governs the ordering? semantics? As you mentioned earlier, i think requiring an order is confusing for users.
> As Dwight said, making that part of the spec makes sense and it clears
> up the issue for those that implement the spec; it is also (mostly)
> backwards compatible.
Sure, then drivers should not simply turn docs into the language's native dictionary, as these tend to not preserve insertion order.
imo it's a lot cleaner if the order were to be sorted. With insertion order, if i turn a BSON object into some other format and then turn it back, im screwed. format-to-format specific encoding (as opposed to format - native language dict - format) would need to be in place, which seems less than ideal going forward. Then again, substantial existing code assumes insertion order.
All drivers have support for OrderedMap/Hash/ListOfPairs (where the
insertion order is kept). If they don't then they can't generically
support commands. These are generally called a BSONDocument, or
Document.
Thus, as my library demonstrates, there's no need to sort
BsonDocuments by key order in order to hash correctly. Everything
should work assuming that implementations in other languages keep
items in documents in the correct order. Of course, if you don't
believe me, I'm perfectly willing to assist someone in porting Signed
BSON to another language that supports SHA256 signatures. ;)
> I wouldn't break applications because of poorly-written drivers. If a
> driver isn't maintaining key order, then it's not a problem with BSON.
> It's a problem with the driver. Fix the driver instead of trying to
> change the spec. I assume the driver is open-source?
The problem is not the drivers. The ambiguity persists even with ideal drivers; converting between formats yields unequal representations for equivalent documents.
Also:
I initially experimented with storing a serialized BSON document as
bytes within an enclosing document that also had the signature because
I had the same concern. It's really the only way that I would know
that another language would be able to verify the signature.
(This is the same problem that XML-Sig and Salmon Magic Signatures
have. Both solve it in different ways, but they technically don't sign
serialized documents without some form of processing on the part of
the signing API.)
However, if you claim that different drivers store the same data
differently, then why would sorting by key solve the problem? Wouldn't
they still store a sorted document differently?