Encoding

Lonny K

Oct 7, 2010, 4:13:44 PM

to BSON

Why does the BSON spec say strings are UTF-8?

Why does it matter?

Kyle Banker

Oct 7, 2010, 4:21:43 PM

to bs...@googlegroups.com

We believe, as do many, that UTF-8 is the sanest way to encode Unicode strings. Beyond that, we want strings to be consistently encoded, since it simplifies the code that operates on those strings, performing sorts, comparisons, etc.
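To illustrate the point about sorts and comparisons, here is a minimal Python sketch (not taken from any BSON driver). It relies on a general property of UTF-8: comparing encoded bytes gives the same order as comparing code points, which is not true of UTF-16. The two sample characters are chosen only for demonstration.

```python
# Two code points on either side of the Basic Multilingual Plane boundary.
a = "\uffff"        # U+FFFF
b = "\U00010000"    # U+10000

# Python compares str values by code point, so a sorts before b.
codepoint_order = a < b

# UTF-8: a is EF BF BF, b is F0 90 80 80, so byte order matches.
utf8_order = a.encode("utf-8") < b.encode("utf-8")

# UTF-16 big-endian: a is FF FF, but b becomes the surrogate pair
# D8 00 DC 00, which sorts *below* FF FF. The byte order flips.
utf16_order = a.encode("utf-16-be") < b.encode("utf-16-be")

print(codepoint_order, utf8_order, utf16_order)  # True True False
```

With a single mandated encoding, code that sorts or compares strings can work on the raw bytes without first checking how each value was encoded.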

Scott Hernandez

Oct 7, 2010, 4:22:27 PM

to bs...@googlegroups.com

On Thu, Oct 7, 2010 at 1:13 PM, Lonny K <lon...@gmail.com> wrote:
> Why does the BSON spec. say strings are UTF-8.
>
> Why does it matter?

Two words: Character Encoding (http://en.wikipedia.org/wiki/Character_encoding)

http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings

Justin Dearing

Oct 7, 2010, 4:31:43 PM

to bs...@googlegroups.com

Kyle,

Playing devil's advocate, a text file with no byte order mark (BOM) is assumed to be UTF-8. In theory, if string encoding in BSON were determined by a BOM, as it would be for a text file, nothing would break.

Explicitly, I'm suggesting that for UTF-16 and UTF-32 a BOM be placed in front of every string value in a BSON document.

Now, I'd never make use of such a feature, and I agree UTF-8 is the way to go. I also realize this might lead to the evil UTF-8 BOMs getting added to "normal" BSON documents. However, if some people want to do it, I say let them.
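As a rough sketch of what a reader would have to do under the BOM idea above, the Python below sniffs a BOM on each string value and falls back to UTF-8 when none is present. This is not part of the BSON spec or any driver; `decode_with_bom` is a hypothetical helper, and only the BOM constants from Python's standard `codecs` module are real.

```python
import codecs

# Check the UTF-32 BOMs first: the UTF-32 LE BOM (FF FE 00 00) begins
# with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom(raw: bytes) -> str:
    """Hypothetical decoder: pick the encoding from a leading BOM."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # Strip the BOM and decode the rest with the matching codec.
            return raw[len(bom):].decode(encoding)
    # No BOM: treat the bytes as UTF-8, as in the proposal.
    return raw.decode("utf-8")


print(decode_with_bom(codecs.BOM_UTF16_LE + "héllo".encode("utf-16-le")))
print(decode_with_bom("héllo".encode("utf-8")))
```

Even this small version shows some of the cost: every consumer needs the sniffing step, and a UTF-16 LE value whose first character is U+0000 is indistinguishable from a UTF-32 LE BOM.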

Justin

Dwight Merriman

Oct 7, 2010, 6:20:10 PM

to bs...@googlegroups.com

it's a fair question.

the Java BSON library takes these strings and does UTF-8 -> Unicode Java string conversions, and vice versa. so it assumes a certain encoding is there.
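For illustration, here is a minimal Python sketch of that round trip for a single BSON string value. It follows the spec's layout for strings (a little-endian int32 byte count that includes the trailing NUL, then the UTF-8 bytes, then a NUL byte). The function names are made up for this example and are not the Java driver's API.

```python
import struct


def encode_bson_string(text: str) -> bytes:
    """Hypothetical helper: native string -> BSON string bytes."""
    payload = text.encode("utf-8")
    # The int32 length counts the payload bytes plus the trailing NUL.
    return struct.pack("<i", len(payload) + 1) + payload + b"\x00"


def decode_bson_string(data: bytes) -> str:
    """Hypothetical helper: BSON string bytes -> native string."""
    (length,) = struct.unpack_from("<i", data, 0)
    # Skip the 4-byte length prefix and drop the trailing NUL.
    payload = data[4:4 + length - 1]
    # This step is where the decoder has to assume UTF-8.
    return payload.decode("utf-8")


encoded = encode_bson_string("hé")
print(encoded)                      # b'\x04\x00\x00\x00h\xc3\xa9\x00'
print(decode_bson_string(encoded))  # hé
```

The `decode("utf-8")` call is the assumption in question: without a fixed encoding, the decoder would need some other signal to know how to interpret the bytes.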

as mentioned elsewhere in this thread, maybe future incarnations of BSON will have some enhancements in this area.
