Considering using MongoDB for storing internationalized strings.

ananth

May 14, 2009, 3:29:28 AM

to mongodb-user

Hi Guys,

I was reading up on different key-value databases and came
across MongoDB. I have been using MySQL for the backend until now (I
guess most of us start off with it). I have never worked with a
key-value DB, but it sure seems to be a much more logical fit for me as a
programmer.

My problem is that I want to build a LAN-accessible website to store and
display all the translations for English strings. Currently each
English string has translations in 15+ languages, and soon that number
will increase to 20+ languages. There are about 60,000 English
strings. I am considering giving MongoDB a go (I am in the initial
stages) and would take to it with open arms (anything to get out of
relational databases). I just have a few questions:
1. How good is the Unicode support? (ability to store Japanese,
Korean, Chinese characters without corrupting data)
2. Does it provide Unicode string search capability? (use case: enter
a Japanese string and get the original English string back)
3. I was earlier implementing this in C# + MSSQL but am at my wits' end
with the O/R mappings between C# and MSSQL. The other option is to go
with PHP (I am very comfortable with it) + MongoDB. Any suggestions? I
wonder if MongoDB will work with C#?

Some of these questions might seem very basic to you guys but please
bear with me. :)
My main concern is good unicode string handling and support. Thanks!

- Ananth

Kunthar

May 14, 2009, 7:12:54 AM

to mongod...@googlegroups.com

Read da sweet docs :)
http://www.mongodb.org/display/DOCS/

Eliot

May 14, 2009, 7:13:29 AM

to mongod...@googlegroups.com

Hi,

> 1. How good is the unicode support? (ability to store Japanese,
> Korean, Chinese characters without corrupting data)

Mongo is completely Unicode safe, so you can store any Unicode string
just fine. We basically just treat it as bytes, so we always give you
back exactly what you gave us. No manipulations or anything. The drivers are
responsible for making sure they give us Unicode.
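
To make the "bytes in, same bytes out" point concrete, here is a minimal sketch of the round trip a driver would have to preserve. It runs under Node.js without any MongoDB driver; the sample strings and the roundTrip helper are my own illustration, not part of any driver API.

```javascript
// Sketch only: encodes text to UTF-8 bytes and decodes it again,
// which is the round trip a byte-preserving store must not break.
// The sample strings and roundTrip() are hypothetical, for illustration.
const samples = ["日本語", "한국어", "简体中文", "Hello"];

function roundTrip(text) {
  const bytes = new TextEncoder().encode(text);    // string -> UTF-8 bytes
  return new TextDecoder("utf-8").decode(bytes);   // UTF-8 bytes -> string
}

// True only if every sample survives the round trip unchanged.
const allIntact = samples.every((s) => roundTrip(s) === s);
console.log(allIntact); // prints true
```

If the bytes are stored and returned untouched, the decoded string is identical to the original, whatever the script.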

> 2. Does it provide unicode string search capability? (use case: enter
> a japanese string and get the original english string back)

If you mean matching, then yes. It's not really search, just lookup. I
think that's what you mean though, i.e. just finding an exact match
for a Japanese string.

We also have Unicode regex search support via PCRE.

> 3. I was earlier implementing this in C# + MSSQL but am at my wits end
> with the OR mappings between C# and MSSQL. The other option is to go
> with PHP (am very comfortable with it) + MongoDB? Any suggestions? I
> wonder if MongoDB will work with C#?

The PHP driver should work for you. I'm sure it would work in C#,
but there is no native C# driver, nor have we tested it. There is a
C++ driver that might work, but I don't know enough about C# to be sure. We
would love to have a community C# driver; it just hasn't happened yet.

> Some of these questions might seem very basic to you guys but please
> bear with me. :)
> My main concern is good unicode string handling and support. Thanks!

No problem, hopefully it'll work well for you - I think it will.

-Eliot

ananth

May 14, 2009, 7:43:21 AM

to mongodb-user

Hi Eliot,

Thanks a bunch, yes that is what I meant about Unicode support!
I think the Unicode regex thingy would work for me for any advanced
search cases.

Kunthar, I had tried searching for Unicode support in the Docs but
didn't get anything so asked in the forum. :)

Me be happy, will try out MongoDB. :D

- Ananth

Jim Mulholland

May 14, 2009, 9:39:35 AM

to mongodb-user

I can vouch for international searching with Mongo.

Here is a Twitter aggregator we created for RailsConf based on our
Floxee platform which uses MongoDB as a back end. Notice the search
expression is 2 different characters separated with a | which is a
regular expression "OR". I have no idea what language this is,
though. ;-)

http://railsconf.floxee.com/tweetstream?q=%E4%BA%8B%7C%E3%81%97
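
For anyone curious what that query string contains, here is a small sketch that decodes it and applies the same alternation in plain JavaScript. It decodes to 事|し, a kanji and a hiragana character, so the language is Japanese. The tweets array is made up for illustration, and this uses JavaScript's regex engine in memory rather than MongoDB's PCRE, so treat it as an approximation of the server-side behavior.

```javascript
// Sketch only: decode the percent-encoded query from the URL above and
// use it as a regular expression with "|" acting as OR.
const encoded = "%E4%BA%8B%7C%E3%81%97";
const pattern = decodeURIComponent(encoded);   // "事|し"
const regex = new RegExp(pattern);

// Hypothetical sample data, not real tweets from the aggregator.
const tweets = ["仕事が終わった", "おいしい", "hello world"];
const matches = tweets.filter((t) => regex.test(t));

console.log(pattern);        // prints 事|し
console.log(matches.length); // prints 2
```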

Eliot

May 14, 2009, 9:55:16 AM

to mongod...@googlegroups.com

Just a quick clarification:
Mongo stores strings as UTF-8 specifically, not as arbitrary Unicode encodings.

The drivers all talk UTF-8 to the database.

Ananth Deodhar

May 15, 2009, 12:01:12 AM

to mongod...@googlegroups.com

Hmm, I will check that out. We usually store East Asian languages (Japanese, Korean, Simplified Chinese) as UTF-16, but even UTF-8 should do fine. I'll just use some dummy values to check that out.

Jim, that certainly looks good. That puts most of my doubts to rest. :)
Thanks!

- Ananth

dwight_10gen

May 15, 2009, 3:45:08 PM

to mongodb-user

It's using UTF-8 so that strings are packed efficiently for common
characters; obviously there are some pros and cons vs. UTF-16. Give
it a try and let us know if you have issues, the goal is that the
database works well globally.
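
As a rough illustration of that trade-off, the sketch below compares encoded sizes under Node.js. The sample words and the sizes helper are assumptions of mine, chosen only to show the pattern: ASCII text is half the size in UTF-8, while typical CJK characters take 3 bytes in UTF-8 versus 2 in UTF-16.

```javascript
// Sketch only: byte counts for the same text in UTF-8 and UTF-16.
// sizes() and the sample words are hypothetical, for illustration.
function sizes(text) {
  return {
    utf8: Buffer.byteLength(text, "utf8"),
    utf16: Buffer.byteLength(text, "utf16le"),
  };
}

console.log(sizes("translation")); // prints { utf8: 11, utf16: 22 }
console.log(sizes("翻訳"));         // prints { utf8: 6, utf16: 4 }
console.log(sizes("번역"));         // prints { utf8: 6, utf16: 4 }
```

For a collection that mixes 60,000 English strings with East Asian translations, which encoding is smaller overall depends on the mix.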

P.S. Short term, it is possible to stick UTF-16 (or other Unicode encodings) in the BSON
BinData type in MongoDB objects. BinData is queryable and sortable
(albeit the sort order is a bit naive, memcmp() order) -- so there are
some possibilities there if needed short term, especially if one wrote
a couple of little helper functions that sat atop the existing driver.
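
A minimal sketch of what such helpers and the memcmp() ordering might look like, using plain Node.js Buffers. The function names are hypothetical, and a real driver would need the bytes wrapped in its own BinData type, which is not shown here. Buffer.compare does a byte-by-byte comparison, so it stands in for memcmp() order.

```javascript
// Sketch only: hypothetical helpers for storing text as UTF-16 bytes.
// Wrapping the bytes in a driver's BinData type is left out on purpose.
function utf16le(text) {
  return Buffer.from(text, "utf16le");           // little-endian bytes
}
function utf16be(text) {
  return Buffer.from(text, "utf16le").swap16();  // big-endian bytes
}
function decodeUtf16be(bytes) {
  return Buffer.from(bytes).swap16().toString("utf16le"); // copy, then swap back
}

// Sort words by the raw bytes of their encoding (memcmp-style order).
function sortByBytes(words, encode) {
  return [...words].sort((x, y) => Buffer.compare(encode(x), encode(y)));
}

const words = ["b", "\u0100", "a"]; // "\u0100" is Ā, above "a" and "b" in code point order
console.log(sortByBytes(words, utf16le)); // prints [ 'Ā', 'a', 'b' ]
console.log(sortByBytes(words, utf16be)); // prints [ 'a', 'b', 'Ā' ]
```

Byte order matters here: with little-endian UTF-16 the low byte is compared first, so the naive order no longer follows code unit order, whereas big-endian bytes keep it.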

Kavita Moholkar

Sep 29, 2016, 2:09:31 AM

to mongodb-user, ananth...@gmail.com

Hello Sir,

For my ongoing research I need to store an Indian language (i.e. Marathi) in MongoDB. How can I do it?

Nishant Bhardwaj

Oct 10, 2016, 1:35:07 AM

to mongodb-user, ananth...@gmail.com

Hello Kavita,

In MongoDB, strings are stored as UTF-8 in BSON. As a result, you can store and retrieve Indic language characters.

For example, I tested CRUD operations and the $regex operator on documents with Marathi words on MongoDB v2.6 and above.

> db.version()
2.6.11
> 
> 
> db.coll.insert({ word : "लवकर" })
WriteResult({ "nInserted" : 1 })
> db.coll.insert({ word : "निजे" })
WriteResult({ "nInserted" : 1 })
> db.coll.insert({ word : "लवकर" })
WriteResult({ "nInserted" : 1 })
> db.coll.insert({ word : "उठे" })
WriteResult({ "nInserted" : 1 })
> db.coll.insert({ word : "तया" })
WriteResult({ "nInserted" : 1 })
> db.coll.insert({ word : "ज्ञान"  })
WriteResult({ "nInserted" : 1 })
> db.coll.insert({ word : "सुख" })
WriteResult({ "nInserted" : 1 })
> db.coll.insert({ word : "समृद्धी" })
WriteResult({ "nInserted" : 1 })
> db.coll.insert({ word : "भेटे" })
WriteResult({ "nInserted" : 1 })
>

The find function also works fine. The query below returns two documents, as expected.

> db.coll.find({ word : "लवकर"})
{ "_id" : ObjectId("57f79060ac150ccd922b290e"), "word" : "लवकर" }
{ "_id" : ObjectId("57f79060ac150ccd922b2910"), "word" : "लवकर" }

Document removal also works in a similar manner:

> db.coll.count()
9
> db.coll.remove({word : "लवकर"})
WriteResult({ "nRemoved" : 2 })
> db.coll.count()
7
>

You can also use the $regex operator to search with regular expressions, as below:

> db.coll.find({ word : { $regex : /स/ }})
{ "_id" : ObjectId("57f79060ac150ccd922b2914"), "word" : "सुख" }
{ "_id" : ObjectId("57f79060ac150ccd922b2915"), "word" : "समृद्धी" }
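
To see why only those two documents come back, here is a client-side sketch of the same pattern applied in plain JavaScript to the seven words left in the collection. It does not touch MongoDB and uses JavaScript's regex engine rather than PCRE, so it only approximates what the server does: an unanchored pattern matches any word containing the character स.

```javascript
// Sketch only: apply the /स/ pattern in memory to the remaining words.
// This mimics, but is not, the server-side $regex query above.
const words = ["निजे", "उठे", "तया", "ज्ञान", "सुख", "समृद्धी", "भेटे"];
const matched = words.filter((w) => /स/.test(w));

console.log(matched.length); // prints 2
console.log(matched);
```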

MongoDB v3.4 will have collation and case-insensitive indexes that support language-specific rules for string comparison. I can see Hindi in the list of supported languages for this.