> 1. How good is the unicode support? (ability to store Japanese,
> Korean, Chinese characters without corrupting data)
Mongo is completely Unicode-safe, so you can store any Unicode string
just fine. We basically treat strings as bytes, so we always give you
back exactly what you gave us, with no manipulation. The drivers are
responsible for making sure they send us valid Unicode.
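
To make the "bytes in, same bytes out" point concrete, here is a minimal sketch in plain JavaScript. It is not MongoDB's internal code path; it only assumes the driver encodes strings as UTF-8 before sending them, and the sample strings and the `roundTrip` helper are made up for illustration.

```javascript
// Illustrative only: a UTF-8 byte round trip, not MongoDB internals.
// The sample strings below are hypothetical test data.
const samples = ["日本語のテキスト", "한국어 텍스트", "中文文本"];

function roundTrip(text) {
  // Encode the string to UTF-8 bytes, roughly what a driver does on write.
  const bytes = new TextEncoder().encode(text);
  // Decode the untouched bytes again, roughly what a driver does on read.
  return new TextDecoder("utf-8").decode(bytes);
}

for (const text of samples) {
  console.log(text, roundTrip(text) === text ? "unchanged" : "CORRUPTED");
}
```

As long as nothing in between rewrites the bytes, the decoded string is identical to the original.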
> 2. Does it provide unicode string search capability? (use case: enter
> a japanese string and get the original english string back)
If you mean matching, then yes. It's not really search, just lookup. I
think that's what you mean though, i.e. finding an exact match for a
Japanese string.
We also have Unicode regex search support via PCRE.
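
As a rough sketch of the exact-match lookup described above, the snippet below uses an in-memory array as a stand-in for a collection. The documents, the field names `en` and `ja`, and the `lookupEnglish` helper are all assumptions made for illustration; against a real collection the same idea would be an equality query on the Japanese field.

```javascript
// Hypothetical documents; the field names `en` and `ja` are assumptions.
const docs = [
  { en: "Good morning", ja: "おはようございます" },
  { en: "Thank you", ja: "ありがとうございます" },
  { en: "Good evening", ja: "こんばんは" },
];

// Exact lookup: return the English string stored with a Japanese string,
// or null when no document has exactly that value.
function lookupEnglish(ja) {
  const hit = docs.find((doc) => doc.ja === ja);
  return hit ? hit.en : null;
}

console.log(lookupEnglish("こんばんは")); // Good evening
console.log(lookupEnglish("さようなら")); // null
```

This is a lookup rather than a search: the stored value has to be equal to the input, character for character.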
> 3. I was earlier implementing this in C# + MSSQL but am at my wits end
> with the OR mappings between C# and MSSQL. The other option is to go
> with PHP (am very comfortable with it) + MongoDB? Any suggestions? I
> wonder if MongoDB will work with C#?
The PHP driver should work for you. I'm sure MongoDB would work with
C#, but there is no native C# driver, nor have we tested it. There is a
C++ driver that might work, but I don't know enough about C# to be
sure. We would love to have a community C# driver; it just hasn't
happened yet.
> Some of these questions might seem very basic to you guys but please
> bear with me. :)
> My main concern is good unicode string handling and support. Thanks!
No problem, hopefully it'll work well for you - I think it will.
-Eliot
Hello Kavita,
In MongoDB, strings are stored as UTF-8 inside BSON documents. As a result, you can store and retrieve Indic language characters.
For example, I tested CRUD operations and the $regex operator on documents containing Marathi words in MongoDB v2.6.11:
> db.version()
2.6.11
>
>
> db.coll.insert({ word : "लवकर" })
WriteResult({ "nInserted" : 1 })
> db.coll.insert({ word : "निजे" })
WriteResult({ "nInserted" : 1 })
> db.coll.insert({ word : "लवकर" })
WriteResult({ "nInserted" : 1 })
> db.coll.insert({ word : "उठे" })
WriteResult({ "nInserted" : 1 })
> db.coll.insert({ word : "तया" })
WriteResult({ "nInserted" : 1 })
> db.coll.insert({ word : "ज्ञान" })
WriteResult({ "nInserted" : 1 })
> db.coll.insert({ word : "सुख" })
WriteResult({ "nInserted" : 1 })
> db.coll.insert({ word : "समृद्धी" })
WriteResult({ "nInserted" : 1 })
> db.coll.insert({ word : "भेटे" })
WriteResult({ "nInserted" : 1 })
>
The find function also works fine. The query below returns two documents, as expected:
> db.coll.find({ word : "लवकर"})
{ "_id" : ObjectId("57f79060ac150ccd922b290e"), "word" : "लवकर" }
{ "_id" : ObjectId("57f79060ac150ccd922b2910"), "word" : "लवकर" }
Document removal works in a similar manner:
> db.coll.count()
9
> db.coll.remove({word : "लवकर"})
WriteResult({ "nRemoved" : 2 })
> db.coll.count()
7
>
You can also use the $regex operator to search with regular expressions, like below:
> db.coll.find({ word : { $regex : /स/ }})
{ "_id" : ObjectId("57f79060ac150ccd922b2914"), "word" : "सुख" }
{ "_id" : ObjectId("57f79060ac150ccd922b2915"), "word" : "समृद्धी" }
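
To show why exactly those two documents match, here is a small sketch that replays the same pattern outside the database, over the seven words left after the removal. It uses JavaScript's own regex engine rather than the one inside MongoDB, so treat it as an illustration of the pattern, not of the server's behaviour.

```javascript
// Words left in the collection after the two "लवकर" documents were removed.
const words = ["निजे", "उठे", "तया", "ज्ञान", "सुख", "समृद्धी", "भेटे"];

// Unanchored pattern: matches any word that contains the letter स anywhere.
const pattern = /स/;
const matches = words.filter((word) => pattern.test(word));

console.log(matches.length); // 2
console.log(matches);
```

Because the pattern is unanchored, it matches the letter at any position in the word; an anchored pattern such as /^स/ would restrict it to words that start with that letter.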
MongoDB v3.4 will have collation and case-insensitive indexes, which support language-specific rules for string comparison. I can see Hindi in the list of supported languages for this.
Regards,
Nishant