Schema design in MongoDB: supporting multiple languages

Rudolf Bargholz

unread,

May 6, 2011, 3:01:58 AM5/6/11

to mongod...@googlegroups.com

Hi,

I am evaluating and reading up on MongoDB. It seems as if MongoDB advises to store all data regarding an object in one record. Coming from Switzerland, we have to support at least French, German, Italian and English in our application, but I would like to support many more languages. The general question I have is what is the advisable route for general schema design of documents in order to store and retrieve multiple language information.

I could have a simplified country structure as in the following

{
    country:{
        code2:'CH',
        nrlang:2,
        countryname:[
            {
                name:'Schweiz',
                lang:'ch_de'
            },
            {
                name:'Switzerland',
                lang:'en_us',
                default:1
            }
        ]
    }
}

In this case I would be able to find all documents that have language dependent information, i.e. "nrlang >0", and find all documents that have not yet been translated, i.e. if I previously had two languages and now have four languages then I could search for all documents "nrlang<4" in order to find those documents that need to be translated. If my app does not find a translation for my current language, then I can search for a default and use this for the display string.

The problem I see here is, this would bloat the document the more languages I add to a document. I the case above this would not be that much of a problem, but for other documents with lots of translatable information this might be an issue.

Another option would be to store all translatable information in a separate document and reference this document in the structure above. But this would require two calls to MongoDB each time I want to display a translatable string in my app, one for the code, and one to the translation document.

Another option I can think of is that I have one document per language, e.g.

{
country:{
code2:'CH',

name:'Schweiz',
lang:'ch_de'

}
}

{
country:{
code2:'CH',

name:'Switzerland',
lang:'en_us',

default:1

}
}

Here I can foresee problems on the client side fetching all unique countries if not all countries have been translated yet. Also, it makes creating a unique contraint on the code2 difficult in order to prevent inadvertent duplication of country codes.

Are there any developers out there using MongoDB that have had to solve a similar issue and would be prepared to share some thoughts on this issue?

What general schema design do you use when working with multiple languages?

How do you find out which documents have not yet been translated?

How do you handle many languages on the client side?

As a side issue: how do you solve carriagereturns/linefeeds in strings?

Probably the answer will be "it depends" :-) but I would really appreciate a MongoDB view here, with me coming from a relational DB background.

Regards

Rudolf Bargholz

Ken Egozi

unread,

May 6, 2011, 8:18:45 AM5/6/11

to mongod...@googlegroups.com

my 0.02CHF

1. instead of an array of {lang, name} tuples, you can have an embedded document with lang as key, and name as value. set the default in the parent doc.

e.g.:

country:{
code2:'CH',
nrlang:2,

default:'us_en',
countryname:{
'ch_de': 'Schweiz',

'us_en': 'Switzerland'

}

As for taking large translateable strings into separate docs - that makes sense actually.

small stuff (names etc.) are things you do read in batches, so embedding them makes perfect sense.

Long strings (product description etc.) usually are displayed to a user one at a time anyway, so the overhead of extra single document to load would probably be negligible.

Ken Egozi.
http://www.kenegozi.com/blog
http://www.delver.com
http://www.musicglue.com
http://www.castleproject.org
http://www.idcc.co.il - הכנס הקהילתי הראשון למפתחי דוטנט - בואו בהמוניכם

--
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To post to this group, send email to mongod...@googlegroups.com.
To unsubscribe from this group, send email to mongodb-user...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.

Rudolf Bargholz

unread,

May 6, 2011, 10:48:45 AM5/6/11

to mongod...@googlegroups.com

Hi Ken,

Thanks for the input. I had found a similar example as you post in the mean time, having the language codes as "column identifiers", but I was still wondering how to then find out which language was missing. Still trying to work out how this can be achieved in Scala.

Also, it probably makes more sense to store an MD5 checksum (langchk="...") instead of the nr of languages, so that I can issue a search for all documents that do not have the current correct langchk MD5MD5, in order to find those documents whose language records need to be updated. Create a list of the language codes available, sort these in an ascending list, then make a string out of these values and make an MD5. If I follow this route for all translatable objects, then it becomes simple to find those documents that need to be updated. When adding a new language to an application, an automated update process would copy the translation value of the default language or a predefined source language to all documents whose langchk does not match, update the langchk and set "correctionpending". This "correctionpending" could then be used to allow an application to export all texts that need a pending correction, or use this to indicate to a user that corrections to records might have to be made. A version number, incremented for each update, could be used to ensure a document is not inadvertantly updated by two users at the same time.

country:{
code2:'CH',
langchk:'76be951cd73ec1bde005399a6c7907ba', <- based on the string 'de_ch;en_uk;en_us'

version=2 <- incremented the version number of the document.

default:'us_en',
countryname:{
'ch_de': 'Schweiz',

'us_en': 'Switzerland',

'us_uk': 'Switzerland' <- this would be the newly added translation

},

correctionpending:{
'us_uk': 'Switzerland'

}

Anyway, thanks for your input. I am slowly getting an idea how to work saftely with documents.

Rudolf Bargholz

Tony Hannan

unread,

May 6, 2011, 2:05:00 PM5/6/11

to mongod...@googlegroups.com

I think you guys have the right approach. The only thing I would add is isolate the concept of an "international string" to its own type / sub-document. For example:

country: {code2: 'CH', name: IString}

where IString = {

langchk: ...

version: 2

default: 'us_en'

translations: {

ch_de: 'Schweiz',

us_en: 'Switzerland'

},

pending: {...}

}

Cheers,

Tony

Rudolf Bargholz

unread,

May 7, 2011, 2:40:07 AM5/7/11

to mongod...@googlegroups.com

Hi Tony,

Thanks for your reply.

If my requirement would be to build a list of country codes and the names for a specific language, to display the list in a dropdown, splitting up the translatable strings into a separate collection would require the client side code to fetch all counties

db.country.find()

and then iterate through the country list, fetching each translation document and determining the correct translation string.

Is this premise correct, and normal for working with MongoDB, or is there a solution to this type of general case that I am not aware of?

Regards

Rudolf

Tony Hannan

unread,

May 9, 2011, 5:59:50 AM5/9/11

to mongod...@googlegroups.com

Hi Rudolf,

1. You could fetch each associated document seperately, eg: istring.find({_id: X}).

2. You could fetch all associated documents at once, eg: istring.find({_id: {$in: Xs}}), where Xs is an array of ids.

3. Or you could embed the istring in each country like I showed previously.

In the second case, you fetch all your countries, extract the istring id from each one into an array, do the $in query described above, then join your countries with their istrings. This is called a "client-side join".

Cheers,

Tony

--

Reply all

Reply to author

Forward