Unicode collations / Sorting / Indexing: How to implement?

858 views
Skip to first unread message

Florian Sesser

unread,
Nov 17, 2011, 7:31:31 PM11/17/11
to mongo...@googlegroups.com
Hi all!

For a now-international web project, we [1] need mightier sorting than
what MongoDB has to offer at the moment.

We'd like to bend our open source muscles and see whether we can add
something to MongoDB that can make it upstream.

Related ticket:
https://jira.mongodb.org/browse/SERVER-1920 (Sort by collation)

Semi-related:
https://jira.mongodb.org/browse/SERVER-90 (case insensitive index)

How would you (10gen or other readers of this list) implement this?
What's important to you in such a patch so it can eventually make it
upstream?

I have not yet looked into the code, but in my naïvité [2] I'd think we
have to patch ICU [3] or sth. similar into the place where MongoDB sorts
and the place where it builds indices. For simplicity, I propose to set
the collation as a property of a collection (and thus NOT allow the
building of different language indices for now). For us, in the very
beginning and as a proof of concept, even basic support for a fixed
generic UTF8 collation would be OK (such as MySQL's "utf8_general_ci").
PostgreSQL uses the OS's locale [4]. That also may be an option for MongoDB?

Looking forward to your insights!

Thanks,

Florian


[1] http://www.it-agenten.com/
[2] Gotta love Unicode :)
[3] http://site.icu-project.org/
[4] http://www.postgresql.org/docs/9.1//interactive/locale.html

Eliot Horowitz

unread,
Nov 18, 2011, 1:36:37 AM11/18/11
to mongo...@googlegroups.com
I think we would absolutely need document level collation in a collection.
That's actually the most common case.

> --
> You received this message because you are subscribed to the Google Groups "mongodb-dev" group.
> To post to this group, send email to mongo...@googlegroups.com.
> To unsubscribe from this group, send email to mongodb-dev...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/mongodb-dev?hl=en.
>
>

Florian Sesser

unread,
Nov 18, 2011, 5:04:10 AM11/18/11
to mongo...@googlegroups.com
On Fri, November 18, 2011 07:36, Eliot Horowitz wrote:
> I think we would absolutely need document level collation in a collection.
> That's actually the most common case.

OK.

Do you see any major brain breakers on the way? (other than doing that
sort of thing /right/ never is easy in the first place)

Flo

Nat

unread,
Nov 18, 2011, 5:09:54 AM11/18/11
to mongo...@googlegroups.com
To do that right, it probably would needs to introduce an additional dependency to unicode library like icu and file/collection change to database. It won't be that simple.

Florian Sesser

unread,
Nov 18, 2011, 8:24:42 AM11/18/11
to mongo...@googlegroups.com
On Fri, November 18, 2011 11:09, Nat wrote:
> To do that right, it probably would needs to introduce an additional
> dependency to unicode library like icu

Obviously.

Also I don't know about MongoDB policies regarding #ifdef cluttering the
code. In my experimental (POC) branch, I would leave out the option to
make this configurable at compile time. Can be added later if you like our
patch.

> and file/collection change to database.

Yes. Do you have any concrete hints for us?

We do not intend to change storage. Our main goal is sorting of strings.

> It won't be that simple.

Yes. Looking into change logs of other database systems quickly shows that
many a great men made many not-so-stupid mistakes when tackling this
problem.

Still, MongoDB will need this feature sooner or later, and we need it
sooner. I would be deeply grateful if you could point out major obstacles
you see from the top of your head.

Florian


Florian Sesser

unread,
Nov 21, 2011, 3:37:59 AM11/21/11
to mongo...@googlegroups.com
On 11/18/2011 07:36 AM, Eliot Horowitz wrote:
> I think we would absolutely need document level collation in a collection.
> That's actually the most common case.

Hmm. Giving the thing some further thought, I'd of course like this to
not be harder than it has to be.

I do understand why it makes sense to specify collation column-wise in
SQL-DBMSs, but not why I should specify a collation for an entire
document. Specifying it per key (in each document) seems the right
granularity, but I do not want to change the data structures of MongoDB
(and introduce some inheritance scheme etc).

I just want Mongo to sort international strings.

Maybe giving the collation as a parameter to sort() and to ensureIndex()
is the right way after all?
Another advantage of this would be explicitness. I'm not keen on
implementing all the error handling for "you're trying to sort two keys
where one of the 1M records has a unique (different) collation".

Again, thanks for your thoughts!

Florian

Eliot Horowitz

unread,
Nov 23, 2011, 3:07:52 AM11/23/11
to mongo...@googlegroups.com
Giving it to sort is problematic because then you can't index effectively.

Florian Sesser

unread,
Nov 23, 2011, 5:00:45 AM11/23/11
to mongo...@googlegroups.com, el...@10gen.com
On 11/23/2011 09:07 AM, Eliot Horowitz wrote:
> Giving it to sort is problematic because then you can't index effectively.

One would also have to give it to ensureIndex() (or whatever you do to
trigger index creation), and build one index per collation.

Or is there a reason why this will not work? Will this interfere with
find()s etc. using that (collated) index?

Thanks,

Florian

Eliot Horowitz

unread,
Nov 23, 2011, 2:00:48 PM11/23/11
to mongo...@googlegroups.com
If there are a number of collations, this would be very very inefficient.

Florian Sesser

unread,
Nov 24, 2011, 2:48:56 PM11/24/11
to mongo...@googlegroups.com, Eliot Horowitz
On Wed, November 23, 2011 20:00, Eliot Horowitz wrote:
> If there are a number of collations, this would be very very inefficient.

Sorry if I was unclear. Of course you would have to specify the
collation(s) you are interested in, and build the indices only for these.

For our use case, "one index / one collation" would be fine. (Meaning: I
don't see why we should need differently collated indices on the same data
with the same order.)

You:


>>> Giving it to sort is problematic because then you can't index
>>> effectively.

??

Me:


>> Or is there a reason why this will not work? Will this interfere with
>> find()s etc. using that (collated) index?

Why would indexing be more complicated?

Thank you a lot for your time, Eliot!

Eliot Horowitz

unread,
Nov 25, 2011, 10:15:31 AM11/25/11
to mongo...@googlegroups.com
I think for most use cases, you need many collations on a single collection.
So I think any solution has to work for that.
Ideally be specifying the collation per document.

Graham Barr

unread,
Nov 25, 2011, 10:31:56 AM11/25/11
to mongo...@googlegroups.com
On Fri, Nov 25, 2011 at 9:15 AM, Eliot Horowitz <el...@10gen.com> wrote:
> I think for most use cases, you need many collations on a single collection.
> So I think any solution has to work for that.
> Ideally be specifying the collation per document.

Restricting a document to be all in one collation seems restrictive
and having to specify a collation per field per document seems way
overkill.

Also, when building a btree index you really want all the data to be
of the same collation, at least for a given field. Although a
collection can have documents of different structures, if I add an
index on field foo, then there is probably some commonality between
the foo fields of all documents I expect to place into that
collection.

I would suggest to add options to ensureIndex to specify the collation
the index will use for each column in that index.

If a user wants the ability to search using different collations, then
they can simply create multiple indexes and pass a hint of which to
use.

Graham.

Florian Sesser

unread,
Nov 25, 2011, 3:39:31 PM11/25/11
to mongo...@googlegroups.com, gmb...@gmail.com, el...@10gen.com
Thanks Graham! This is exactly what I had in mind.

This way should give a relatively simple user interface. Also, no change
to Mongo's data structures should be necessary.

Anyone anymore thoughts on this? Is this a good way to do it? Not? Why
not? Any problems you already foresee? Tips?

Thanks,

Florian

Eliot Horowitz

unread,
Nov 25, 2011, 3:41:23 PM11/25/11
to mongo...@googlegroups.com
I really don't think that model is going to work.

Think of an internationalized app where there are lots of collations.

You wouldn't want 20 indexes that are basically the same.

Florian Sesser

unread,
Nov 25, 2011, 4:03:31 PM11/25/11
to mongo...@googlegroups.com, gmb...@gmail.com, el...@10gen.com
On 25.11.2011 21:41, Eliot Horowitz wrote:
> I really don't think that model is going to work.
> Think of an internationalized app where there are lots of collations.
> You wouldn't want 20 indexes that are basically the same.

I do not see why I would want differently collated indices on the same
field.

Swedish collation for German text makes no sense at all, for example.

Please explain!

Thanks,

Florian

Eliot Horowitz

unread,
Nov 25, 2011, 4:46:55 PM11/25/11
to mongo...@googlegroups.com
If you have a collection that is used by everyone globally, then any
given field will have text from different languages.

The simplest case, a users collection.

You'll have different first names from the globe, and you may want to
sort differently.

Graham Barr

unread,
Nov 26, 2011, 9:18:43 AM11/26/11
to mongo...@googlegroups.com
On Fri, Nov 25, 2011 at 3:46 PM, Eliot Horowitz <el...@10gen.com> wrote:
> If you have a collection that is used by everyone globally, then any
> given field will have text from different languages.
>
> The simplest case, a users collection.
>
> You'll have different first names from the globe, and you may want to
> sort differently.

I think for any solution you can come up with scenarios that it would
not work for.

The question is what do people think would be the most common use and
the most useful. Collation is a difficult thing and one solution is
not going to solve all problems.

IMO, the most frequent scenario would be a field, all of the same collation.

However, this is only looking at the case where the sort has a
relevant index for the specified key pattern. If there is no index,
then the require collation should be specified in the query sort
criteria.


Graham.

Florian Sesser

unread,
Nov 28, 2011, 5:12:58 PM11/28/11
to mongo...@googlegroups.com
Full ACK to what Graham said.

Eliot:


> The simplest case, a users collection.
> You'll have different first names from the globe, and you may want to
> sort differently.

I gave this a little more thought. In fact, I think it speaks FOR my
proposal.

In the simple case (one index, one collation) I would have to choose
something like utf8_generic_ci (cf. MySQL) and get something a lot
better than the current behavior of MongoDB. It would work well for
German and French and probably also Arabic languages, but not Swedish.
This would suffice for us at the moment.

AFAIK this would resemble the behavior of current RDBMSs (IIANM MySQL
and Postgres), where the finest level to define collations is per-column.

Also note that defining a collation per-document would not help here,
because the DB cannot know how to collate two documents with different
collation setting. Postgres, for example, raises an error in this case [1].

The more elaborate case (multiple indices with different collations)
would of course cost more processing power per insert (for updating
multiple indices), but give you the possibility of quick sorting. For
many use cases, like your users collection, which is mostly read and
only seldom inserted to, this might work out well.

Of course, if I talk shit, please stop me. I hate DBs.

Best,

Florian

[1] http://www.postgresql.org/docs/9.1/static/collation.html

Artem Khodyush

unread,
Nov 28, 2011, 9:42:35 PM11/28/11
to mongo...@googlegroups.com
Hi,

On Mon, Nov 28, 2011 at 2:12 PM, Florian Sesser <fl...@cosetrain.com> wrote:

IAFAIK this would resemble the behavior of current RDBMSs (IIANM MySQL

and Postgres), where the finest level to define collations is per-column.
 
Also note that defining a collation per-document would not help here,
because the DB cannot know how to collate two documents with different
collation setting. Postgres, for example, raises an error in this case [1].

Precisely, 'cannot know to collate' means that almost 
any operation involving two strings of different collations 
will give you an error. Strictly speaking, there is no well-defined 
way to order collection of documents of mixed collations. 

Some databases, MS SQL for example, give you a way 
out of this - you can use 'collation casts', that is, specify 
collation to use for any string expression in-place,  pretty
much anywhere a string expression can appear. 
One obvious place is in ORDER BY clause - it won't be able 
to use index, but still it covers the case for sorting the same data 
using different collations at different times - supposedly Swedish 
users won't care much about the order of non-Swedish names.
 

 
The more elaborate case (multiple indices with different collations)
would of course cost more processing power per insert (for updating
multiple indices), but give you the possibility of quick sorting. For
many use cases, like your users collection, which is mostly read and
only seldom inserted to, this might work out well.


This could be generalized - if multiple indices for the same data are 
supported, why limit them to collations? This would be so-called
'virtual', or 'expression' indices, where you specify an expression
at index creation time which is evaluated over changed or inserted
documents and gives a key to use for the index. This was implemented
in some pre-SQL databases (maybe in some RDBMS too, but I'm 
not sure). I can imagine use cases where this would come in handy.

Just my 2c,
Artem

 
Reply all
Reply to author
Forward
0 new messages