For a now-international web project, we [1] need mightier sorting than
what MongoDB has to offer at the moment.
We'd like to bend our open source muscles and see whether we can add
something to MongoDB that can make it upstream.
Related ticket:
https://jira.mongodb.org/browse/SERVER-1920 (Sort by collation)
Semi-related:
https://jira.mongodb.org/browse/SERVER-90 (case insensitive index)
How would you (10gen or other readers of this list) implement this?
What's important to you in such a patch so it can eventually make it
upstream?
I have not yet looked into the code, but in my naïvité [2] I'd think we
have to patch ICU [3] or sth. similar into the place where MongoDB sorts
and the place where it builds indices. For simplicity, I propose to set
the collation as a property of a collection (and thus NOT allow the
building of different language indices for now). For us, in the very
beginning and as a proof of concept, even basic support for a fixed
generic UTF8 collation would be OK (such as MySQL's "utf8_general_ci").
PostgreSQL uses the OS's locale [4]. That also may be an option for MongoDB?
Looking forward to your insights!
Thanks,
Florian
[1] http://www.it-agenten.com/
[2] Gotta love Unicode :)
[3] http://site.icu-project.org/
[4] http://www.postgresql.org/docs/9.1//interactive/locale.html
> --
> You received this message because you are subscribed to the Google Groups "mongodb-dev" group.
> To post to this group, send email to mongo...@googlegroups.com.
> To unsubscribe from this group, send email to mongodb-dev...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/mongodb-dev?hl=en.
>
>
OK.
Do you see any major brain breakers on the way? (other than doing that
sort of thing /right/ never is easy in the first place)
Flo
Obviously.
Also I don't know about MongoDB policies regarding #ifdef cluttering the
code. In my experimental (POC) branch, I would leave out the option to
make this configurable at compile time. Can be added later if you like our
patch.
> and file/collection change to database.
Yes. Do you have any concrete hints for us?
We do not intend to change storage. Our main goal is sorting of strings.
> It won't be that simple.
Yes. Looking into change logs of other database systems quickly shows that
many a great men made many not-so-stupid mistakes when tackling this
problem.
Still, MongoDB will need this feature sooner or later, and we need it
sooner. I would be deeply grateful if you could point out major obstacles
you see from the top of your head.
Florian
Hmm. Giving the thing some further thought, I'd of course like this to
not be harder than it has to be.
I do understand why it makes sense to specify collation column-wise in
SQL-DBMSs, but not why I should specify a collation for an entire
document. Specifying it per key (in each document) seems the right
granularity, but I do not want to change the data structures of MongoDB
(and introduce some inheritance scheme etc).
I just want Mongo to sort international strings.
Maybe giving the collation as a parameter to sort() and to ensureIndex()
is the right way after all?
Another advantage of this would be explicitness. I'm not keen on
implementing all the error handling for "you're trying to sort two keys
where one of the 1M records has a unique (different) collation".
Again, thanks for your thoughts!
Florian
One would also have to give it to ensureIndex() (or whatever you do to
trigger index creation), and build one index per collation.
Or is there a reason why this will not work? Will this interfere with
find()s etc. using that (collated) index?
Thanks,
Florian
Sorry if I was unclear. Of course you would have to specify the
collation(s) you are interested in, and build the indices only for these.
For our use case, "one index / one collation" would be fine. (Meaning: I
don't see why we should need differently collated indices on the same data
with the same order.)
You:
>>> Giving it to sort is problematic because then you can't index
>>> effectively.
??
Me:
>> Or is there a reason why this will not work? Will this interfere with
>> find()s etc. using that (collated) index?
Why would indexing be more complicated?
Thank you a lot for your time, Eliot!
Restricting a document to be all in one collation seems restrictive
and having to specify a collation per field per document seems way
overkill.
Also, when building a btree index you really want all the data to be
of the same collation, at least for a given field. Although a
collection can have documents of different structures, if I add an
index on field foo, then there is probably some commonality between
the foo fields of all documents I expect to place into that
collection.
I would suggest to add options to ensureIndex to specify the collation
the index will use for each column in that index.
If a user wants the ability to search using different collations, then
they can simply create multiple indexes and pass a hint of which to
use.
Graham.
This way should give a relatively simple user interface. Also, no change
to Mongo's data structures should be necessary.
Anyone anymore thoughts on this? Is this a good way to do it? Not? Why
not? Any problems you already foresee? Tips?
Thanks,
Florian
Think of an internationalized app where there are lots of collations.
You wouldn't want 20 indexes that are basically the same.
I do not see why I would want differently collated indices on the same
field.
Swedish collation for German text makes no sense at all, for example.
Please explain!
Thanks,
Florian
The simplest case, a users collection.
You'll have different first names from the globe, and you may want to
sort differently.
I think for any solution you can come up with scenarios that it would
not work for.
The question is what do people think would be the most common use and
the most useful. Collation is a difficult thing and one solution is
not going to solve all problems.
IMO, the most frequent scenario would be a field, all of the same collation.
However, this is only looking at the case where the sort has a
relevant index for the specified key pattern. If there is no index,
then the require collation should be specified in the query sort
criteria.
Graham.
Eliot:
> The simplest case, a users collection.
> You'll have different first names from the globe, and you may want to
> sort differently.
I gave this a little more thought. In fact, I think it speaks FOR my
proposal.
In the simple case (one index, one collation) I would have to choose
something like utf8_generic_ci (cf. MySQL) and get something a lot
better than the current behavior of MongoDB. It would work well for
German and French and probably also Arabic languages, but not Swedish.
This would suffice for us at the moment.
AFAIK this would resemble the behavior of current RDBMSs (IIANM MySQL
and Postgres), where the finest level to define collations is per-column.
Also note that defining a collation per-document would not help here,
because the DB cannot know how to collate two documents with different
collation setting. Postgres, for example, raises an error in this case [1].
The more elaborate case (multiple indices with different collations)
would of course cost more processing power per insert (for updating
multiple indices), but give you the possibility of quick sorting. For
many use cases, like your users collection, which is mostly read and
only seldom inserted to, this might work out well.
Of course, if I talk shit, please stop me. I hate DBs.
Best,
Florian
[1] http://www.postgresql.org/docs/9.1/static/collation.html
IAFAIK this would resemble the behavior of current RDBMSs (IIANM MySQL
and Postgres), where the finest level to define collations is per-column.
Also note that defining a collation per-document would not help here,
because the DB cannot know how to collate two documents with different
collation setting. Postgres, for example, raises an error in this case [1].
The more elaborate case (multiple indices with different collations)
would of course cost more processing power per insert (for updating
multiple indices), but give you the possibility of quick sorting. For
many use cases, like your users collection, which is mostly read and
only seldom inserted to, this might work out well.