Feature request: Unicode collation algorithm in django

459 views
Skip to first unread message

Ed Hagen

unread,
Feb 1, 2012, 2:27:46 PM2/1/12
to Django developers
Hi,

Let me preface this request by saying that when it comes to django,
I'm an advanced beginner (so this might be a dumb request).

The motivation for my request involved users of a django-based
database of international scholars who wanted their names sorted
"correctly." I explained that different languages sorted characters
differently, and therefore there was no single correct sort order, but
I promised to see if I could easily implement language-specific
orderings. What I found was that django seems to rely on the database
for this feature:

https://docs.djangoproject.com/en/dev/ref/databases/#collation-settings

which (if I've understood things correctly) makes sense for
performance reasons, but makes it more difficult to change things on
the fly, e.g., to provide language-specific ordering.

Using suggestions on this page:

http://stackoverflow.com/questions/1097908/how-do-i-sort-unicode-strings-alphabetically-in-python

I fixed things well-enough for my present purposes, but I thought it
would be useful to abstract this capability away from the database,
with django itself providing some version of the Unicode collation
algorithm:

http://unicode.org/reports/tr10/

This might hook into django's internationalization and localization
features, and/or be accessible at a lower level, e.g., with a keyword
argument to, or variant of, "order_by".

Rationale: any django developer that needs to display sorted lists to
international users could probably benefit from this feature.

Thanks,

Ed.

Łukasz Rekucki

unread,
Feb 1, 2012, 4:56:30 PM2/1/12
to django-d...@googlegroups.com
On 1 February 2012 20:27, Ed Hagen <eha...@gmail.com> wrote:
> Hi,
>
> Let me preface this request by saying that when it comes to django,
> I'm an advanced beginner (so this might be a dumb request).
>
> The motivation for my request involved users of a django-based
> database of international scholars who wanted their names sorted
> "correctly." I explained that different languages sorted characters
> differently, and therefore there was no single correct sort order, but
> I promised to see if I could easily implement language-specific
> orderings. What I found was that django seems to rely on the database
> for this feature:
>
> https://docs.djangoproject.com/en/dev/ref/databases/#collation-settings
>
> which (if I've understood things correctly) makes sense for
> performance reasons, but makes it more difficult to change things on
> the fly, e.g., to provide language-specific ordering.

Performance is the main concern here. Any query with ordering on a
text field would have to fetch all results and sort it on the
application side. It's just terrible.

Weirdly enough, I was looking at this thread lately, trying to explain
to a beginner, why Python doesn't provide an easy way to do this,
which actually works :(. Summary of options:

1) Use PyICU - this would solve a lot of problems (some which Django
already solves by itself). But it's quite a big dependency on an
external package (written in C++, so I guess it won't run on PyPy,
Jython nor App Engine). Django currently has no external dependencies
and that's good :)

2) Use the "locale" module: it will work... if you have all the
possible locales compiled on your OS... and you're not running on
Windows... or using threads. AFAIK, switching locale is also quite
slow.

3) Use some other listed libraries: none of them looks like maintained
by authors.

4) Write UCA ourselves from scratch. This involves including 1.6MB
collation table in Django.

All those solutions of course, still have the problem of needing all
the data on the application side.

> I fixed things well-enough for my present purposes, but I thought it
> would be useful to abstract this capability away from the database,
> with django itself providing some version of the Unicode collation
> algorithm:
>
> http://unicode.org/reports/tr10/
>
> This might hook into django's internationalization and localization
> features, and/or be accessible at a lower level, e.g., with a keyword
> argument to, or variant of, "order_by".

Could you describe your current solution in more detail?

--
Łukasz Rekucki

Ed Hagen

unread,
Feb 1, 2012, 7:15:17 PM2/1/12
to Django developers

Łukasz, Thanks for your comments.

> Could you describe your current solution in more detail?

Well, my use of the term "solution" was perhaps a bit generous. First,
I educated my users why this problem was hard to solve. Second, I
picked a single collation that everyone could live with. Third, I used
the locale module to sort on the application side, which, as you say,
was slow. But because the data don't change much, I was able to cache
the result to achieve acceptable performance.

Even if application-side sorting is slow, nevertheless, in many cases
that will be preferable to users not finding what they are looking for
because the item doesn't show up where it "should." So it's a tradeoff
between two aspects of user experience (speed vs. expected sort
order). In many cases, that tradeoff would favor language-specific
collation.

My two cents.

Ed.

Anssi Kääriäinen

unread,
Feb 1, 2012, 9:34:03 PM2/1/12
to Django developers
On Feb 2, 2:15 am, Ed Hagen <eha...@gmail.com> wrote:
> £ukasz, Thanks for your comments.
The problem is that application side sorting does not scale. So, if
you are depending on application side sorting, there will likely come
a day when you have so much data that it is simply impossible to do
the sorting Python-side. Databases are a lot faster at sorting,
especially with indexes.

Now, my proposed solution would be to have some way of doing:
SELECT name, ...
FROM authors
ORDER BY name collate 'fi';

That some way might be something like
.order_by('name', collate='fi')
or maybe
.collate('fi').order_by('name')
and now collate would be in effect for filters (that is,
name__gte='e'), too.

Making Django's ORM do the above isn't the most trivial thing. And not
all databases support collate clauses.

You can do the above with .extra() even now if your DB happens to
support collations. Default collation for your database might also be
an option for your particular problem. At least in PostgreSQL versions
prior to 9.1 you have one collation for the DB, which you can set only
at CREATE DB time.

Anyways, Django's design decision is to do sorting in the DB. And I
think that is a good decision.

- Anssi

Łukasz Rekucki

unread,
Feb 2, 2012, 3:03:48 AM2/2/12
to django-d...@googlegroups.com
On 2 February 2012 03:34, Anssi Kääriäinen <anssi.ka...@thl.fi> wrote:
>
> Now, my proposed solution would be to have some way of doing:
> SELECT name, ...
> FROM authors
> ORDER BY name collate 'fi';
>
> That some way might be something like
> .order_by('name', collate='fi')
> or maybe
> .collate('fi').order_by('name')
> and now collate would be in effect for filters (that is,
> name__gte='e'), too.
>

The user should probably be able to specifiy the collation only for
the fields he wants, as it most likely uses a different type of index
and is more expensive then a standard ordering, so I like the
.collate(name="fi") option (and a shortcut of .collate("fi") to apply
to all text fields.

>
> Making Django's ORM do the above isn't the most trivial thing. And not
> all databases support collate clauses.

After a quick research, I think they actually do now (at least in
ORDER BY). MySQL[1] and SQLite[2] both have COLLATE, in Oracle we
could use NLSSORT()[3] which is something like locale.strxfrm.

>
> You can do the above with .extra() even now if your DB happens to
> support collations. Default collation for your database might also be
> an option for your particular problem. At least in PostgreSQL versions
> prior to 9.1 you have one collation for the DB, which you can set only
> at CREATE DB time.
>

[1]: At least since 5.0:
http://dev.mysql.com/doc/refman/5.0/en/charset-collate.html
[2]: SQLite doesn't support UCA by default, but lets you define any
collation: http://docs.python.org/library/sqlite3.html#sqlite3.Connection.create_collation
[3]: http://docs.oracle.com/cd/B19306_01/server.102/b14225/ch9sql.htm#i1006311

--
Łukasz Rekucki

Łukasz Rekucki

unread,
Feb 2, 2012, 3:07:42 AM2/2/12
to django-d...@googlegroups.com
2012/2/2 Łukasz Rekucki <lrek...@gmail.com>:

> On 2 February 2012 03:34, Anssi Kääriäinen <anssi.ka...@thl.fi> wrote:
>>
>> Now, my proposed solution would be to have some way of doing:
>> SELECT name, ...
>> FROM authors
>> ORDER BY name collate 'fi';
>>
>> That some way might be something like
>> .order_by('name', collate='fi')
>> or maybe
>> .collate('fi').order_by('name')
>> and now collate would be in effect for filters (that is,
>> name__gte='e'), too.
>>
>
> The user should probably be able to specifiy the collation only for
> the fields he wants, as it most likely uses a different type of index
> and is more expensive then a standard ordering, so I like the
> .collate(name="fi") option (and a shortcut of .collate("fi") to apply
> to all text fields.

Just came to my mind, that we could just mimic the DBs and have a
Collate operator (like Q, F, Count, etc.) + maybe some defaults on the
model:

M.objects.order_by(Collate("name", "uca"))
M.objects.filter(name__gte=Collate('e', "fi"))

--
Łukasz Rekucki

Anssi Kääriäinen

unread,
Feb 2, 2012, 7:00:44 AM2/2/12
to Django developers
On Feb 2, 10:07 am, Łukasz Rekucki <lreku...@gmail.com> wrote:
> Just came to my mind, that we could just mimic the DBs and have a
> Collate operator (like Q, F, Count, etc.) + maybe some defaults on the
> model:
>
> M.objects.order_by(Collate("name", "uca"))
> M.objects.filter(name__gte=Collate('e', "fi"))

I have been thining a lot about allowing annotate and order_by to
accept basically anything that has an .as_sql() method. So that you
could do things like:
qs.annotate(name_upper=RawSQL("upper(%s)",
params=f('name')).order_by('name_upper')
of course, the above example is doable using current Django ORM.
or maybe:
qs.annotate(name_upper=RawSQL("upper(%s)",
params=f('name')).order_by(RawSQL('%s collate "%%s" desc nulls last',
params=(f('name_upper'), 'fi-FI'))

These would be much better than the .extra(), as aliases are relabled
properly and you can "chain" the RawSQL clauses. In the ORM the code
that deals with .extra handling is kinda hacky, and this would
probably make that part of the ORM cleaner.

I think those really are doable. Using objects in cols, order by etc.
inside the ORM instead of the current implementation would probably
make the ORM cleaner and faster. Of course, without patch this is easy
to claim...

Problem is, I don't have time to do anything about this right now, and
it seems the ORM-knowing core developers have the same problem.

- Anssi
Reply all
Reply to author
Forward
0 new messages