[Django] #9682: icontains can be case-sensitive on MySQL

80 views
Skip to first unread message

Django

unread,
Nov 24, 2008, 1:51:37 PM11/24/08
to djang...@holovaty.com, django-...@googlegroups.com
#9682: icontains can be case-sensitive on MySQL
--------------------------------------------+-------------------------------
Reporter: to.roma.f...@qwertty.com | Owner: nobody
Status: new | Milestone:
Component: Database layer (models, ORM) | Version: 1.0
Keywords: | Stage: Unreviewed
Has_patch: 0 |
--------------------------------------------+-------------------------------
Firstly, this is a known problem, see groups.google.com/group/django-
users/browse_thread/thread/2936ad3387c4b369.

Secondly, I disagree with the ‘You set the database collation and Django
respects that’ approach shown there. For icontains and MySQL, Django
generates ‘x LIKE '%y%'’. So, the programmer has told Django explicitly,
‘I want case-insensitive comparison’, and Django tells MySQL, ‘We want
default comparison’. This is not field_icontains but rather some
field_''usingdefaultsettings''contains.

In fact, here’s how Django defines operators for the MySQL backend:
{{{
operators = {
'exact': '= %s',
'iexact': 'LIKE %s',
'contains': 'LIKE BINARY %s',
'icontains': 'LIKE %s',
'regex': 'REGEXP BINARY %s',
'iregex': 'REGEXP %s',
'gt': '> %s',
'gte': '>= %s',
'lt': '< %s',
'lte': '<= %s',
'startswith': 'LIKE BINARY %s',
'endswith': 'LIKE BINARY %s',
'istartswith': 'LIKE %s',
'iendswith': 'LIKE %s',
}
}}}
So, case-sensitivity is explicitly requested, while case-insensitivity is
implied.

The aforementioned thread says that ‘case-sensitive matching with the
default utf_*_ci collations’ and overall making ‘things like that work
regardless of collation’ is ‘really, really hard’. That sounds strange,
because case-sensitive matching is no harder than a simple memcpy, it’s
case-insensitive matching that is problematic.

However, MySQL doesn’t seem to implement the Unicode collation trickery
correctly. It thinks that 'ı' is LIKE 'i' and 'groß' is not LIKE 'GROSS'.
(It does perform LOWER('COMMITTED' COLLATE utf8_turkish_ci) correctly,
however.) What about altering the above table, making it 'COLLATE
utf8_general_ci LIKE %s' for cases where the default collation isn’t case-
insensitive?

You’re right Unicode is quite hard to implement properly. However, for
something_icontains the database does have to determine some collation,
and if the default collation is case insensitive, some fallback has to be
derived. Maybe the most reasonable solution would be a setting like
FORCE_COLLATION_FOR_CI (maybe MySQL-specific, maybe it can be usable for
other DBMSs), that would determine how to collate values for which no
collation is specified?

It would solve a practical problem: I want iexact to differ from exact and
icontains to differ from contains (please note these completely reasonable
requirements ''conflict'' on current MySQL under current Django
implementation ''regardless'' of collation settings). I’ve read what
Django documentation has to say about collation, and while you might be
justified in having ‘exact’ return what the database considers exact (for
example, for purposes of UNIQUE indices), it’s wrong not to provide a
lookup option that would consistently test for case-sensitive equality
across DBMSs. Maybe in addition to contains and icontains something like
scontains is also needed?

----

To summarize, I propose two things:

1. For those who have case-insensitive collation in MySQL: a lookup option
that would always invoke case-sensitive matching. Maybe the ‘s’ prefix for
‘sensitive’, or ‘b’ for binary: bcontains, bexact, bregex and so on.

2. For case-sensitive people: an option to set a collation for case-
insensitive matches, for example in the settings module.

Both options preserve backwards compatibility and both should be easy to
implement. (People on KDE bugzilla have told me I should never tell them
whether I think something is easy to implement, but nevertheless it’s a
couple of lines for both changes. :-)

--
Ticket URL: <http://code.djangoproject.com/ticket/9682>
Django <http://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

Django

unread,
Nov 24, 2008, 5:41:08 PM11/24/08
to djang...@holovaty.com, django-...@googlegroups.com
#9682: icontains can be case-sensitive on MySQL
-----------------------------------------------------+----------------------
Reporter: to.roma.f...@qwertty.com | Owner: nobody
Status: new | Milestone:
Component: Database layer (models, ORM) | Version: 1.0
Resolution: | Keywords:
Stage: Someday/Maybe | Has_patch: 0
Needs_docs: 0 | Needs_tests: 0
Needs_better_patch: 0 |
-----------------------------------------------------+----------------------
Changes (by kmtracey):

* needs_better_patch: => 0
* stage: Unreviewed => Someday/Maybe
* needs_tests: => 0
* needs_docs: => 0

Comment:

You've left out of your estimate the lines of doc required to explain the
new lookup types and settings you propose, and the estimated person-hours
per year needed to answer questions related to them on their mailing list
and IRC, and to deal with the resulting tickets when someone sets things
up incorrectly, or wants to use different collations for different
tables/columns, etc. :-)

If we were to fix MySQL exact/iexact lookup, this would not be the
approach I would take. I think new lookup types and settings would just
add to the confusion. Rather I'd prefer to make the existing lookups work
properly regardless of database collation.

That's not particularly hard to do either, in terms of lines of code. One
way I experimented with doing it was adding a new call into the database
backend to get an appropriate collation string for a given lookup type and
column datatype. On all DBs except MySQL this was a no-op, but MySQL
returned (for string-type data) either the utf8 binary or case-insensitive
collation as appropriate for the given lookup type. (Given where the
collate was placed it was OK to always use a utf8 collation since Django
always uses utf8 on the DB connection.) This worked correctly but had
hideous performance on my DB, which happens to use latin1 encoding, since
the specification of a utf8 collation for a latin1 column resulted in
inability to use indexes for the lookups (see last comment on #8102).

At that point I dropped looking into it, since I don't actually need
anything beyond what Django does today for my own use -- it was mainly an
academic exercise to see if I couldn't get both lookup types working
properly regardless of collation on MySQL. If I were to look at it again
I'd be seeing if I can't somehow get from MySQLdb what the collation in
use is and only specify an alternate (utf8) one if necessary, or see if
different placement of collate in the statement and use of a collation
that matched the DB charset in use didn't cause such a performance hit,
etc. That is I'd prefer to put the complexity of fixing this into the
code rather than push it off onto the user in the form of alternative
lookup types and settings.

I'm going to put this into Someday/Maybe because this is a wart that would
be nice to fix. But (I believe) we need a solution that doesn't add new
lookup types or settings, performs well for MySQL default configurations,
and does not add too much complexity to the database layer (I do not even
know if my collate lookup call would have been accepted).

--
Ticket URL: <http://code.djangoproject.com/ticket/9682#comment:1>

Django

unread,
Nov 25, 2008, 6:56:35 AM11/25/08
to djang...@holovaty.com, django-...@googlegroups.com
#9682: icontains can be case-sensitive on MySQL
-----------------------------------------------------+----------------------
Reporter: to.roma.f...@qwertty.com | Owner: nobody
Status: new | Milestone:
Component: Database layer (models, ORM) | Version: 1.0
Resolution: | Keywords:
Stage: Someday/Maybe | Has_patch: 0
Needs_docs: 0 | Needs_tests: 0
Needs_better_patch: 0 |
-----------------------------------------------------+----------------------
Comment (by to.roma.f...@qwertty.com):

Thanks for the quick reply.

I think that three lookup types are actually needed. One case-sensitive,
one case-insensitive, and one that uses the default database settings.
Consider:
{{{
if model.objects.filter(pk="TeSt").count() == 0:
...
model.objects.create(pk="TeSt", ...)
}}}
Here the intent is to check whether a record with the given PK would be
allowed by the database engine, regardless of whether the field has case-
sensitive collation or not. Neither _exact nor _inexact would work for
sure.

By the way, what about cutting this Gordian knot by simply making _exact
strictly case-sensitive, _inexact strictly insensitive, and if none is
specified, using the database settings? This would be compatible with
existing code that simply says get(fieldname="string"), it would differ in
_exact becoming completely exact (this is what could break code, but
hardly much of it).

Another issue I seem not to have expressed clearly is the collation for
fields that are case-sensitive by default when case-insensitive matching
is requested. There’s nowhere for MySQL to derive that from. Suppose I
look for status_iexact="committed"; should "COMMITTED" match? It should
unless Turkish collation is needed, and even then one might argue that
while "Türkiye" = "TÜRKİYE", "committed" != "COMMİTTED" but rather
"committed" = "COMMITTED" (the rather well-known SVN bug). So the perfect
solution would be a way to specify a collation for each string lookup, but
until then at least a way to specify global case-insensitive collation
like my patch does is required. You see, we do need a way to be able to
compare utf8_bin fields case-insensitively, Django doesn’t provide a
single means to do that.

--
Ticket URL: <http://code.djangoproject.com/ticket/9682#comment:2>

Django

unread,
Nov 25, 2008, 7:07:39 AM11/25/08
to djang...@holovaty.com, django-...@googlegroups.com
#9682: icontains can be case-sensitive on MySQL
-----------------------------------------------------+----------------------
Reporter: to.roma.f...@qwertty.com | Owner: nobody
Status: new | Milestone:
Component: Database layer (models, ORM) | Version: 1.0
Resolution: | Keywords:
Stage: Someday/Maybe | Has_patch: 0
Needs_docs: 0 | Needs_tests: 0
Needs_better_patch: 0 |
-----------------------------------------------------+----------------------
Comment (by mtredinnick):

Please go back and look at the long history of this issue in other tickets
and on the django-developers mailing list before suggesting that "hardly
breaking much code" is something we should do. Firstly, breaking any code
isn't an option (there's no way for people to tell that there code is
suddenly broken and it's more or less locked in by our API stability
guarantee). Secondly, there are lots of subtle corners here, as you'll see
from related tickets in Trac and they are precisely the worst cases to
break in any way.

The two big problems are that MySQL's collation stuff is a little awkward
to work with and the MySQLdb wrapper isn't currently able to handle things
properly. We have to work with the restrictions we have.

--
Ticket URL: <http://code.djangoproject.com/ticket/9682#comment:3>

Django

unread,
Nov 25, 2008, 7:47:53 AM11/25/08
to djang...@holovaty.com, django-...@googlegroups.com
#9682: icontains can be case-sensitive on MySQL
-----------------------------------------------------+----------------------
Reporter: to.roma.f...@qwertty.com | Owner: nobody
Status: new | Milestone:
Component: Database layer (models, ORM) | Version: 1.0
Resolution: | Keywords:
Stage: Someday/Maybe | Has_patch: 0
Needs_docs: 0 | Needs_tests: 0
Needs_better_patch: 0 |
-----------------------------------------------------+----------------------
Comment (by to.roma.f...@qwertty.com):

Let me get it very straight and simple.

1. Currently either exact = iexact or contains = icontains, there’s no
setting under which all four would be distinct, which is definitely an
error.

2. I propose adding one entry to the settings module, the patch above has
perfect backward compatibility and makes icontains behave reasonably on
utf8_bin fields.

3. It would be great to be able to specify collation explicitly for all
kinds of lookups, for example, to order records differently for German
("Ö" < "P") and Swedish ("Ö" > "Z") users, but I understand it needs
design decisions and can be hard to implement. (It could look like
‘country_iexact=collate("TÜRKİYE", "tr_TR.UTF-8")’ and
‘order_by=[collate("last_name", "de_CH.UTF-8")]’.)

Yes, I did look at other issues and I searched the mailing lists. Of other
issues #7789 is like this one, but the result of it was ‘Changed "exact"
matches in MySQL to use the database's native collation’ and that bug was
about having _exact do the same comparison the database does internally
for PK, but what I request here is, simply put, a way of using icontains
for utf8_bin fields—while, as I said in the bug description, having _exact
be CI or CS depending on configuration can be acceptable, _icontains must
be CI as the documentation describes it. Don’t you find the definitions of
exact, iexact, contains and icontains in the code excerpt above not
completely correct?

--
Ticket URL: <http://code.djangoproject.com/ticket/9682#comment:4>

Django

unread,
Apr 7, 2009, 7:40:42 PM4/7/09
to djang...@holovaty.com, django-...@googlegroups.com
#9682: icontains can be case-sensitive on MySQL
-----------------------------------------------------+----------------------
Reporter: to.roma.f...@qwertty.com | Owner: nobody
Status: new | Milestone:
Component: Database layer (models, ORM) | Version: 1.0
Resolution: | Keywords:
Stage: Someday/Maybe | Has_patch: 0
Needs_docs: 0 | Needs_tests: 0
Needs_better_patch: 0 |
-----------------------------------------------------+----------------------
Comment (by mykhal):

I strongly vote for this. It solves the problem how to insert "COLLATE
<collation>" into the query, invoked by normal filters (without writing
SQL from scratch).

Why would one need to use COLLATE? E.g. in case, when he has a database
containing strings with accented characters, and this database is in the
different collation than utf8_general_ci - e.g. utf8_czech_ci. In this
case, sql select e.g. "{{{SELECT ... LIKE %cviceni% ...}}}" does not hit
strings containing "cvičení", but it does, when "{{{SELECT ... COLLATE
utf8_general_ci LIKE %cviceni% ...}}}" . What is interesting, if {{{ORDER
BY}}} is used, results are ordered by the original database collation
(czech in this case), what might be exactly what he needs.

--
Ticket URL: <http://code.djangoproject.com/ticket/9682#comment:5>

Django

unread,
Sep 20, 2010, 10:08:45 AM9/20/10
to djang...@holovaty.com, django-...@googlegroups.com
#9682: icontains can be case-sensitive on MySQL
-----------------------------------------------------+----------------------
Reporter: to.roma.f...@qwertty.com | Owner: nobody
Status: new | Milestone:
Component: Database layer (models, ORM) | Version: 1.0
Resolution: | Keywords:
Stage: Someday/Maybe | Has_patch: 0
Needs_docs: 0 | Needs_tests: 0
Needs_better_patch: 0 |
-----------------------------------------------------+----------------------
Comment (by jakub):

Specifying COLLATE in comparisons and ORDER BY clause is a very common
MySQL pattern when dealing with languages with accented characters. I'd
love see this in Django.

--
Ticket URL: <http://code.djangoproject.com/ticket/9682#comment:6>
Reply all
Reply to author
Forward
0 new messages