#9682: icontains can be case-sensitive on MySQL
--------------------------------------------+-------------------------------
Reporter:
to.roma.f...@qwertty.com | Owner: nobody
Status: new | Milestone:
Component: Database layer (models, ORM) | Version: 1.0
Keywords: | Stage: Unreviewed
Has_patch: 0 |
--------------------------------------------+-------------------------------
Firstly, this is a known problem, see
groups.google.com/group/django-
users/browse_thread/thread/2936ad3387c4b369.
Secondly, I disagree with the ‘You set the database collation and Django
respects that’ approach shown there. For icontains and MySQL, Django
generates ‘x LIKE '%y%'’. So, the programmer has told Django explicitly,
‘I want case-insensitive comparison’, and Django tells MySQL, ‘We want
default comparison’. This is not field_icontains but rather some
field_''usingdefaultsettings''contains.
In fact, here’s how Django defines operators for the MySQL backend:
{{{
operators = {
'exact': '= %s',
'iexact': 'LIKE %s',
'contains': 'LIKE BINARY %s',
'icontains': 'LIKE %s',
'regex': 'REGEXP BINARY %s',
'iregex': 'REGEXP %s',
'gt': '> %s',
'gte': '>= %s',
'lt': '< %s',
'lte': '<= %s',
'startswith': 'LIKE BINARY %s',
'endswith': 'LIKE BINARY %s',
'istartswith': 'LIKE %s',
'iendswith': 'LIKE %s',
}
}}}
So, case-sensitivity is explicitly requested, while case-insensitivity is
implied.
The aforementioned thread says that ‘case-sensitive matching with the
default utf_*_ci collations’ and overall making ‘things like that work
regardless of collation’ is ‘really, really hard’. That sounds strange,
because case-sensitive matching is no harder than a simple memcpy, it’s
case-insensitive matching that is problematic.
However, MySQL doesn’t seem to implement the Unicode collation trickery
correctly. It thinks that 'ı' is LIKE 'i' and 'groß' is not LIKE 'GROSS'.
(It does perform LOWER('COMMITTED' COLLATE utf8_turkish_ci) correctly,
however.) What about altering the above table, making it 'COLLATE
utf8_general_ci LIKE %s' for cases where the default collation isn’t case-
insensitive?
You’re right Unicode is quite hard to implement properly. However, for
something_icontains the database does have to determine some collation,
and if the default collation is case insensitive, some fallback has to be
derived. Maybe the most reasonable solution would be a setting like
FORCE_COLLATION_FOR_CI (maybe MySQL-specific, maybe it can be usable for
other DBMSs), that would determine how to collate values for which no
collation is specified?
It would solve a practical problem: I want iexact to differ from exact and
icontains to differ from contains (please note these completely reasonable
requirements ''conflict'' on current MySQL under current Django
implementation ''regardless'' of collation settings). I’ve read what
Django documentation has to say about collation, and while you might be
justified in having ‘exact’ return what the database considers exact (for
example, for purposes of UNIQUE indices), it’s wrong not to provide a
lookup option that would consistently test for case-sensitive equality
across DBMSs. Maybe in addition to contains and icontains something like
scontains is also needed?
----
To summarize, I propose two things:
1. For those who have case-insensitive collation in MySQL: a lookup option
that would always invoke case-sensitive matching. Maybe the ‘s’ prefix for
‘sensitive’, or ‘b’ for binary: bcontains, bexact, bregex and so on.
2. For case-sensitive people: an option to set a collation for case-
insensitive matches, for example in the settings module.
Both options preserve backwards compatibility and both should be easy to
implement. (People on KDE bugzilla have told me I should never tell them
whether I think something is easy to implement, but nevertheless it’s a
couple of lines for both changes. :-)
--
Ticket URL: <
http://code.djangoproject.com/ticket/9682>
Django <
http://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.