Custom collation support

Tom Carrick

unread,

May 27, 2020, 8:18:25 AM5/27/20

to django-d...@googlegroups.com

I think it would be useful to be able to create collations and use them with model fields. The motivation here is mostly that citext is somewhat discouraged in favour of creating a collation. I'm not sure how this would work on other databases,or what the API would look like. I didn't see a ticket for it or another discussion, but maybe there is one.

Perhaps a Collation class would make sense, and it could be added to a Field with a new parameter. I'm not sure how easy it would be to only create a collation once, so perhaps it would need a CreateCollation migration as well, but I know very little about the internals of migrations.

Cheers,

Tom

Tom Carrick

unread,

Jul 17, 2020, 7:56:08 AM7/17/20

to django-d...@googlegroups.com

I've had a deeper look at this now and think I have an API proposal. First, the state of supported vendors:

1. All vendors support adding a collation to text/varchar fields.

2. The syntax is more or less the same.

3. However, the collation names themselves are different.

4. PostgreSQL is the only vendor that allows creating custom collations at runtime.

So I'm thinking we add a new `collation` parameter to `CharField` and `TextField`, that simply takes a string of the collation ID. I'm not quite sure on the implementation as I don't know the ORM that well, but my naive approach would be to just add a new format string to the `data_types` dict that is calculated during the field __init__(), either an empty string to use the default collation, or e.g. ' collate <collation_id>'. There's may well be a better approach.

By using this - because the collation names are not the same across vendors - the user is saying "I'm okay with this only working on one database vendor", so there should be a warning in the docs. There is perhaps some scope in the future to make this take a callable that can figure out the collation per-database. This would be useful for getting case-insensitive lookups working across all backends, for example. But I want to keep that out of the scope because it's some extra work and I'm not sure on the implementation.

Another downside is that people like to use CharField as a base class for other column types that might not support collations, but I think this should be in the user's hands to make sure they aren't doing that.

We should also add a `CreateCollation` operation for Postgres, similar to the `CreateExtension` operation that currently exists. If the user wants to use a custom collation they must create it first, similar to using extensions currently.

The advantage to this is that users can use collations without having to make SQL migrations, which I think would be nice. The really nice thing is the ability to have case-insensitive lookups that work across all database vendors, rather than only Postgres as it currently is. And as I mentioned in the previous message, Postgres is discouraging our current method of using the citext extension in favour of this approach.

Cheers,

Tom

Tom Carrick

unread,

Jul 18, 2020, 7:29:41 AM7/18/20

to django-d...@googlegroups.com

I've written a proof of concept: https://github.com/django/django/pull/13207

The diff is quite small, though I'm not sure if there's something I'm doing wrong - this is my first foray into schema migration internals so I'm learning as I read the codebase.

Tom

Aymeric Augustin

unread,

Jul 19, 2020, 3:28:23 AM7/19/20

to django-d...@googlegroups.com

Hello Tom,

Just wanted to give you some encouragement here, as you've been monologuing for two months ;-) The PR looks promising!

Regarding migrations, I'm not seeing how the collation for a column will be set or unset in the database if you change it in the code e.g.:

MySQL : ALTER TABLE ... MODIFY ... VARCHAR(...) COLLATE ...

PostgreSQL : ALTER TABLE ... ALTER COLUMN ... TYPE varchar(...) COLLATE ...

I think this needs to be in scope, else it will be hard for maintainers of existing projects to take advantage of this new feature.

--

Aymeric.

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/CAHoz%3DMbyJ1qLNCxmTh6JdsP-e1UoYwdWJm8KL3q8MgRAQaD5sA%40mail.gmail.com.

Adam Johnson

unread,

Jul 19, 2020, 3:55:44 AM7/19/20

to django-d...@googlegroups.com

Yes I'd also like to lend my support.

In MySQL (at least) columns also have a charset property that goes hand-in-hand with which collations are available. I know it expands the scope, but I think it would be good to get that in here. Perhaps it could be a follow-up ticket though.

To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/7441F0D5-B658-4FC5-BA13-C183C470A3CB%40polytechnique.org.

--

Adam

Tom Carrick

unread,

Jul 19, 2020, 9:57:58 AM7/19/20

to django-d...@googlegroups.com

Thanks for the feedback.

Aymeric, yes, I left out modification until I knew there was some interest as that code seemed more impenetrable to me than the field addition. I've added this now, does it seem like the right approach? I've tested it on everything but Oracle and it seems to work as I'd expect. If it generally looks good (and when/if the ticket is accepted) I'll write up tests and docs... I'm not sure how tests are going to work though. Since the collation names are different I think I need to write some interesting migrations that depend on the database vendor... do we do anything like this anywhere else?

Adam, I want to go with a followup ticket for these reasons:

1. I'm new to the schema migration code and I don't want to take on too much at once.

2. I haven't used MySQL in a long time so I'm not sure how to test it.

3. The code here works across all backends, I'm not sure how it'd work for the charset. Would we add a new kwarg to Char/TextField just for MySQL? I think there are a few more questions to answer before tackling it, but I agree that if we add collations, we should add charsets for MySQL too, as they're so closely related.

Tom

To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/CAMyDDM1pmp7Cw5igDaOf6KbHOYWY_HdgE550ia8W7X_9iSRWRQ%40mail.gmail.com.

charettes

unread,

Jul 19, 2020, 10:48:08 PM7/19/20

to Django developers (Contributions to Django itself)

Thanks for your work so far Tom!

> There is perhaps some scope in the future to make this take a callable that can figure out the collation per-database. This would be useful for getting case-insensitive lookups working across all backends, for example. But I want to keep that out of the scope because it's some extra work and I'm not sure on the implementation.

I like where it's heading but I wonder if it'd be more appropriate to start by supporting a _compilable_ object instead of a string at first. By compilable I mean an object that has an as_sql(connection) method to allow per-backend logic to be applied. For example, one might define Collation(casesensitive=False) compilable that translates to `NOCASE` on SQLite, and equivalent on other databases. I'm just throwing that out there, maybe there's no sane of making that work in a database agnostic way without adding much and we should start by only allowing a string as you've suggested.

Another question I had is whether or not the option should be named db_collation like we do for db_index, db_column, and db_tablespace since it's a database only option. It feels like using the db_ prefix would make most sense here particularly if we only allow string collations to be specified at first.

> I'm new to the schema migration code and I don't want to take on too much at once.

I'd focus your PR on solely adding the option to specify a column collation in this case. The contrib.postgres migration operations to create and delete collation could be useful but it increases the scope of your proposed changed and will slow down its review and affect its mid-term mergeability. Given it can be achieved using RunSQL operations only the Field.db_collation extension point is

+1 to what Aymeric said regarding field alterations. Adding or removing a collation from a CharField/TextField should generate an AlterField operation that adds or remove the COLLATE clause from the column definition. Looks like you already implemented the schema editor bits of it but it should also be tested which will possibly involve writing some database introspection bit to assert the operation was correctly performed.

> In MySQL (at least) columns also have a charset property that goes hand-in-hand with which collations are available. I know it expands the scope, but I think it would be good to get that in here. Perhaps it could be a follow-up ticket though.

> The code here works across all backends, I'm not sure how it'd work for the charset. Would we add a new kwarg to Char/TextField just for MySQL? I think there are a few more questions to answer before tackling it, but I agree that if we add collations, we should add charsets for MySQL too, as they're so closely related.

I guess it should be handled in a different ticket and discussed before hand. Not sure how common it is to want to define a per-column charset and since it would be already indirectly achievable by defining a collation (MySQL charset defaults to the charset of the collation when only the latter is specified, not sure about Oracle) it feels like the addition of a top level CharField/TextField option to allow specifying a charset that uses its default collation is not warranted.

Simon

Adam Johnson

unread,

Jul 20, 2020, 5:36:52 AM7/20/20

to django-d...@googlegroups.com

I like where it's heading but I wonder if it'd be more appropriate to start by supporting a _compilable_ object instead of a string at first.

I like this idea, but I think we'd still want to offer the option of using strings, since most projects run only against one database backend. Compilable objects would be more effort so perhaps best to leave for now, at least until the PR is much more developed.

Another question I had is whether or not the option should be named db_collation like we do for db_index, db_column, and db_tablespace since it's a database only option. It feels like using the db_ prefix would make most sense here particularly if we only allow string collations to be specified at first.

+1 to this.

The contrib.postgres migration operations to create and delete collation could be useful but it increases the scope of your proposed changed and will slow down its review and affect its mid-term mergeability.

Yes, it is probably best to extract the operations to another ticket/PR.

To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/c831ec4c-2aa8-4cbc-b41f-c5e3b654a9f3n%40googlegroups.com.

--

Adam

charettes

unread,

Jul 20, 2020, 9:35:18 AM7/20/20

to Django developers (Contributions to Django itself)

> I like this idea, but I think we'd still want to offer the option of using strings, since most projects run only against one database backend. Compilable objects would be more effort so perhaps best to leave for now, at least until the PR is much more developed.

Makes sense, lets focus on only adding support for db_collation: str for now then.

Simon

Tom Carrick

unread,

Nov 17, 2020, 10:00:53 AM11/17/20

to django-d...@googlegroups.com

I just want to follow up on this as there is one remaining collation related ticket left: https://code.djangoproject.com/ticket/5745

Reading through, it requests two things: a setting in DATABASES for a collation and a model Meta option for table collations.

As far as I'm aware, the user is responsible for creating the database so I'm not sure why a setting is needed here. Maybe someone can enlighten me.

For table collations, only MySQL supports this as far as I know. It will also only change the default collation for the table, so if you have existing columns, it won't change their collations, but will set them for new columns. This is good but could be confusing if you're only reading the code, you'd still need to look at the database to know which collation each column has -- I guess this is technically true already though. It's probably not a lot of work to add this, but I'm also not sure if it's worth doing.

Any thoughts? It'd be good to close the ticket, one way or another.

Tom

To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/755553b9-ce4c-41eb-aa0a-b58ce5df9099n%40googlegroups.com.

Adam Johnson

unread,

Nov 17, 2020, 11:16:55 AM11/17/20

to django-d...@googlegroups.com

As far as I'm aware, the user is responsible for creating the database so I'm not sure why a setting is needed here. Maybe someone can enlighten me.

This is not the case in tests, so a setting would be needed there at least.

(Also I don't like the "create your own DB" approach of Django, it would be nice if it could create it for you, or if it exists set its charset/collation to the configured values.)

For table collations, only MySQL supports this as far as I know. It will also only change the default collation for the table, so if you have existing columns, it won't change their collations, but will set them for new columns. This is good but could be confusing if you're only reading the code, you'd still need to look at the database to know which collation each column has -- I guess this is technically true already though. It's probably not a lot of work to add this, but I'm also not sure if it's worth doing.

This is true. Potentially the migration operation that sets table collation should set the per-column collations, where not individually configured, to work around this. If not we should document this limitation.

To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/CAHoz%3DMYEwyUGTFHqakNx5rG4%2BXjN-s-xRn7PJmpUJTMY8p7aXw%40mail.gmail.com.