Proposal to not implicitly create varchar/text-pattern opclass indexes on PostgreSQL

138 views
Skip to first unread message

Hannes Ljungberg

unread,
Apr 12, 2020, 10:30:50 AM4/12/20
to Django developers (Contributions to Django itself)
Hi all, 

I would like to continue the discussion started in this very old thread:

I’m sorry if I should've continued the discussion in that thread but it felt a bit wrong to bring a 5 year old thread back to life :-)

Anyway, as previously described in that thread the implicit creation of the `*_pattern_ops` index when creating indexes on `CharField` and `TextField` with `db_index=True` is not ideal. 

In my experience `LIKE` expressions is not that common that it warrants to always create an index to cover this. 

For very large tables this can become a problem where insertion/update performance is negatively affected and disc space usage is twice what really is needed if no `LIKE` queries are used.

And even if one would like to use `LIKE` expressions it’s not obvious that the `*_pattern_ops` is the correct index. For leading wildcard patters as `LIKE %foo` one has to use a GIN/GiST index with the `*_trgm_ops` opclass. With the current implementation we would end up with 3 indexes when 2 would be sufficient for this use case (the regular b-tree and the trigram).

One could also argue that we’re not consistent with these implicit indexes. The `iexact`/`icontains` lookups require an expression index on `(UPPER(col))`  but that’s not created.

One important detail is that this implicit index is _not_ created when using the class based `Index` . In my opinion it’s not very clear that one needs to handle the creation of a `*_pattern_ops` index manually when using it.  

This is the only documentation that I’ve been able to find about the creation of these implicit indexes: https://docs.djangoproject.com/en/dev/ref/databases/#indexes-for-varchar-and-text-columns

My proposal is to remove the implicit index created by `db_index=True` and add documentation that one should use `Index.opclasses` to utilise indexes for `LIKE` queries on PostgresSQL. This would give a more consistent behaviour and being more explicit about this would help people tune their indexes.

If there’s a consensus on this I would like to give this a shot. If there’s not an agreement on removing the implicit index I think we should at least make the documentation around this a bit more clear about the gotchas of `Index` vs `db_index=True`.

Here’s two tickets which have previously discussed this:

Stay safe!

Tim Graham

unread,
Apr 13, 2020, 9:01:03 PM4/13/20
to Django developers (Contributions to Django itself)
I have some sympathy for this issue as I'm trying to make the createcachetable management command use SchemaEditor rather than some custom SQL construction logic*. The related problem I ran into is that the primary key column (a CharField) uses unique=True which means those undesired opclasses indexes are created. I couldn't find a way to prevent that index besides filtering it out of the list of SQL statements.

As for your proposal, how would you handle the upgrade path for existing projects? I imagine we could provide a script to run upon upgrade to remove all such existing indexes. No doubt some users won't run it. Would you keep around the code in Django's SchemaEditor like https://github.com/django/django/blob/53d229ff632c2a3e547f2820a94239f38ba4d4ac/django/db/backends/postgresql/schema.py#L178-L180 that assumes those indexes exist?

If Django stops creating those indexes, it could create a somewhat murky situation for some developers as they try to figure out the state of indexes in their database. It might depend on which version of Django was in use when certain migrations ran.

Some developers might be left debugging performance issues if those indexes are removed. It could be helpful to gives some numbers as to the possible performance impact on applications if these indexes are removed without a developer realizing it.

Third-party apps that need it could add an Index with opclasses but then they'd face the issue of duplicate opclasses indexes if their app is used on an older version of Django.

Hannes Ljungberg

unread,
Apr 14, 2020, 8:18:23 AM4/14/20
to Django developers (Contributions to Django itself)
Thanks for your reply Tim,

I also fail to see a "clean" upgrade path. The way I’ve been thinking of, is to just as you wrote, keep the code for deletion of the index so people who migrate to the Index-class will get it removed properly. Since it uses `IF EXISTS` it should be future-proof. We could add a system check that uses introspection to check if the `*_like` index is present on affected fields with `db_index=True` and warn about its deprecation. Then after a couple (?) of versions we could remove all traces of this index. 

The negative performance impact of removing this index for users of `LIKE` queries is very large but it’s hard to give numbers since it depends on the amount and nature of the data in the affected rows. Since we wouldn’t do any automatic removal of the index and add documentation about the importance of the opclass-indexes for `LIKE` queries I’m not that worried.

I realise that removing the creation of these indexes result in a not so pretty deprecation path but I think users of Django would benefit from having to be more explicit about their indexes and learn about the quirks of their database engines.

Adam Johnson

unread,
Apr 14, 2020, 9:01:48 AM4/14/20
to django-d...@googlegroups.com
For some numbers: I just checked on a long-running client project using PostgreSQL, and it looks like several of the *_like  indexes are in use:

core=> select right(indexrelname, 10), idx_tup_read, idx_tup_fetch from pg_stat_all_indexes where indexrelname like '%_like' order by 1;
   right    | idx_tup_read | idx_tup_fetch
------------+--------------+---------------
 090f8_like |       168181 |        168143
 0b77e_like |            0 |             0
 0cc84_like |            0 |             0
 1ab7c_like |         6260 |          6137
 32a5e_like |      2443202 |       2412585
 33678_like |            0 |             0
 37b91_like |            0 |             0
 6c02a_like |           15 |            15
 75088_like |        56746 |          9640
 90e0f_like |         8371 |          8089
 a08ec_like |         2815 |          2815
 b4ddf_like |            0 |             0
(12 rows)

The first highest read index is accidental and the query will be rewritten.
The second-highest read index here though is from the third party app django-oauth-toolkit: https://github.com/jazzband/django-oauth-toolkit/blob/master/oauth2_provider/models.py#L284  .
The other indexes are on small tables and I'm not concerned about them from a performance perspective.


Perhaps it's possible to use these index statistics in the system check to prioritize which indexes are warned about.

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/6b539bf5-4de2-434f-b022-8729bf49a593%40googlegroups.com.


--
Adam

charettes

unread,
Apr 14, 2020, 9:17:14 AM4/14/20
to Django developers (Contributions to Django itself)
Makes we wonder under which circumstances an OAuth token would need to be looked up using LIKE. I would expect the unique b-tree index which covers exact lookups to be sufficient.

Simon

Le mardi 14 avril 2020 09:01:48 UTC-4, Adam Johnson a écrit :
For some numbers: I just checked on a long-running client project using PostgreSQL, and it looks like several of the *_like  indexes are in use:

core=> select right(indexrelname, 10), idx_tup_read, idx_tup_fetch from pg_stat_all_indexes where indexrelname like '%_like' order by 1;
   right    | idx_tup_read | idx_tup_fetch
------------+--------------+---------------
 090f8_like |       168181 |        168143
 0b77e_like |            0 |             0
 0cc84_like |            0 |             0
 1ab7c_like |         6260 |          6137
 32a5e_like |      2443202 |       2412585
 33678_like |            0 |             0
 37b91_like |            0 |             0
 6c02a_like |           15 |            15
 75088_like |        56746 |          9640
 90e0f_like |         8371 |          8089
 a08ec_like |         2815 |          2815
 b4ddf_like |            0 |             0
(12 rows)

The first highest read index is accidental and the query will be rewritten.
The second-highest read index here though is from the third party app django-oauth-toolkit: https://github.com/jazzband/django-oauth-toolkit/blob/master/oauth2_provider/models.py#L284  .
The other indexes are on small tables and I'm not concerned about them from a performance perspective.


Perhaps it's possible to use these index statistics in the system check to prioritize which indexes are warned about.

To unsubscribe from this group and stop receiving emails from it, send an email to django-d...@googlegroups.com.


--
Adam

Adam Johnson

unread,
Apr 14, 2020, 12:07:30 PM4/14/20
to django-d...@googlegroups.com
Yes that one is odd. Looking more closely, django-oauth-toolkit doesn't make any LIKE queries, and doesn't seem to ever have done so. It seems rather that PostgreSQL is using the index to fulfill the unique constraint, as the corresponding '_uniq' index is not being touched:

core=> select indexrelname, idx_tup_read, idx_tup_fetch from pg_stat_all_indexes where indexrelname like 'oauth2_provider_accesstoken_token_%' order by 1;
                  indexrelname                   | idx_tup_read | idx_tup_fetch
-------------------------------------------------+--------------+---------------
 oauth2_provider_accesstoken_token_8af090f8_like |       169875 |        169835
 oauth2_provider_accesstoken_token_8af090f8_uniq |            0 |             0
(2 rows)

I haven't verified all the queries being made though.

May still be useful for any check added to Django.

To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/72881ebc-73d7-4cf2-ad1e-8c9e1274e2f8%40googlegroups.com.


--
Adam

Hannes Ljungberg

unread,
Apr 14, 2020, 1:59:30 PM4/14/20
to Django developers (Contributions to Django itself)
The `*_pattern_ops` operators all support ordinary equality comparisons but not  <, <=, >, or >= comparisons. So a regular `WHERE col = 1` would be able to use a `_like` index. See: https://www.postgresql.org/docs/current/indexes-opclass.html

But it's still interesting that the query planner always choose the non-unique index, one would think that the UNIQUE-index would weigh higher even if the cardinality is the same.

Den tisdag 14 april 2020 kl. 18:07:30 UTC+2 skrev Adam Johnson:
Yes that one is odd. Looking more closely, django-oauth-toolkit doesn't make any LIKE queries, and doesn't seem to ever have done so. It seems rather that PostgreSQL is using the index to fulfill the unique constraint, as the corresponding '_uniq' index is not being touched:

core=> select indexrelname, idx_tup_read, idx_tup_fetch from pg_stat_all_indexes where indexrelname like 'oauth2_provider_accesstoken_token_%' order by 1;
                  indexrelname                   | idx_tup_read | idx_tup_fetch
-------------------------------------------------+--------------+---------------
 oauth2_provider_accesstoken_token_8af090f8_like |       169875 |        169835
 oauth2_provider_accesstoken_token_8af090f8_uniq |            0 |             0
(2 rows)

I haven't verified all the queries being made though.

May still be useful for any check added to Django.

Reply all
Reply to author
Forward
0 new messages