[Django] #32803: Not obvious interaction between Postgresql trgm index and icontains lookup

6 views

Skip to first unread message

Django

unread,

Jun 1, 2021, 6:50:10 AM6/1/21

to django-...@googlegroups.com

#32803: Not obvious interaction between Postgresql trgm index and icontains lookup
-------------------------------------+-------------------------------------
Reporter: Dmitri | Owner: (none)
Emelianov |
Type: | Status: new
Uncategorized |
Component: | Version: dev
contrib.postgres | Keywords: postgresql
Severity: Normal | icontains lookup search
Triage Stage: | Has patch: 0
Unreviewed |
Needs documentation: 0 | Needs tests: 0
Patch needs improvement: 0 | Easy pickings: 0
UI/UX: 0 |
-------------------------------------+-------------------------------------
Hi! Long post ahead.

tldr; standard lookup is not working with standard index as expected

Using LIKE UPPER in icontains is not an accident, it is even stated that
it's
[https://github.com/django/django/blob/ca9872905559026af82000e46cde6f7dedc897b6/django/db/backends/postgresql/operations.py#L102
faster]

{{{
# Use UPPER(x) for case-insensitive lookups; it's faster.
if lookup_type in ('iexact', 'icontains', 'istartswith', 'iendswith'):
lookup = 'UPPER(%s)' % lookup
}}}

However, the main reason I've created the ticket is not the speed, but
usage icontains together with trgm index.

Let's create a simple test model:

{{{
class Book(models.Model):
name = models.TextField()
}}}

and prefil it with kind of random data.

Random function (idea taken from
https://stackoverflow.com/a/5478750/1185696):
{{{
CREATE OR REPLACE FUNCTION get_random_name() RETURNS text as $$
SELECT array_to_string(array
(
select
substr('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789',
trunc(random() * 62)::integer + 1, 1)
FROM generate_series(1, (random() * 15)::integer)), '') ||
' ' || array_to_string(array
(
select
substr('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789',
trunc(random() * 62)::integer + 1, 1)
from generate_series(1, (random() * 15)::integer)), '');
$$
VOLATILE
LANGUAGE SQL;

-- Example output: M2Fib gJr4CHLnZHcr
-- I want data to be kind of "real" if it is could be said about random
string
}}}

->
{{{
INSERT INTO book_book(name) SELECT get_random_name() FROM
generate_series(1, 10000);
}}}

Now let's do some search.
{{{
Book.objects.filter(name__icontains='orc')

# SELECT "book_book"."id", "book_book"."name" FROM "book_book" WHERE
UPPER("book_book"."name"::text) LIKE UPPER('%orc%') ORDER BY
"book_book"."id" ASC
# -> Seq Scan on book_book (cost=0.00..220.00 rows=80 width=24) (actual
time=5.755..5.755 rows=0 loops=1)
# Filter: (upper(name) ~~ '%ORC%'::text)
# Rows Removed by Filter: 9996
}}}

Seq scan, totally understandable, no complains here. Now let's do 1m
entries:
{{{
-> Parallel Seq Scan on book_book (cost=0.00..8865.51 rows=3098
width=24) (actual time=2.024..341.389 rows=143 loops=3)
Filter: (upper(name) ~~ '%ORC%'::text)
Rows Removed by Filter: 369857
Planning Time: 0.214 ms
Execution Time: 374.926 ms
}}}

Still OK, but let's do better by adding indices. Since pg_trgm is very
popular solution for text search, let's use it (I am not sure I know the
alternatives aside from tsvector for such kind of queries):
{{{
indexes = [
GinIndex(name='test_index', fields=['name'],
opclasses=['gin_trgm_ops']),
]
}}}
Repeat query:
{{{
-> Parallel Seq Scan on book_book (cost=0.00..14641.50 rows=46 width=24)
(actual time=331.265..331.266 rows=0 loops=3)
Filter: (upper(name) ~~ '%ORC%'::text)
Rows Removed by Filter: 370000
Planning Time: 1.524 ms
Execution Time: 368.272 ms
}}}
Looks like index is not working.

Change query a bit:
{{{
EXPLAIN ANALYZE SELECT "book_book"."id", "book_book"."name" FROM
"book_book" WHERE "book_book"."name"::text ILIKE '%orc%' ORDER BY
"book_book"."id" ASC;
}}}
->
{{{
-> Bitmap Heap Scan on book_book (cost=20.86..426.28 rows=111
width=24) (actual time=0.158..1.441 rows=430 loops=1)
Recheck Cond: (name ~~* '%orc%'::text)
Heap Blocks: exact=414
-> Bitmap Index Scan on test_index (cost=0.00..20.83 rows=111
width=0) (actual time=0.075..0.076 rows=430 loops=1)
Index Cond: (name ~~* '%orc%'::text)
Planning Time: 0.178 ms
Execution Time: 1.553 ms
}}}

I call "UPPPER" function usage as "implementation detail", because
postgres have standard operator. And the main issue is that implementation
detail adds additional complexity to index and you basically have to
remember to use "upper" in all trgm indices. icontains is standard lookup.
trgm index is kind of standard these days. I believe, one would expect
pretty straightforward interaction between lookup and index (lookup should
become faster). Current behavior, when default index is basically not
working for default lookup is not obvious for me. This can be easily
avoided just by using standard, out of the box operator.

Now let's do some measures to check if it is faster or not. I will use
same data generator, I will use "orc" query always, I will always pick
average of 3 measurements. No indices applied.
{{{
PostgreSQL 12.6 (Ubuntu 12.6-0ubuntu0.20.04.1) on x86_64-pc-linux-gnu,
compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit
+
en_US.UTF-8 collation
}}}

Results:
{{{
10k 100k
1m 10m
ILIKE 9.0 111
629 4247
UPPER 6.6 72
448 3175
LOWER 7.1 71
462 2962
}}}

https://stackoverflow.com/questions/20336665/lower-like-vs-ilike question
shows similar difference, but notes that everything could be more
complicated.

Indead, in my environment, ILIKE shows lower performance and I have no
explanation about that.

However, I would call this results implementation detail as well, because
the main question is using "standard" operators and predictable
interactions rather than performance. If one is concerned about
performance, it could use an index. I also understand, that starting from
3.2 it could be solved by custom operator in Index, but how many people
would expect that default won't work? And what about indices created
before 3.2? Especially when docs clearly tells (with shady SQL equivalent
term) that "icontains" will use ILIKE. I guess this either could be
changed to ILIKE (with performance penalty and broken existing custom
UPPER indices), either heavily documented - icontains lookup should
clearly tells which query it will use and which index should be used. Also
GinIndex/GistIndex could use a note about TEXT fields (that you probably
want to use UPPER function for text).

--
Ticket URL: <https://code.djangoproject.com/ticket/32803>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

Django

unread,

Jun 2, 2021, 2:03:06 AM6/2/21

to django-...@googlegroups.com

#32803: Not obvious interaction between Postgresql trgm index and icontains lookup
-------------------------------------+-------------------------------------

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Changes (by Mariusz Felisiak):

* owner: (none) => nobody
* resolution: => duplicate
* status: new => closed
* component: contrib.postgres => Database layer (models, ORM)
* type: Uncategorized => New feature