Currently it is possible to filter a queryset on the basis of trigram
similarity between the search string and full text stored in a column.
{{{#!python
from django.contrib.postgres.search import TrigramSimilarity
Author.objects.annotate(
    similarity=TrigramSimilarity('name', test),
).filter(similarity__gt=0.3)
}}}
This is a wrapper around the '''similarity''' function of the
'''pg_trgm''' extension. It allows comparing full strings, e.g.
searching for 'doge' would find 'dogs' or 'dogge', but it is useless for
fuzzy searching of substrings.
{{{
SELECT similarity('dogge', 'doge');
---------
0.57
SELECT similarity('dogge', 'dogecoin is following bitcoin');
------------
0.1
}}}
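To see why the second score is so low, here is a minimal pure-Python sketch of the trigram comparison. It assumes the behaviour described in the pg_trgm documentation (lowercase, split on non-alphanumeric characters, pad each word with two leading spaces and one trailing space, then divide shared trigrams by all distinct trigrams). The helper names `trigrams` and `similarity` are made up for illustration, and the ASCII-only word pattern is a simplification, so this is not the extension's actual implementation.

```python
import re


def trigrams(text):
    """Return the set of trigrams of ``text``, roughly as pg_trgm extracts them."""
    found = set()
    # Simplification: only ASCII letters and digits count as word characters.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        found.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return found


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(round(similarity("dogge", "doge"), 2))  # 0.57
print(round(similarity("dogge", "dogecoin is following bitcoin"), 2))  # 0.1
```

Every extra word in the stored text adds trigrams to the denominator, so a good match on a single word is diluted by the rest of the sentence.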
'''word_similarity''' does take word boundaries into account:
{{{
SELECT word_similarity('dogge', 'dogecoin is following bitcoin');
--------------
0.5
}}}
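The idea behind '''word_similarity''' can be approximated in a few lines of Python. The sketch below scores the search string against every contiguous run of trigrams in the stored text and keeps the best score. This is a brute-force simplification written for illustration (the names `ordered_trigrams` and `word_similarity_sketch` are hypothetical), and pg_trgm's real algorithm may differ in edge cases.

```python
import re


def ordered_trigrams(text):
    """List the trigrams of ``text`` in order of appearance."""
    result = []
    # Simplification: only ASCII letters and digits count as word characters.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        result.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def word_similarity_sketch(needle, haystack):
    """Best trigram overlap between ``needle`` and any contiguous run in ``haystack``."""
    wanted = set(ordered_trigrams(needle))
    sequence = ordered_trigrams(haystack)
    if not wanted:
        return 0.0
    best = 0.0
    for start in range(len(sequence)):
        window = set()
        for end in range(start, len(sequence)):
            window.add(sequence[end])
            score = len(wanted & window) / len(wanted | window)
            best = max(best, score)
    return best


print(word_similarity_sketch("dogge", "dogecoin is following bitcoin"))  # 0.5
print(word_similarity_sketch("word", "two words"))  # 0.8
```

Because only the best-matching run counts, the remaining words of the stored text no longer drag the score down.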
Adding a Django API for '''word_similarity''' would allow for better fuzzy
full-text search without the need to use either raw SQL or external tools
like Elasticsearch.
--
Ticket URL: <https://code.djangoproject.com/ticket/32492>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
--
Ticket URL: <https://code.djangoproject.com/ticket/32492#comment:1>
* owner: nobody => Taneli
* status: new => assigned
--
Ticket URL: <https://code.djangoproject.com/ticket/32492#comment:2>
* type: Uncategorized => New feature
* version: 3.1 => master
* component: Uncategorized => contrib.postgres
* stage: Unreviewed => Accepted
Comment:
Thanks. We should also add `TrigramWordDistance()` with the `<<->`
operator.
--
Ticket URL: <https://code.djangoproject.com/ticket/32492#comment:3>
Comment (by Matthew Schinckel):
FWIW, I use pg_trgm, and I also have a `connection_created` receiver that
sets `pg_trgm.similarity_threshold`. It might be nice to include the
ability to set this in some way, so that other operations that use this
can be included.
--
Ticket URL: <https://code.djangoproject.com/ticket/32492#comment:4>
* owner: Taneli => (none)
--
Ticket URL: <https://code.djangoproject.com/ticket/32492#comment:5>
* owner: (none) => Nikita Marchant
--
Ticket URL: <https://code.djangoproject.com/ticket/32492#comment:6>
Comment (by Nikita Marchant):
Hi 👋
I've opened a pull request (https://github.com/django/django/pull/14833)
implementing the `trigram_word_similar` lookup and updating the
documentation.
I also wrote a test, but for it to be meaningful, I needed a longer string
than the 16 chars available on `CharFieldModel.field` in the tests. I
increased the limit to 64 and updated the migration at the same time. Is
this the right approach, or should I create a new model or a new
migration? (As the test database is recreated almost every time, I guessed
that modifying a migration is not a big deal, but I could be wrong.)
I ran the tests on PostgreSQL 13.4 on Mac with Python 3.9.1.
I also have some code (adding `TrigramWordSimilarity` and
`TrigramWordDistance`) almost ready for the rest of the ticket, but I still
have some questions because the order of the arguments of
`WORD_SIMILARITY()` is meaningful, unlike for `SIMILARITY()`. Should we
discuss it here or on GitHub?
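To illustrate the argument-order point, here is a rough pure-Python sketch. It assumes a simplified model of pg_trgm (padded trigrams per word, the best contiguous run for the word variant, and distance taken as one minus the word similarity). All function names are hypothetical and the scores only approximate what PostgreSQL returns.

```python
import re


def ordered_trigrams(text):
    """List the trigrams of ``text`` in order of appearance."""
    result = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        result.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def similarity(a, b):
    """Symmetric score: shared trigrams over all distinct trigrams."""
    ta, tb = set(ordered_trigrams(a)), set(ordered_trigrams(b))
    return len(ta & tb) / len(ta | tb) if ta and tb else 0.0


def word_similarity_sketch(needle, haystack):
    """Asymmetric score: ``needle`` against the best contiguous run in ``haystack``."""
    wanted = set(ordered_trigrams(needle))
    sequence = ordered_trigrams(haystack)
    best = 0.0
    for start in range(len(sequence)):
        window = set()
        for end in range(start, len(sequence)):
            window.add(sequence[end])
            best = max(best, len(wanted & window) / len(wanted | window))
    return best if wanted else 0.0


def word_distance_sketch(needle, haystack):
    """Distance modelled as one minus the word similarity."""
    return 1 - word_similarity_sketch(needle, haystack)


text = "dogecoin is following bitcoin"
# Swapping the arguments leaves similarity() unchanged...
print(similarity("doge", text) == similarity(text, "doge"))  # True
# ...but changes the word-level score, so the search term must come first.
print(word_similarity_sketch("doge", text) > word_similarity_sketch(text, "doge"))  # True
```

Any Django wrapper therefore has to fix which argument is the search string and which is the column expression, whereas the plain similarity function does not care.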
--
Ticket URL: <https://code.djangoproject.com/ticket/32492#comment:7>
* has_patch: 0 => 1
--
Ticket URL: <https://code.djangoproject.com/ticket/32492#comment:8>
* needs_better_patch: 0 => 1
* needs_docs: 0 => 1
Comment:
[https://github.com/django/django/pull/14833 PR]
--
Ticket URL: <https://code.djangoproject.com/ticket/32492#comment:9>
* needs_better_patch: 1 => 0
* needs_docs: 1 => 0
--
Ticket URL: <https://code.djangoproject.com/ticket/32492#comment:10>
* stage: Accepted => Ready for checkin
--
Ticket URL: <https://code.djangoproject.com/ticket/32492#comment:11>
* cc: Paolo Melchiorre (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/32492#comment:12>
* status: assigned => closed
* resolution: => fixed
Comment:
In [changeset:"4e4082f9396e21de0bd88dbfc651da9ad01c7c0c" 4e4082f9]:
{{{
#!CommitTicketReference repository=""
revision="4e4082f9396e21de0bd88dbfc651da9ad01c7c0c"
Fixed #32492 -- Added TrigramWordSimilarity() and TrigramWordDistance() on
PostgreSQL.
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/32492#comment:13>