Most efficient and scalable search strategy

27 views
Skip to first unread message

Mike Dewhirst

unread,
May 23, 2017, 5:15:35 AM5/23/17
to Django users
Django 1.10, Postgres 9.1 (for the moment), Python 2.7 (for the moment)
on Ubuntu 16.04

The search I need is an intersection hit.

We have a substance model with fields for ...

- name
- molecular formula with subscripted unicode numbers
- CAS number

plus a Synonym model in 1:n with substance and there might typically be
half a dozen synonyms per substance.

We also have a separateReference_Ozone table containing all the ozone
depleting substances listed in the Montreal Protocol imported from a csv
file. Fields include ...

- abbreviation,
- molecular formula with subscripted unicode numbers,
- chemical isomer name
- common chemical name

If any of those Substance fields matches any of those Reference_Ozone
field we have a hit and that substance is subject to the Montreal Protocol.

In addition, we have to incude ingredient substances and their synonyms
if our substance is a mixture.

I have looked at SearchVector and thought about SearchVectorField. I
have also looked at the Postgres docs regarding triggers and decided
against that path. If we do it it will have to be done on save() in the
app.

Since we import the Montreal Protocol substances and there are fewer
than 400 of them it occurs to me that during import I could make a
single (large) field which contains the contents of those four other
fields. The data is almost static for my purposes. There might be new
substances added but only rarely. It is a read-only table.

I could do the same for substances but the idea is that they will scale
beyond millions. I'd prefer not to do a similar dedicated search field
for them because it would need updating on every save.

We might also have to deal with English/US spelling issues such as 'ph'
versus 'f'. I haven't examined the content in detail yet.

So the question is, how to quickly/efficiently detect that a substance
is subject to the Montreal Protocol?

Thanks for any pointers

Mike

Avraham Serour

unread,
May 24, 2017, 10:54:48 AM5/24/17
to django-users
take a look at the postgres fulltext search capabilities, I can't make any promises for your old postgres version

another common approach is to index the search into elasticsearch and querying elasticsearch for the searches, it should handles misspellings and return the search results fast, but you may need to have a cluster of them with at least a couple of machines. I worked on a system once with single digits millions of entries with 6 machines (VM) running elasticsearch in a cluster, search results weren't fast but was reasonable. Once we had to re-index everything and it took a couple of days. We used django haystack, you should take a look at that. The problem with this approach is that you basically have a hook on the django model which copies some data over elasticsearch everytime you save()




Mike

--
You received this message because you are subscribed to the Google Groups "Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-users+unsubscribe@googlegroups.com.
To post to this group, send email to django...@googlegroups.com.
Visit this group at https://groups.google.com/group/django-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/065647f9-7347-7d75-d817-83ea795006ed%40dewhirst.com.au.
For more options, visit https://groups.google.com/d/optout.

Reply all
Reply to author
Forward
0 new messages