Stopwords

Mark Smith

Apr 15, 2022, 6:28:47 AM

to AtoM Users

I have noticed today that the word "will" is one of the stopwords Elasticsearch doesn't index. This isn't terribly handy when looking for a will (as in Last Will and Testament) belonging to a particular person.

I realise that turning off all stopwords might not be the best idea, as the size of the index will increase massively. But does anyone have any experience of altering the list of stopwords, or of maintaining a custom list of their own? I'm not sure how straightforward this is, but it would be good to hear how other archives get round this problem.

By the way, we're running 2.4 at the moment. I see that 2.6.3 has some modifications to how stopwords are handled. Perhaps this might help with this problem?

Dan Gillean

Apr 18, 2022, 8:37:38 AM

to ICA-AtoM Users

Hi Mark,

You're right - Elasticsearch gets its default list of stopwords from Apache Lucene. I believe the English list is as follows:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
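To make the effect concrete, here is a rough Python sketch of what a stop filter does to a title at index time. It is only a toy stand-in, not Elasticsearch's actual analyzer: the `analyze` helper and its simple tokenization are assumptions made for illustration, and the word list is the one quoted above.

```python
import re

# Default English stopwords, as quoted in the list above.
DEFAULT_STOPWORDS = set(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)


def analyze(text, stopwords):
    """Toy analyzer: lowercase, split on non-alphanumerics, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stopwords]


title = "Last Will and Testament of John Smith"

# With the default list, "will" never reaches the index.
print(analyze(title, DEFAULT_STOPWORDS))
# ['last', 'testament', 'john', 'smith']

# With "will" removed from the list, it is kept as a searchable term.
print(analyze(title, DEFAULT_STOPWORDS - {"will"}))
# ['last', 'will', 'testament', 'john', 'smith']
```

The same filtering is normally applied to the search query as well, which is why a search for "will" on its own can come back empty.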

I'm not sure about version 2.4, since this uses Elasticsearch 1.7, but in ES 5.6 (used in AtoM 2.6) and later, there is a way to create your own custom list of stopwords. I do strongly recommend that you consider upgrading if possible, since there have been a number of security patches, performance enhancements, and bug fixes in subsequent releases. As you've noted, 2.6 also has some minor improvements to how stopwords are handled (for example, #12186), but you may still want to further customize the settings yourself.

Here's the ES 5.6 documentation on the Stop Analyzer:

Those docs note a couple of configuration parameters:

stopwords: A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _english_.
stopwords_path: The path to a file containing stop words. This path is relative to the Elasticsearch config directory.
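If you go the stopwords_path route, the file itself could be generated with something like the sketch below. It assumes the file format is one stopword per line in UTF-8, which is my understanding of what Elasticsearch expects; the file name `custom_stopwords.txt` and the `write_stopwords_file` helper are made up for this example, and the demo writes to a temporary directory rather than a real Elasticsearch config directory.

```python
import tempfile
from pathlib import Path

# Default English stopwords, as quoted earlier in this reply.
DEFAULT_STOPWORDS = (
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with"
).split()


def write_stopwords_file(path, words):
    """Write one stopword per line, UTF-8 encoded (assumed file format)."""
    Path(path).write_text("\n".join(words) + "\n", encoding="utf-8")


# Keep every default stopword except "will".
custom_stopwords = [w for w in DEFAULT_STOPWORDS if w != "will"]

# Hypothetical file name; on a real server the file would need to sit
# somewhere under the Elasticsearch config directory.
target = Path(tempfile.mkdtemp()) / "custom_stopwords.txt"
write_stopwords_file(target, custom_stopwords)

written = target.read_text(encoding="utf-8").splitlines()
print(len(written), "stopwords written to", target)
```

The value given to stopwords_path would then be the file's location relative to the Elasticsearch config directory, as the docs quoted above describe.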

Further down the same page there is an example showing what an in-line list of custom stopwords looks like:

"stopwords": ["the", "over"]

I also found a third-party article here that may help: https://javadeveloperzone.com/elastic-search/configure-stopwords-in-elastic-search/
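To adapt the in-line form to the "will" case, a small script can build the list and print it as JSON. This is a sketch only: the analyzer name `my_stop_analyzer` is a placeholder, and the `"type": "stop"` shape follows the Stop Analyzer parameters described above. AtoM's search.yml is written in YAML and may define its analyzers and filters under different names, so treat the output as a guide to the shape of the setting rather than something to paste in directly.

```python
import json

# Default English stopwords, as quoted earlier in this reply.
DEFAULT_STOPWORDS = (
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with"
).split()

# Custom list: the defaults minus "will".
custom_stopwords = [w for w in DEFAULT_STOPWORDS if w != "will"]

# "my_stop_analyzer" is a placeholder name, not something AtoM defines.
settings = {
    "analysis": {
        "analyzer": {
            "my_stop_analyzer": {
                "type": "stop",
                "stopwords": custom_stopwords,
            }
        }
    }
}

print(json.dumps(settings, indent=2))
```

Keeping the rest of the default list means the index should not grow much, since only the one extra term is indexed.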

I believe that you could try configuring either an in-line list, or the path to a separate stopwords file with a list of terms (using stopwords_path), in AtoM's search configuration, here:

https://github.com/artefactual/atom/blob/HEAD/plugins/arElasticSearchPlugin/config/search.yml#L181

Be sure to restart Elasticsearch after making changes. On Ubuntu 16.04 or 18.04:

sudo systemctl restart elasticsearch

For good measure, I'd also suggest clearing the application cache and restarting PHP-FPM after making changes.

Hope this helps!

Cheers,

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056

@accesstomemory

he / him
