You're right - Elasticsearch gets its default list of stopwords from Apache Lucene, which I believe are as follows for English:
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
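To make the effect of that list concrete, here is a rough Python sketch of what stopword filtering does to a search phrase. It is only an illustration: the function name `remove_stopwords` is made up for this example, and the whitespace split is a simplification, not how Lucene actually tokenizes text.

```python
# Default English stopwords, copied from the list above.
ENGLISH_STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
    "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
    "such", "that", "the", "their", "then", "there", "these", "they",
    "this", "to", "was", "will", "with",
}


def remove_stopwords(text, stopwords=ENGLISH_STOPWORDS):
    """Lowercase the text, split on whitespace, and drop stopwords.

    Hypothetical helper for illustration only; real analyzers also
    handle punctuation and other tokenization details.
    """
    return [term for term in text.lower().split() if term not in stopwords]


print(remove_stopwords("The history of the city"))
```

So a search for "The history of the city" effectively becomes a search for "history" and "city", which is why a title made up mostly of stopwords can be hard to find.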
I'm not sure about AtoM 2.4, since it uses Elasticsearch 1.7, but in ES 5.6 (used in AtoM 2.6) and later, there is a way to create your own custom list of stopwords. I strongly recommend upgrading if possible, since there have been a number of security patches, performance enhancements, and bug fixes in subsequent releases. As you've noted, 2.6 also has some minor improvements to how stopwords are handled (for example, #12186), but you may still want to further customize the settings yourself.
Here's the ES 5.6 documentation on the Stop Analyzer:
Those docs note a couple of configuration parameters:
- stopwords: A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _english_.
- stopwords_path: The path to a file containing stop words. This path is relative to the Elasticsearch config directory.
Below on the same page, it also gives an example, where you can see what configuring your own list of stopwords in-line looks like:
"stopwords": ["the", "over"]
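For context, that line sits inside an analyzer definition of type `stop` in the index settings. Here is a small Python sketch that builds that settings structure for either option (an in-line list, or a `stopwords_path`). The helper name `stop_analyzer_settings`, the analyzer name `my_stop_analyzer`, and the file name `stopwords/custom.txt` are all placeholders I've chosen for illustration, and how this maps onto AtoM's own configuration file is something you would need to check.

```python
import json


def stop_analyzer_settings(name, stopwords=None, stopwords_path=None):
    """Build index settings for a 'stop' analyzer (hypothetical helper).

    Pass exactly one of:
      stopwords       a list of terms, or a pre-defined list such as "_english_"
      stopwords_path  a file path relative to the Elasticsearch config directory
    """
    if (stopwords is None) == (stopwords_path is None):
        raise ValueError("pass exactly one of stopwords or stopwords_path")

    analyzer = {"type": "stop"}
    if stopwords is not None:
        analyzer["stopwords"] = stopwords
    else:
        analyzer["stopwords_path"] = stopwords_path

    return {"settings": {"analysis": {"analyzer": {name: analyzer}}}}


# In-line list, matching the example line above.
inline = stop_analyzer_settings("my_stop_analyzer", stopwords=["the", "over"])
print(json.dumps(inline, indent=2))

# Same analyzer, but reading terms from a file (placeholder path).
from_file = stop_analyzer_settings(
    "my_stop_analyzer", stopwords_path="stopwords/custom.txt"
)
print(json.dumps(from_file, indent=2))
```

The printed JSON is just a way to see the nesting; the key names (`type`, `stopwords`, `stopwords_path`) are the ones from the ES docs quoted above.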
I believe that you could try configuring this in AtoM, either as an in-line list or as the path to a separate stopwords file with a list of terms (using stopwords_path), here:
Be sure to restart Elasticsearch after making changes. On Ubuntu 16.04 or 18.04:
- sudo systemctl restart elasticsearch
Hope this helps!