You're right - Elasticsearch gets its default list of stopwords from Apache Lucene, which I believe are as follows for English:
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
I'm not sure about version 2.4, since it uses Elasticsearch 1.7, but in ES 5.6 (used in AtoM 2.6) and later, there is a way to create your own custom list of stopwords. I do strongly recommend that you consider upgrading if possible, since there have been a number of security patches, performance enhancements, and bug fixes in subsequent releases. As you've noted, 2.6 also has some minor improvements to how stopwords are handled (for example, #12186), but you may still want to further customize the settings yourself.
Here's the ES 5.6 documentation on the Stop Analyzer:
Those docs note a couple of configuration parameters:
- stopwords: A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _english_.
- stopwords_path: The path to a file containing stop words. This path is relative to the Elasticsearch config directory.
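As a rough sketch of how those two parameters fit into an analyzer definition, here is a small Python helper that builds the JSON settings for an analyzer of type "stop". The helper name, the analyzer name (my_stop_analyzer) and the file path (stopwords/custom.txt) are placeholders I made up for illustration, and requiring exactly one of the two parameters is this sketch's choice rather than an Elasticsearch rule. AtoM keeps its search settings in its own configuration, so the nesting there may differ:

```python
import json


def stop_analyzer_settings(name, stopwords=None, stopwords_path=None):
    """Build index settings defining one analyzer of type "stop".

    Hypothetical helper: pass either an inline list (or a pre-defined
    list name such as "_english_") via `stopwords`, or a file path
    relative to the Elasticsearch config directory via `stopwords_path`.
    """
    if (stopwords is None) == (stopwords_path is None):
        raise ValueError("pass exactly one of stopwords or stopwords_path")

    analyzer = {"type": "stop"}
    if stopwords is not None:
        analyzer["stopwords"] = stopwords
    else:
        analyzer["stopwords_path"] = stopwords_path

    return {"settings": {"analysis": {"analyzer": {name: analyzer}}}}


# Inline list of stopwords
inline = stop_analyzer_settings("my_stop_analyzer", stopwords=["the", "over"])

# Stopwords read from a file (assumed here to hold one term per line)
from_file = stop_analyzer_settings(
    "my_stop_analyzer", stopwords_path="stopwords/custom.txt"
)

print(json.dumps(inline, indent=2))
print(json.dumps(from_file, indent=2))
```

The printed JSON is only the analyzer definition; it still has to be placed wherever your installation defines its index settings.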
Further down the same page, the docs also give an example, where you can see what configuring your own list of stopwords in-line looks like:
"stopwords": ["the", "over"]
I believe that you could try configuring this in AtoM, or the path to a separate stopwords file with a list of terms (using stopwords_path), here:
Be sure to restart Elasticsearch after making changes. On Ubuntu 16.04 or 18.04:
- sudo systemctl restart elasticsearch
Hope this helps!
Cheers,