Stopword list

evez...@gmail.com

Mar 16, 2016, 9:14:24 AM

to nltk-users

I used the NLTK stopword list in my dissertation, and it is important that I know how it was derived, but I cannot find this information anywhere. I would be very appreciative if someone could share it or point me in the right direction. Thanks.

Francis Bond

Mar 16, 2016, 9:23:45 AM

to nltk-users

>>> print(nltk.corpus.stopwords.readme())
Stopwords Corpus

This corpus contains lists of stop words for several languages. These
are high-frequency grammatical words which are usually ignored in text
retrieval applications.

They were obtained from:
http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/

>>>

I hope this helps.
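
For anyone following along, here is a minimal sketch of inspecting the list and its readme from a script rather than the interactive prompt. It assumes NLTK is installed and the stopwords corpus has already been downloaded; the helper name `describe_stopwords` is made up for illustration and is not part of NLTK.

```python
def describe_stopwords(language="english"):
    """Summarize the NLTK stopword list for one language.

    Hypothetical helper: returns None when NLTK or the stopwords
    corpus is not available on this machine.
    """
    try:
        from nltk.corpus import stopwords
        words = stopwords.words(language)  # list of stop words
        readme = stopwords.readme()        # the provenance note shown above
    except (ImportError, LookupError):
        return None
    return {"count": len(words), "sample": words[:10], "readme": readme}


info = describe_stopwords()
if info is None:
    print("NLTK or the stopwords corpus is not installed.")
else:
    print(info["count"], "stop words, starting with:", info["sample"])
    print(info["readme"])
```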

--
Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
Division of Linguistics and Multilingual Studies
Nanyang Technological University

Alexis

Mar 16, 2016, 10:49:49 AM

to nltk-...@googlegroups.com

The NLTK book credits the stopword list to "Porter et al." with no further elaboration (there's no corresponding reference). Do you just need something you can cite, or are you after the criteria for including words in the stopword list? A quick Google search brought me to the Snowball website, which will help you a bit with both. I believe the list was basically hand-compiled, with exclusion of irregular word forms as the primary goal (in order to provide better input to the stemmer). But if you're after the exact process, you may have to contact Martin Porter directly. Good luck.

Alexis

PS. Here's the stopword list for English, with comments: http://snowball.tartarus.org/algorithms/english/stop.txt
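
In case it helps with comparing that commented file against what NLTK ships, here is a rough sketch of stripping the comments out. It assumes the layout is one stop word at the start of a line, with everything after a vertical bar being a comment. The function name and the sample text are made up for illustration; the sample is not the actual file contents.

```python
def parse_snowball_stoplist(text):
    """Return the stop words from text in the assumed Snowball layout."""
    words = []
    for line in text.splitlines():
        # Everything after a vertical bar is treated as a comment.
        entry = line.split("|", 1)[0].strip()
        if entry:
            # The stop word is the first token on the line.
            words.append(entry.split()[0])
    return words


# Hypothetical excerpt in the same layout, not the real stop.txt.
sample = """\
 | A sample stop word list. Comments begin with a vertical bar.

i
me             | object form
my

 | forms of BE
is
are
"""

stop_words = parse_snowball_stoplist(sample)
print(stop_words)
```

The resulting list could then be compared with `nltk.corpus.stopwords.words('english')` using ordinary set operations to see where the two differ.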

Sneha Jha

Mar 16, 2016, 11:15:08 AM

to nltk-...@googlegroups.com

I believe "Porter et al." refers to the original Porter stemmer paper: Porter, M.F. (1980): An algorithm for suffix stripping. Program 14 (3): 130-137. The "et al." is confusing to me, though. I remember being told that the English stopwords the stemmer used came from a different source, likely this one: "Information Retrieval" by C. J. Van Rijsbergen (Butterworths, London, 1979).
