to nltk-users
I have made use of the NLTK stopword list in my dissertation, but it is rather important that I know how it was derived, and I cannot find this information anywhere. I would be VERY appreciative if someone could share this information or point me in the right direction. Thanks.
Francis Bond
Mar 16, 2016, 9:23:45 AM
to nltk-users
>>> print(nltk.corpus.stopwords.readme())
Stopwords Corpus
This corpus contains lists of stop words for several languages. These
are high-frequency grammatical words which are usually ignored in text
retrieval applications.
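For anyone who wants to look at the list itself rather than just the readme, here is a minimal sketch. It assumes NLTK is installed and the stopwords corpus has already been downloaded (for example with `nltk.download("stopwords")`). The helper names `load_stopwords` and `remove_stopwords` and the tiny fallback tuple are hypothetical, made up for illustration; the fallback is not the NLTK list and says nothing about how that list was derived.

```python
def load_stopwords(language="english", fallback=("the", "a", "an", "of", "and")):
    """Return (stop_set, source).

    Tries the NLTK stopwords corpus first. If NLTK or the corpus data is
    not available, falls back to a tiny illustrative list (an assumption
    for this sketch, NOT the NLTK list).
    """
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words(language)), "nltk"
    except (ImportError, LookupError, OSError):
        # ImportError: NLTK not installed.
        # LookupError / OSError: corpus or language file not found.
        return set(fallback), "fallback"


def remove_stopwords(tokens, stop_set):
    """Drop tokens whose lowercase form is in stop_set."""
    return [tok for tok in tokens if tok.lower() not in stop_set]


stop_set, source = load_stopwords()
print("source:", source, "| size:", len(stop_set))
print(remove_stopwords("The origin of the list".split(), stop_set))
```

With the corpus installed, `nltk.corpus.stopwords.fileids()` should also list the languages that ship with it.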
to nltk-...@googlegroups.com
The NLTK book credits the stopword list to "Porter et al.", with no further elaboration (there's no corresponding reference). Do you just need something you can cite, or are you after information on the criteria for including words in the stopword list? A quick Google search brought me to the Snowball website, which will help you a bit with both. I believe the list was basically hand-compiled, with exclusion of irregular word forms as the primary goal (in order to provide better input to the stemmer). But if you're after the exact process, you may have to contact Martin Porter directly. Good luck.
Dr. Alexis Dimitriadis | Assistant Professor and Senior Research Fellow | Utrecht Institute of Linguistics OTS | Utrecht University | Trans 10, 3512 JK Utrecht, room 2.33 | +31 30 253 65 68 | a.dimi...@uu.nl | www.hum.uu.nl/medewerkers/a.dimitriadis
to nltk-...@googlegroups.com
The "Porter et al." refers to the original Porter stemmer paper, I believe: Porter, M.F. (1980), "An algorithm for suffix stripping", Program 14(3): 130–137. The "et al." is confusing to me, though. I remember being told that the English stopwords the stemmer used came from a different source, likely this one: "Information Retrieval" by C. J. Van Rijsbergen (Butterworths, London, 1979).