Missing English stopwords

Craig Hagerman

Jan 6, 2011, 6:37:21 AM
to nltk-...@googlegroups.com

Hi,

In 'Natural Language Processing with Python' there is an example of using nltk.corpus.stopwords on page 60-6. When I do the same I get a slightly different result for the stopwords. It seems the English stopwords list contains only 127 words. This seems rather small to me, and I found indications on the web that the list used to be over 500 words. What has happened? Why were so many words cut? How can I add them back in?

Craig

Richard Careaga

Jan 6, 2011, 10:36:48 AM
to nltk-...@googlegroups.com
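from nltk.corpus import stopwords

# Start from your own extra words, then add NLTK's English list.
stops = stopwords.words('english')
mystops = ['rather', 'word']
mystops.extend(stops)
# X is your list of tokens; keep lowercased alphabetic tokens that are not stopwords.
nonstopwords = [w.lower() for w in X if w.isalpha() and w.lower() not in mystops]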



Craig Hagerman

Jan 6, 2011, 10:44:42 AM
to nltk-...@googlegroups.com
Thanks Richard,

But that doesn't really answer the question of where all the missing
stopwords have gone. If I follow your procedure I have to type in a
few hundred by hand which, presumably, are still in a list somewhere.
If this is the way I have to do things, does anyone have a list of
stopwords I can copy?

Craig

Richard Careaga

Jan 6, 2011, 11:28:22 AM
to nltk-...@googlegroups.com
One man's stopword is another man's signifier, I guess. You want to filter more; others may not want to miss as much. As for typing by hand: if you want to construct your own list, tokenize some likely text, sort by frequency, take the top several hundred or so, and edit out the words you want to keep as signifiers. There's your mystop, which you can put on your PYTHONPATH to be available to supplement the nltk package.
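
A minimal sketch of the frequency-based approach, for illustration only. It assumes a crude regex tokenizer from the standard library in place of an NLTK tokenizer, and the function name `candidate_stopwords`, its parameters, and the sample text are all hypothetical; in practice you would feed it a large body of text and a `top_n` in the hundreds.

```python
import re
from collections import Counter


def candidate_stopwords(text, top_n=300, keep=()):
    """Return the top_n most frequent words in text, minus the words in keep."""
    # Crude tokenizer (an assumption): lowercase the text, take runs of letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Words you want to retain as signifiers are edited out of the result.
    keep = set(keep)
    return [word for word, _ in counts.most_common(top_n) if word not in keep]


# Tiny made-up sample; a real run would use a much larger text.
sample = "the cat sat on the mat and the dog sat on the log"
mystops = candidate_stopwords(sample, top_n=3, keep={"sat"})
print(mystops)
```

The resulting list can be reviewed by hand and then combined with stopwords.words('english') in the same way as the earlier snippet.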


