Learning how to use stopwords in a frequency distribution


Carlos Araya

Feb 10, 2017, 1:48:35 AM
to nltk-users
I'm learning how to work with NLTK and I'm hitting the following error with the code below

The error is 

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
TypeError: argument of type 'WordListCorpusReader' is not iterable


The error is in the filteredList comprehension, but I can't figure out why it happens or how to write it so that it actually works.

Any help is appreciated, or a pointer to where I can find docs to work through this.

Thanks

Carlos

#!/usr/bin/env python3

# Imports json module
import json
# Loads the books we downloaded
from nltk.book import *
# Import stopwords list for English
from nltk.corpus import stopwords

# Set the stopwords words to English
stop = set(stopwords.words('english'))
# Creates a frequency distribution for Moby Dick (text1)
fdist1 = FreqDist(text1)

# Creates a list of the 200 most common words on Moby Dick
mostCommon = fdist1.most_common(200)

# Print out most common
# print(mostCommon)


filteredList = [w for w[0] in mostCommon if w not in stopwords]

# Write output to file
with open('cloud.json', "w") as f:
    f.write(json.dumps(filteredList, indent=2))


Constantin Orăsan

Feb 10, 2017, 4:20:58 AM
to nltk-users
Hello,

If I understand correctly what you are trying to do, you need to change the way filteredList is initialised

filteredList = [w for w in mostCommon if w[0] not in stop]
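
To spell out why the original comprehension fails, here is a minimal sketch that needs no NLTK data. It assumes `collections.Counter` as a stand-in for `FreqDist` (which provides the same `most_common` method), and the token list and the small `stop` set are made-up placeholders. `most_common` returns `(word, count)` tuples, so the word to test is `w[0]`, and the membership test has to use the `stop` set rather than the `stopwords` corpus reader object, which is what raised the TypeError.

```python
from collections import Counter

# Placeholder tokens and stopword set (assumptions for illustration only;
# in the original script these come from text1 and stopwords.words('english')).
tokens = "the whale and the sea and the ship the whale".split()
stop = {"the", "and"}

# Counter stands in for FreqDist here; both provide most_common().
fdist = Counter(tokens)
most_common = fdist.most_common(5)  # list of (word, count) tuples

# Keep the (word, count) pairs whose word is not a stopword.
filtered_list = [pair for pair in most_common if pair[0] not in stop]

# Variant that unpacks each tuple and keeps only the words.
words_only = [word for word, count in most_common if word not in stop]

print(filtered_list)
print(words_only)
```

Note that the English stopword list in NLTK is lowercase, so capitalised tokens such as "The" would pass the filter unless the test is written as `w[0].lower() not in stop`.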

Regards,

Constantin



Mazhar Dootio

Feb 13, 2017, 2:50:49 PM
to nltk-users
Hello everyone,
I am new to NLTK but work well with Python 3.
At this stage I am working on the Sindhi language to analyze a corpus. Sindhi is like the Arabic language. Could you help me develop NLTK stopwords and stemming for Sindhi? I need a complete tutorial for developing Sindhi stopword and stemming resources for NLTK.

Dimitriadis, A. (Alexis)

Feb 14, 2017, 4:55:43 AM
to nltk-...@googlegroups.com
Hi Mazhar,

To make a stopword list, take a corpus of your language and rank the words by frequency. Then read down the list of most frequent words and select the ones you want to treat as stopwords.
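
The frequency-ranking step could be sketched roughly as follows. This is only an illustration: the function name `stopword_candidates`, the `top_n` parameter, and the tiny English sample are hypothetical, and whitespace splitting is a placeholder for whatever tokenization suits your corpus.

```python
from collections import Counter

def stopword_candidates(tokens, top_n=50):
    """Return the top_n most frequent tokens with their counts.

    The result is a list to review by hand, not a finished stopword list.
    """
    counts = Counter(tokens)
    return counts.most_common(top_n)

# Tiny made-up sample; a real corpus would be read from files.
sample_text = "the cat sat on the mat and the dog sat on the rug"
tokens = sample_text.split()  # placeholder tokenization

candidates = stopword_candidates(tokens, top_n=3)
for word, count in candidates:
    print(word, count)
```

In this sample "sat" ranks as high as "on", which shows why the final selection is made by reading down the list rather than by taking the top entries automatically.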

Nobody can tell you how to write a stemmer for a new language. Look for scholarly papers describing stemming algorithms for related major languages like Gujarati, or for open-source software that you could adapt or reimplement in Python. I would start with the "snowball stemmer", since NLTK includes an implementation of it.

I recommend you start by studying the NLTK book, which describes some related algorithms and tools, and by improving your knowledge of Python. Come back to this list when you have specific questions.

Best,

Alexis

Dr. Alexis Dimitriadis | Assistant Professor and Senior Research Fellow | Utrecht Institute of Linguistics OTS | Utrecht University | Trans 10, 3512 JK Utrecht, room 2.33 | +31 30 253 65 68 | a.dimi...@uu.nl | www.hum.uu.nl/medewerkers/a.dimitriadis