Learning how to use stopwords in a frequency distribution

Carlos Araya

Feb 10, 2017, 1:48:35 AM

to nltk-users

I'm learning how to work with NLTK and I'm hitting an error with the code below.

The error is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
TypeError: argument of type 'WordListCorpusReader' is not iterable

The error is in the filteredList comprehension, but I can't figure out why it fails or how to write it so that it works.

Any help is appreciated, as is a pointer to docs I can work through.

Thanks

Carlos

#!/usr/bin/env python3

# Import the json module
import json

# Load the books we downloaded
from nltk.book import *

# Import the stopwords corpus
from nltk.corpus import stopwords

# Build a set of English stopwords
stop = set(stopwords.words('english'))

# Create a frequency distribution for Moby Dick (text1)
fdist1 = FreqDist(text1)

# Create a list of the 200 most common words in Moby Dick
mostCommon = fdist1.most_common(200)

# Print out most common
# print(mostCommon)

# Filter out the stopwords (this is the line that raises the TypeError)
filteredList = [w for w[0] in mostCommon if w not in stopwords]

# Write output to file
with open('cloud.json', "w") as f:
    f.write(json.dumps(filteredList, indent=2))

Constantin Orăsan

Feb 10, 2017, 4:20:58 AM

to nltk-users

Hello,

If I understand correctly what you are trying to do, you need to change the way filteredList is initialised:

filteredList = [w for w in mostCommon if w[0] not in stop]
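For anyone reading later, here is a minimal sketch of why this version works. The original tested membership against `stopwords`, which is the corpus reader object rather than the `stop` set, and `for w[0] in mostCommon` assigns each item to element 0 of an existing `w` instead of binding a new loop variable. `most_common()` returns (word, count) pairs, so the word is at index 0. The tokens and the two-word stop set below are made-up stand-ins so the example runs without any NLTK data; lowercasing the word before the lookup is my own addition, on the assumption that the stop list is all lowercase.

```python
from collections import Counter

# Hypothetical stand-ins for text1 and the NLTK English stop list
tokens = "the whale and the sea and the whale saw the whale".split()
stop = {"the", "and"}

# Like FreqDist.most_common(), this yields (word, count) tuples
most_common = Counter(tokens).most_common(3)

# Bind the whole tuple to the loop variable and test only the word part
filtered = [pair for pair in most_common if pair[0] not in stop]

# Same idea with tuple unpacking; lower() is an added assumption so that
# capitalised tokens such as "The" would also be matched
filtered_unpacked = [
    (word, count) for word, count in most_common if word.lower() not in stop
]

print(filtered)
```

The result is still a list of (word, count) pairs, which `json.dumps` writes out as nested arrays.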

Regards,

Constantin

Mazhar Dootio

Feb 13, 2017, 2:50:49 PM

to nltk-users

Hello everyone,

I am new to NLTK but comfortable working in Python 3.

At this stage I am working on analyzing a corpus in the Sindhi language. Sindhi is similar to Arabic. Could you help me develop stopword and stemming resources for Sindhi in NLTK? I need a complete tutorial on developing Sindhi stopword lists and a Sindhi stemmer for NLTK.

Dimitriadis, A. (Alexis)

Feb 14, 2017, 4:55:43 AM

to nltk-...@googlegroups.com

Hi Mazhar,

To make a stopword list, take a corpus of your language and rank the words by frequency. Then read down the list of most frequent words and select the ones you want to treat as stopwords.
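The ranking step could be sketched roughly as follows. This is only an illustration: `rank_by_frequency`, the sample sentence, and the hand-picked set are all hypothetical, and the `\w+` tokenization is an assumption that will probably need replacing with something suited to Sindhi script (it splits on combining diacritics, for instance).

```python
import re
from collections import Counter


def rank_by_frequency(text, top_n=10):
    """Return the top_n (word, count) pairs, most frequent first."""
    # Naive tokenization: runs of word characters, lowercased
    words = re.findall(r"\w+", text.lower())
    return Counter(words).most_common(top_n)


# Placeholder corpus; in practice this would be your own text collection
sample_text = "the cat sat on the mat and the dog sat on the rug"

candidates = rank_by_frequency(sample_text, top_n=3)
for word, count in candidates:
    print(word, count)

# The selection itself is a manual judgment call: read down the ranked
# list and keep only the words you want to treat as stopwords
custom_stopwords = {"the", "on"}
```

Frequency alone will not decide the list for you. Content words that happen to be common in your corpus show up near the top too, which is why the last step is done by hand.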

Nobody can tell you how to write a stemmer for a new language. Look for scholarly papers describing stemming algorithms for related major languages like Gujarati, or for open-source software that you could adapt or reimplement in Python. I would start with the "snowball stemmer", since NLTK includes an implementation of it.

I recommend you start by studying the NLTK book, which describes some related algorithms and tools, and by improving your knowledge of Python. Come back to this list when you have specific questions.

Best,

Alexis
