How to filter out bigram pairs that contain stopwords


typetoken

Aug 9, 2012, 1:24:20 AM
to nltk-...@googlegroups.com
Dear All, 

I wrote the following code to list the 50 most frequent bigrams, which are supposed to contain no stopwords (exercise 18, p. 76 in the NLTK book). However, it still yields pairs containing stopwords:

>>> import nltk
>>> def FreqBigram5(text):
    bigrams = nltk.bigrams(text)
    stopwords = nltk.corpus.stopwords.words('english')
    pairs = [p for p in bigrams if [w for w in p] not in stopwords]
    fdist = nltk.FreqDist(pairs)
    print fdist.keys()[:50]
>>> text = nltk.corpus.brown.words(categories = 'news')
>>> FreqBigram5(text)
[('of', 'the'), ('.', 'The'), ('in', 'the'), (',', 'and'), (',', 'the'), ("''", '.'), ('to', 'the'), ('on', 'the'), ('.', '``'), ('for', 'the'), ("''", ','), ('at', 'the'), ('.', 'He'), (';', ';'), ('will', 'be'), ('that', 'the'), ('and', 'the'), ('with', 'the'), (',', 'who'), (',', 'a'), (',', 'he'), ('in', 'a'), ('of', 'a'), ('by', 'the'), ('to', 'be'), ('.', 'In'), ('.', 'A'), ('.', 'But'), ('from', 'the'), ('.', 'It'), (',', '``'), ('for', 'a'), (',', 'but'), ('has', 'been'), ('as', 'a'), (',', 'in'), ('is', 'a'), ('said', '.'), (',', 'Mrs.'), ('and', 'Mrs.'), ('to', 'a'), ('with', 'a'), (',', 'which'), ('said', ','), ('the', 'first'), ('would', 'be'), (',', 'of'), (',', 'was'), ('is', 'the'), ('.', 'Mr.')]


Then I modified the code into the following version, but it still does not work. Any tips? Thanks indeed.
>>> def FreqBigram5(text):
    bigrams = nltk.bigrams(text)
    stopwords = nltk.corpus.stopwords.words('english')
    pairs = [p for p in bigrams for w in p if w.lower() not in stopwords]
    fdist = nltk.FreqDist(pairs)
    print fdist.keys()[:50]

>>> FreqBigram5(text)
[('.', 'The'), ("''", '.'), ('.', '``'), ("''", ','), (',', 'and'), (',', 'the'), (';', ';'), (',', '``'), ('.', 'He'), ('said', '.'), (',', 'Mrs.'), ('said', ','), ('.', 'Mr.'), (',', 'who'), (',', 'a'), (',', 'he'), ('.', 'In'), (',', 'said'), ('.', 'A'), ('New', 'York'), ('.', 'But'), ('per', 'cent'), ('.', 'It'), (',', 'but'), ('.', 'Mrs.'), (':', '``'), ('?', '?'), (',', 'in'), ('United', 'States'), ('and', 'Mrs.'), (',', 'which'), ('the', 'first'), ('year', '.'), ('would', 'be'), (',', 'of'), (',', 'however'), (',', 'was'), ('last', 'year'), (',', 'as'), ('however', ','), (',', 'Mr.'), ('.', 'They'), ('White', 'House'), ('last', 'week'), ('he', 'said'), ('Jr.', ','), (',', 'it'), ('.', 'This'), (',', 'with'), ('one', 'of')]

Many thanks.

Sincerely
T.T.

Kevin

Aug 9, 2012, 2:02:14 AM
to nltk-...@googlegroups.com
Change

pairs = [p for p in bigrams if [w for w in p] not in stopwords]

to

pairs = [tup for tup in bigrams if not False in [False for wrd in tup if wrd in stopwords]]
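
The condition `not False in [...]` parses as `not (False in [...])`: the inner list collects one `False` for every stopword found in the tuple, so a tuple is kept only when that list is empty. A more readable equivalent is `not any(...)`. The sketch below uses a toy stopword list (an assumption, not NLTK's real list) to check that the two forms agree:

```python
# Toy stopword list for illustration only; NLTK's English list is much longer.
stopwords = ['a', 'the', 'with']
bigrams = [('some', 'text'), ('text', 'with'), ('The', 'text')]

# Form suggested above: keep a tuple only if no False was collected.
kept_original = [tup for tup in bigrams
                 if not False in [False for wrd in tup if wrd in stopwords]]

# Equivalent form: keep a tuple only if none of its words is a stopword.
kept_any = [tup for tup in bigrams
            if not any(wrd in stopwords for wrd in tup)]

print(kept_original == kept_any)
```

Note that the comparison is case-sensitive, so `('The', 'text')` survives in both versions; comparing `wrd.lower()` instead would remove it.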

typetoken

Aug 9, 2012, 3:13:03 AM
to nltk-...@googlegroups.com
Many thanks indeed. I modified it to pairs = [tup for tup in bigrams if not False in [False for wrd in tup if wrd.lower() in stopwords]], which gives a better result: it no longer includes pairs such as ('.', 'In'), ('.', 'A'), ('.', 'But'), ('.', 'It').

However, I still want to figure out why the following two approaches failed to achieve the desired results:
1) pairs = [p for p in bigrams if [w for w in p] not in stopwords]  
2) pairs = [p for p in bigrams for w in p if w.lower() not in stopwords] 


Kevin

Aug 9, 2012, 8:20:53 AM
to nltk-...@googlegroups.com
1) pairs = [p for p in bigrams if [w for w in p] not in stopwords]  

xs = [ [w for w in p]  for p in bigrams] 
for i in xs:print i
... 
['Some', 'text']
['text', 'with']
['with', 'in']
['in', 'a']
['a', 'stopwords']

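This way you are checking whether a list such as ['in', 'a'] is an element of stopwords, which is a list of strings. A list never equals a string, so the condition `[w for w in p] not in stopwords` is always True and nothing is filtered out.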
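
To see why both attempts 1) and 2) misbehave, here is a minimal sketch on the five-bigram example above. The short stopword list is a stand-in for NLTK's list (an assumption), and `zip` replaces `nltk.bigrams` so the snippet runs without NLTK:

```python
# Toy data standing in for the corpus and for NLTK's stopword list (assumptions).
stopwords = ['a', 'the', 'with', 'in', 'of']
words = ['Some', 'text', 'with', 'in', 'a', 'stopwords']
bigrams = list(zip(words, words[1:]))

# Attempt 1: asks whether a whole list such as ['in', 'a'] is an element of stopwords.
attempt1 = [p for p in bigrams if [w for w in p] not in stopwords]

# Attempt 2: the nested loop emits p once for every word in it that is not a stopword.
attempt2 = [p for p in bigrams for w in p if w.lower() not in stopwords]

print(attempt1 == bigrams)
print(attempt2)
```

Attempt 1 keeps every bigram. Attempt 2 emits `('Some', 'text')` twice, keeps pairs with exactly one stopword once, and drops only the pairs in which both words are stopwords. That would explain why `('of', 'the')` vanished from the second output while `(',', 'and')` stayed.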

John H. Li

Aug 10, 2012, 12:46:50 AM
to nltk-...@googlegroups.com
Many thanks. I see now. 

By the way, is it possible to select each word from the bigram list [('some', 'text'), ('text', 'with'), ...] so that we can refer to each word and check whether it is in stopwords? Is there a function to pick out each word of a pair in a bigram list?


Best
T.T.

John H. Li

Aug 13, 2012, 10:28:06 PM
to nltk-...@googlegroups.com
Thanks. Is it possible to select each word from the bigram list [('some', 'text'), ('text', 'with'), ...] so that we can refer to each word and check whether it is in stopwords? Is there a function to pick out each word of a pair in a bigram list?
I tried the following code, but it still fails to filter the stopwords out of the pairs.

>>> def FreqBigram5(text):
    bigrams = nltk.bigrams(text)
    stopwords = nltk.corpus.stopwords.words('english')
    pairs = [p for p in bigrams for m in p if m.lower() not in stopwords]

Kevin

Aug 16, 2012, 10:02:33 PM
to nltk-...@googlegroups.com
>>> xs
[('some', 'text'), ('text', 'with')]
>>> [w for ws in xs for w in ws if w not in ['a','the','with']]
['some', 'text', 'text']

-------------------------------------------------
or if you just want to filter


stopwords = ['a','the','with']

def filter_stopwords_bigrams(lst_of_bigrams):
    filtered = []
    for tup in lst_of_bigrams:
        if tup[0] in stopwords or tup[1] in stopwords:
            continue
        filtered.append(tup)
    return filtered

bigram_list = [('some', 'text'),('text', 'with')]
print filter_stopwords_bigrams(bigram_list)
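
On the question of picking out each word of a pair: a bigram is an ordinary tuple, so its words can be reached by index (`tup[0]`, `tup[1]`) or by unpacking them in the loop header. Below is a minimal sketch of the same filter with unpacking, written so that it also runs under Python 3. The name `filter_stopword_bigrams` and the toy stopword list are hypothetical, and the stopwords go into a `set` so that each membership test does not scan a list:

```python
def filter_stopword_bigrams(bigrams, stopwords):
    """Return the bigrams in which neither word is a stopword (case-insensitive)."""
    stopset = set(stopwords)  # average O(1) membership test
    return [(first, second) for first, second in bigrams
            if first.lower() not in stopset and second.lower() not in stopset]

bigram_list = [('some', 'text'), ('text', 'with'), ('The', 'text')]
result = filter_stopword_bigrams(bigram_list, ['a', 'the', 'with'])
print(result)
```

Passing `nltk.corpus.stopwords.words('english')` as the second argument should work the same way, since it is a plain list of lowercase strings.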

John H. Li

Aug 17, 2012, 1:42:04 AM
to nltk-...@googlegroups.com
Many thanks. I've learned much from this topic. With your hints, I've now come up with three modified approaches to solve the problem successfully, which are summed up as follows:

Method 1) 
>>> import nltk
>>> def FreqBigram5(text):
    bigrams = nltk.bigrams(text)
    stopwords = nltk.corpus.stopwords.words('english')
    filtered = []
    for pairs in bigrams:
        if pairs[0].lower() in stopwords or pairs[1].lower() in stopwords:
            continue
        filtered.append(pairs)
    fdist = nltk.FreqDist(filtered)
    print fdist.keys()[:50]

>>> text = nltk.corpus.brown.words(categories = 'news')
>>> FreqBigram5(text)
[("''", '.'), ('.', '``'), ("''", ','), (';', ';'), (',', '``'), ('said', '.'), (',', 'Mrs.'), ('said', ','), ('.', 'Mr.'), (',', 'said'), ('New', 'York'), ('per', 'cent'), ('.', 'Mrs.'), (':', '``'), ('?', '?'), ('United', 'States'), ('year', '.'), (',', 'however'), ('last', 'year'), ('however', ','), (',', 'Mr.'), ('White', 'House'), ('last', 'week'), ('Jr.', ','), (')', '--'), ('week', '.'), (')', '.'), ('home', 'runs'), (',', 'according'), ('.', '('), ('.', 'One'), ("''", '?'), (')', ','), ('U.', 'S.'), ('year', ','), ('!', '!'), (',', 'including'), (',', 'would'), ('President', 'Kennedy'), ('years', '.'), ('daughter', ','), ('last', 'night'), (',', 'president'), (',', 'says'), ('wife', ','), ('San', 'Francisco'), ('time', ','), ('time', '.'), ('years', 'ago'), ('(', 'AP')]


Method 2) 
>>> def FreqBigram5(text):
    bigrams = nltk.bigrams(text)
    stopwords = nltk.corpus.stopwords.words('english')
    pairs = [p for p in bigrams if p[0].lower() not in stopwords and p[1].lower() not in stopwords]
    fdist = nltk.FreqDist(pairs)
    print fdist.keys()[:50]

 
>>> text = nltk.corpus.brown.words(categories = 'news')
>>> FreqBigram5(text)
[("''", '.'), ('.', '``'), ("''", ','), (';', ';'), (',', '``'), ('said', '.'), (',', 'Mrs.'), ('said', ','), ('.', 'Mr.'), (',', 'said'), ('New', 'York'), ('per', 'cent'), ('.', 'Mrs.'), (':', '``'), ('?', '?'), ('United', 'States'), ('year', '.'), (',', 'however'), ('last', 'year'), ('however', ','), (',', 'Mr.'), ('White', 'House'), ('last', 'week'), ('Jr.', ','), (')', '--'), ('week', '.'), (')', '.'), ('home', 'runs'), (',', 'according'), ('.', '('), ('.', 'One'), ("''", '?'), (')', ','), ('U.', 'S.'), ('year', ','), ('!', '!'), (',', 'including'), (',', 'would'), ('President', 'Kennedy'), ('years', '.'), ('daughter', ','), ('last', 'night'), (',', 'president'), (',', 'says'), ('wife', ','), ('San', 'Francisco'), ('time', ','), ('time', '.'), ('years', 'ago'), ('(', 'AP')]

Method 3)
>>> def FreqBigram5(text):
    bigrams = nltk.bigrams(text)
    stopwords = nltk.corpus.stopwords.words('english')
    pairs = [tup for tup in bigrams if not False in [False for wrd in tup if wrd.lower() in stopwords]]
    fdist = nltk.FreqDist(pairs)
    print fdist.keys()[:50]

>>> text = nltk.corpus.brown.words(categories = 'news')
>>> FreqBigram5(text)
[("''", '.'), ('.', '``'), ("''", ','), (';', ';'), (',', '``'), ('said', '.'), (',', 'Mrs.'), ('said', ','), ('.', 'Mr.'), (',', 'said'), ('New', 'York'), ('per', 'cent'), ('.', 'Mrs.'), (':', '``'), ('?', '?'), ('United', 'States'), ('year', '.'), (',', 'however'), ('last', 'year'), ('however', ','), (',', 'Mr.'), ('White', 'House'), ('last', 'week'), ('Jr.', ','), (')', '--'), ('week', '.'), (')', '.'), ('home', 'runs'), (',', 'according'), ('.', '('), ('.', 'One'), ("''", '?'), (')', ','), ('U.', 'S.'), ('year', ','), ('!', '!'), (',', 'including'), (',', 'would'), ('President', 'Kennedy'), ('years', '.'), ('daughter', ','), ('last', 'night'), (',', 'president'), (',', 'says'), ('wife', ','), ('San', 'Francisco'), ('time', ','), ('time', '.'), ('years', 'ago'), ('(', 'AP')]
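
A note for anyone running this under Python 3: all three methods print `fdist.keys()[:50]`, which relies on the older `FreqDist` returning its keys sorted by frequency. In Python 3, slicing `keys()` raises a `TypeError`. In NLTK 3, `FreqDist` is a subclass of `collections.Counter`, so `most_common(50)` is the usual replacement. The sketch below uses `Counter` directly as a stand-in so that it runs without NLTK; `top_bigrams` and the sample sentence are hypothetical:

```python
from collections import Counter

def top_bigrams(words, stopwords, n=50):
    """Return the n most frequent bigrams whose words are both non-stopwords."""
    stopset = set(stopwords)
    words = list(words)
    # zip(words, words[1:]) yields adjacent pairs, like nltk.bigrams(words).
    pairs = [(a, b) for a, b in zip(words, words[1:])
             if a.lower() not in stopset and b.lower() not in stopset]
    return Counter(pairs).most_common(n)

sample = 'the White House said the White House is in New York'.split()
top = top_bigrams(sample, ['the', 'is', 'in'], n=2)
print(top)
```

The result is a list of `(bigram, count)` pairs rather than bare bigrams. Punctuation tokens are still counted, as in the outputs above; filtering them out would need an extra test such as `a.isalpha()`.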
