Help me with best use of NLTK API for this research

Edmon

Mar 11, 2012, 3:16:19 PM
to nltk-...@googlegroups.com
I am new to NLTK, but I am thoroughly impressed by this package.

I am using it for my research in computational linguistics and learning as I go.
There is a particular statistic I would like to collect, and I am hoping someone on this list can
give me some tips on implementing it.

I would like to study patterns of word groupings, and their frequencies, around particular families of words.

For example, I would like to find all occurrences of the word "dislike" in all of its variations (inflections, ...)
and determine which words most commonly precede and follow it at the sentence level.

I imagine I would take the raw text (I already know how to do this), break it into sentences (I know how to do this too),
and then, within each sentence, search for occurrences of the word in all of its forms and build a set
of preceding patterns (pre-bigram, pre-trigram, etc.) and the same for the following patterns.

For example, let's take this simple, made-up text:

"Every Sunday they gather at the Mall. She dislikes the crowd, but she likes the company of her friends.
Movie that they are going to see is a typical blockbuster. Her boyfriend likes movies like that. She does not."

From this text I would collect:

(She, dislikes)
(dislikes, the)
(she, likes)
(likes, the)
(boyfriend, likes)
(likes, movies)

(She, dislikes)
(dislikes, the, crowd)
(but, she, likes)
(likes, the, company)
(Her, boyfriend, likes)
(likes, movies, like)
...

and then I would finally collect statistics on the frequencies of particular patterns (I know how to do this).

Could someone please suggest a tip or an approach for the neighborhood pattern collection part,
applied to word families like these at the sentence level, using the NLTK API?

Thank you in advance,
Edmon

Morten Minde Neergaard

Mar 11, 2012, 4:53:40 PM
to nltk-...@googlegroups.com
At 12:16, Sun 2012-03-11, Edmon wrote:
[…]

> From this text I would collect:
>
> (She, dislikes)
> (dislikes, the)
> (she, likes)
> (likes, the)
> (boyfriend, likes)
> (likes, movies)
[…]

This approach might give you what you want. Change the 2 to a 3 to get
trigrams.

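import nltk

filter_words = ('like', 'likes', 'dislike', 'dislikes', 'enjoy', 'enjoys')
# Keep n-grams that start or end with a target word; list() shows them on Python 3.
list(filter(lambda gram: gram[0] in filter_words or gram[-1] in filter_words,
            nltk.ngrams(nltk.tokenize.word_tokenize(text), 2)))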

You may want to use a tagger if you want to avoid false positives on e.g.
«I am not like you» while keeping e.g. «I do not like you».
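
If the n-grams should not cross sentence boundaries and you want counts
directly, something along these lines may work. It is only a rough sketch
using the standard library: split_sentences and tokenize are naive regex
stand-ins I made up for nltk.sent_tokenize and nltk.word_tokenize, and
TARGET_FORMS and neighborhood_counts are hypothetical names, not part of NLTK.

```python
import re
from collections import Counter

# Hypothetical set of target forms; a stemmer or lemmatizer could replace it.
TARGET_FORMS = {"like", "likes", "dislike", "dislikes"}

def split_sentences(text):
    # Naive stand-in for nltk.sent_tokenize: split after ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # Naive stand-in for nltk.word_tokenize: keep runs of letters/apostrophes.
    return re.findall(r"[A-Za-z']+", sentence)

def neighborhood_counts(text, n=2, targets=TARGET_FORMS):
    """Count n-grams that end (pre) or start (post) with a target form,
    without crossing a sentence boundary."""
    pre, post = Counter(), Counter()
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        for i, tok in enumerate(tokens):
            if tok.lower() not in targets:
                continue
            if i - n + 1 >= 0:          # enough words before the target
                pre[tuple(tokens[i - n + 1:i + 1])] += 1
            if i + n <= len(tokens):    # enough words after the target
                post[tuple(tokens[i:i + n])] += 1
    return pre, post

text = ("Every Sunday they gather at the Mall. She dislikes the crowd, "
        "but she likes the company of her friends. Movie that they are "
        "going to see is a typical blockbuster. Her boyfriend likes movies "
        "like that. She does not.")

pre, post = neighborhood_counts(text, n=2)
pre3, post3 = neighborhood_counts(text, n=3)
print(pre.most_common())
print(post3.most_common())
```

Note that this also picks up the preposition in «movies like that», which is
exactly the kind of false positive a tagger would help filter out.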

Cheers,
--
Morten Minde Neergaard

Edmon

Mar 11, 2012, 6:09:49 PM
to nltk-...@googlegroups.com
Thanks Morten.