Help me with best use of NLTK API for this research

Edmon

Mar 11, 2012, 3:16:19 PM

to nltk-...@googlegroups.com

I am new to NLTK but I am absolutely impressed and enamored by this package.

I am using it for my research in computational linguistics and I am learning as I go.

I have a particular statistic I would like to collect, and I am hoping someone on this list can give me some tips on implementing it.

I would like to study patterns of word groupings, and their frequencies, around particular families of words.

For example, I would like to find all occurrences of the word "dislike" in all of its variations (inflections, ...) and the most common words that precede and follow it at the sentence level.

I imagine I would take the raw text (I already know how to do this), break it into sentences (I know how to do this too), and then, within each sentence, search for occurrences of the word in all of its forms and build a set of preceding patterns (pre-bigram, pre-trigram, etc.) and the same for the following patterns.

For example, let's take this simple, made-up text:

"Every Sunday they gather at the Mall. She dislikes the crowd, but she likes the company of her friends.

Movie that they are going to see is a typical blockbuster. Her boyfriend likes movies like that. She does not."

From this text I would collect:

(She, dislikes)

(dislikes, the)

(she, likes)

(likes, the)

(boyfriend, likes)

(likes, movies)

(She, dislikes)

(dislikes, the, crowd)

(but, she, likes)

(likes, the, company)

(Her, boyfriend, likes)

(likes, movies, like)

...

and then I would finally collect statistics on the frequencies of the particular patterns (I know how to do this).

Could someone please suggest a tip or an approach for collecting these neighborhood patterns around such word families, at the sentence level, using the NLTK API?

Thank you in advance,

Edmon

Morten Minde Neergaard

Mar 11, 2012, 4:53:40 PM

to nltk-...@googlegroups.com

At 12:16, Sun 2012-03-11, Edmon wrote:
[…]

> From this text I would collect:
>
> (She, dislikes)
> (dislikes, the)
> (she, likes)
> (likes, the)
> (boyfriend, likes)
> (likes, movies)

[…]

This approach might give you what you want. Change the 2 to a 3 to get
trigrams.

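import nltk  # word_tokenize needs NLTK's 'punkt' tokenizer data

filter_words = ('like', 'likes', 'dislike', 'dislikes', 'enjoy', 'enjoys')
# Keep bigrams from `text` (the raw string) whose first or last token is in filter_words.
list(filter(lambda gram: gram[0] in filter_words or gram[-1] in filter_words,
            nltk.ngrams(nltk.tokenize.word_tokenize(text), 2)))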
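
A possible sentence-level variant, as a rough sketch only: it matches inflections by stemming each token instead of listing every form, and it counts the matching n-grams per sentence so that no n-gram spans a sentence boundary. The function name neighbourhood_counts, the regex sentence splitter, and the RegexpTokenizer pattern are assumptions made here so the example runs without any NLTK data downloads; nltk.sent_tokenize and nltk.word_tokenize would be the more natural choices. Note that a stemmer may also conflate unrelated words with the target family.

```python
import re
from collections import Counter

from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.util import ngrams

# Assumption: crude stand-ins for nltk.sent_tokenize / nltk.word_tokenize,
# chosen only because they need no downloaded NLTK data.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
WORD_TOKENIZER = RegexpTokenizer(r"[A-Za-z']+")
STEMMER = PorterStemmer()


def neighbourhood_counts(text, base_words, n=2):
    """Count n-grams that start or end with an inflection of base_words.

    N-grams are built one sentence at a time, so none crosses a sentence
    boundary. Tokens are lowercased before stemming and counting.
    """
    targets = {STEMMER.stem(word) for word in base_words}
    counts = Counter()
    for sentence in SENTENCE_END.split(text):
        tokens = [tok.lower() for tok in WORD_TOKENIZER.tokenize(sentence)]
        for gram in ngrams(tokens, n):
            first, last = STEMMER.stem(gram[0]), STEMMER.stem(gram[-1])
            if first in targets or last in targets:
                counts[gram] += 1
    return counts


text = ("Every Sunday they gather at the Mall. She dislikes the crowd, "
        "but she likes the company of her friends. Her boyfriend likes "
        "movies like that. She does not.")

bigrams = neighbourhood_counts(text, ("like", "dislike"), n=2)
trigrams = neighbourhood_counts(text, ("like", "dislike"), n=3)
```

Because the result is a Counter, bigrams.most_common(10) lists the most frequent patterns directly.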

You may want to use a tagger if you want to avoid false positives on e.g.
«I am not like you» while keeping e.g. «I do not like you».
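
One way the tagger idea could look, as a sketch: keep a target word only when its part-of-speech tag marks it as a verb. The helper name verb_uses is hypothetical, and the two tagged sentences below are hand-tagged for illustration in the (word, tag) shape that nltk.pos_tag returns with its default Penn Treebank tagset; a real tagger's output on these sentences may differ.

```python
def verb_uses(tagged_tokens, targets):
    """Return the positions of target words that are tagged as verbs.

    tagged_tokens is a list of (word, tag) pairs, the shape nltk.pos_tag
    returns. Penn Treebank verb tags all start with "VB".
    """
    return [index
            for index, (word, tag) in enumerate(tagged_tokens)
            if word.lower() in targets and tag.startswith("VB")]


# Hand-tagged examples (an assumption, not actual tagger output).
not_like = [("I", "PRP"), ("am", "VBP"), ("not", "RB"),
            ("like", "IN"), ("you", "PRP")]
do_not_like = [("I", "PRP"), ("do", "VBP"), ("not", "RB"),
               ("like", "VB"), ("you", "PRP")]

targets = {"like", "likes", "dislike", "dislikes"}
preposition_hits = verb_uses(not_like, targets)
verb_hits = verb_uses(do_not_like, targets)
```

The returned positions can then be used to pick out the surrounding n-grams, so only the verb uses contribute to the counts.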

Cheers,
--
Morten Minde Neergaard

Edmon

Mar 11, 2012, 6:09:49 PM

to nltk-...@googlegroups.com

Thanks Morten.
