I am new to NLTK, but I am already impressed by and enamored of this
package.
I am using it for my research in computational linguistics and I am
learning as I go.
I have a particular statistic I would like to collect, and I am hoping
someone on this list could help
me with tips for implementing it.
I would like to study patterns of word groupings and
their frequencies around particular families of words.
For example, I would like to find all occurrences of the word "dislike"
in all of its variations (inflections, ...),
along with the words that most commonly precede and follow it at the
sentence level.
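
For the "all of its variations" part, one idea I had is to compare word stems. Below is a minimal sketch, assuming NLTK's PorterStemmer is good enough for regular inflections; the helper name in_family is just something I made up for illustration. A stemmer will not catch irregular forms and can group unrelated words, so this is only a starting point.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def in_family(token, base, stemmer=stemmer):
    """Return True if token and base reduce to the same Porter stem.

    in_family is a hypothetical helper, not part of NLTK.
    """
    return stemmer.stem(token.lower()) == stemmer.stem(base.lower())


# Tokens of one sentence, assumed to be tokenized already.
tokens = ["She", "dislikes", "the", "crowd", "but", "she", "likes",
          "the", "company", "of", "her", "friends"]

# Keep only the tokens that belong to the "dislike" family.
hits = [t for t in tokens if in_family(t, "dislike")]
print(hits)
```

Note that "likes" is not matched here, because it does not share a stem with "dislike"; each family would need its own base word.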
I imagine I would take the raw text (I already know how to do this), break
it down into sentences (I know how to do this too),
and then, within each sentence, search for occurrences of the word in all
of its forms and start building a set
of the patterns that precede it (pre-bigram, pre-trigram, etc.) and
likewise of the patterns that follow it.
For example, let's take this simple, made-up text:
"Every Sunday they gather at the Mall. She dislikes the crowd, but she
likes the company of her friends.
Movie that they are going to see is a typical blockbuster. Her boyfriend
likes movies like that. She does not."
From this text I would collect:
(She, dislikes)
(dislikes, the)
(she, likes)
(likes, the)
(boyfriend, likes)
(likes, movies)
(She, dislikes)
(dislikes, the, crowd)
(but, she, likes)
(likes, the, company)
(Her, boyfriend, likes)
(likes, movies, like)
...
Finally, I would collect statistics on the frequencies of these
patterns (I know how to do this part).
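
To make the collection step concrete, here is a rough plain-Python sketch of what I have in mind. It does not use NLTK at all: the sentences are assumed to be split already, the regex tokenizer is deliberately crude, and TARGET_FORMS, tokenize and neighborhood_patterns are names I made up. It keeps only full-length windows, so a target at the start of a sentence yields no pre-trigram.

```python
import re
from collections import Counter

# Hypothetical target forms, listed by hand for this sketch. Bare "like"
# is left out because in the sample text it is a preposition; telling the
# two uses apart would need part-of-speech information.
TARGET_FORMS = {"dislike", "dislikes", "disliked", "disliking",
                "likes", "liked", "liking"}


def tokenize(sentence):
    """Crude word tokenizer: runs of letters and apostrophes."""
    return re.findall(r"[A-Za-z']+", sentence)


def neighborhood_patterns(tokens, targets, max_n=3):
    """Collect pre- and post-patterns of 2..max_n tokens around each target.

    A pre-pattern ends with the target, a post-pattern starts with it.
    Windows that would cross the sentence boundary are skipped.
    """
    patterns = []
    for i, tok in enumerate(tokens):
        if tok.lower() not in targets:
            continue
        for n in range(2, max_n + 1):
            start = i - n + 1
            if start >= 0:                      # full-length pre-pattern
                patterns.append(tuple(tokens[start:i + 1]))
            if i + n <= len(tokens):            # full-length post-pattern
                patterns.append(tuple(tokens[i:i + n]))
    return patterns


# The made-up text from above, already split into sentences.
sentences = [
    "Every Sunday they gather at the Mall.",
    "She dislikes the crowd, but she likes the company of her friends.",
    "Movie that they are going to see is a typical blockbuster.",
    "Her boyfriend likes movies like that.",
    "She does not.",
]

counts = Counter()
for sentence in sentences:
    counts.update(neighborhood_patterns(tokenize(sentence), TARGET_FORMS))

for pattern, freq in counts.most_common():
    print(freq, pattern)
```

Case is kept as it appears in the text, so (She, dislikes) and (she, dislikes) would be counted separately; lowercasing the tokens before counting would merge them.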
Would someone please suggest a tip or an approach for doing the
neighborhood pattern collection part
for related word forms at the sentence level using the NLTK API?
Thank you in advance,
Edmon