Term extraction

Victor Miclovich

Mar 1, 2010, 4:58:41 PM

to nltk-dev

Has anyone tried creating a term extraction service like the one offered by
Yahoo's term extractor API? How would you do it?

My proposal is a bit shaky:
I would first parse the text,
tokenize it,
then POS-tag it using the standard nltk.pos_tag() function.

After that I'd iterate through the list of (token, POS tag) tuples,
in this case with a list comprehension, and keep the tokens tagged
NN or NNP, since those seem the most logical "terms" to include.

>>> import nltk
>>> sentence = "this is a great book, I'd love to read it sometime. It is by Grady Booch; he's a wonderful author"
>>> sentence = nltk.word_tokenize(sentence)
>>> sentence_tagged = nltk.pos_tag(sentence)
>>> term_extracts = [term for term,tag in sentence_tagged if tag == "NNP" or tag == "]
>>> term_extracts
['book', 'author']

Does anyone have a better approach to creating a term extraction
service?

Victor Miclovich

Mar 1, 2010, 5:01:56 PM

to nltk-dev

I had a typo in my previous code... look at this instead

>>> import nltk
>>> sentence = "this is a great book, I'd love to read it sometime. It is by Grady Booch; he's a wonderful author"
>>> sentence = nltk.word_tokenize(sentence)
>>> sentence_tagged = nltk.pos_tag(sentence)

>>> term_extracts = [term for term,tag in sentence_tagged if tag == "NN"]
>>> term_extracts
['book', 'author']
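
The filter above keeps single tokens only, so a name like "Grady Booch" would be split or dropped. One possible refinement, offered as a rough sketch and not as an NLTK feature, is to merge runs of adjacent noun-tagged tokens into one term. The helper name extract_terms and the tag set are hypothetical, and the tagged list is written by hand for illustration; it is not actual nltk.pos_tag() output.

```python
# Penn Treebank noun tags; which tags count as "term" material is an assumption.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def extract_terms(tagged_tokens, noun_tags=NOUN_TAGS):
    """Join runs of consecutive noun-tagged tokens into multi-word terms."""
    terms, current = [], []
    for token, tag in tagged_tokens:
        if tag in noun_tags:
            current.append(token)          # extend the current noun run
        elif current:
            terms.append(" ".join(current))  # a non-noun closes the run
            current = []
    if current:                            # flush a run that ends the input
        terms.append(" ".join(current))
    return terms

# Hand-written (token, tag) pairs for illustration only.
tagged = [
    ("It", "PRP"), ("is", "VBZ"), ("by", "IN"),
    ("Grady", "NNP"), ("Booch", "NNP"), (";", ":"),
    ("he", "PRP"), ("'s", "VBZ"), ("a", "DT"),
    ("wonderful", "JJ"), ("author", "NN"),
]
terms = extract_terms(tagged)
print(terms)
```

With real input, the list returned by nltk.pos_tag() could be passed in place of the hand-written pairs.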

Does anyone have a better approach to creating a term extraction
service?

regards

Steven Bird

Mar 1, 2010, 5:48:27 PM

to nltk...@googlegroups.com

Victor,

How about reading up on named entity detection:
http://en.wikipedia.org/wiki/Named_entity_recognition

and also on IR approaches to term weighting, e.g. chapter 6 of the
Manning et al. textbook:
http://nlp.stanford.edu/IR-book/information-retrieval-book.html

and then posting any follow-up questions to nltk-users (since this
isn't about NLTK development).

-Steven Bird
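
For readers following up on the term-weighting pointer: a minimal sketch of tf-idf weighting, the kind of scheme that chapter covers, might look like the following. It assumes documents are already tokenized, uses raw term counts and a base-10 logarithm for the inverse document frequency, and the function name tf_idf and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log10(n_docs / df[t]) for t in tf})
    return weights

# Toy corpus, invented for this example.
docs = [
    ["great", "book", "by", "a", "wonderful", "author"],
    ["a", "book", "about", "software", "design"],
    ["a", "talk", "about", "software", "testing"],
]
weights = tf_idf(docs)
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(term, round(weight, 3))
```

A term that occurs in every document (here "a") gets weight zero, while terms unique to one document score highest, which is the behavior a term extractor would want when ranking candidates.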
