My proposal is a bit shaky. I would first parse the text and tokenize
it, then POS-tag it using the normal nltk.pos_tag() function. After
that I'd iterate through the resulting list of (token, POS tag)
tuples. In this case, a list comprehension that keeps the tokens
tagged NN (common noun) or NNP (proper noun) would seem the most
logical way to pick the "terms" to include:
>>> import nltk
>>> sentence = "this is a great book, I'd love to read it sometime. It is by Grady Booch; he's a wonderful author"
>>> sentence = nltk.word_tokenize(sentence)
>>> sentence_tagged = nltk.pos_tag(sentence)
>>> term_extracts = [term for term,tag in sentence_tagged if tag == "NN"]
>>> term_extracts
['book', 'author']
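A rough sketch of the NN-or-NNP variant, which also joins adjacent noun tokens so that a name like "Grady Booch" stays one term. The tags in `tagged` are written by hand for illustration (an assumption, not actual nltk.pos_tag() output), and `extract_noun_terms` is a hypothetical helper name:

```python
from itertools import groupby


def extract_noun_terms(tagged, noun_tags=frozenset({"NN", "NNP"})):
    """Join runs of adjacent noun-tagged tokens into single terms."""
    terms = []
    # groupby splits the sequence into consecutive runs of noun / non-noun tokens.
    for is_noun, run in groupby(tagged, key=lambda pair: pair[1] in noun_tags):
        if is_noun:
            terms.append(" ".join(token for token, _ in run))
    return terms


# Hand-tagged sample (assumed tags, not real tagger output).
tagged = [
    ("It", "PRP"), ("is", "VBZ"), ("by", "IN"),
    ("Grady", "NNP"), ("Booch", "NNP"), (";", ":"),
    ("he", "PRP"), ("'s", "VBZ"), ("a", "DT"),
    ("wonderful", "JJ"), ("author", "NN"),
]
print(extract_noun_terms(tagged))  # ['Grady Booch', 'author']
```

In practice the `tagged` list would come from nltk.pos_tag(), and the result depends on how the tagger labels each token.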
Does anyone have a better approach to creating a term extraction
service?
regards
How about reading up on named entity detection:
http://en.wikipedia.org/wiki/Named_entity_recognition
and also on IR approaches to term weighting, e.g. chapter 6 of the
Manning et al. textbook:
http://nlp.stanford.edu/IR-book/information-retrieval-book.html
and then posting any follow-up questions to nltk-users (since this
isn't about NLTK development).
-Steven Bird
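
As a rough illustration of the term-weighting idea mentioned above, here is a minimal tf-idf sketch in plain Python. The toy documents, the function name `tf_idf`, and the exact weighting (raw term count times log(N / document frequency)) are assumptions chosen for illustration. This is one common variant among several, not necessarily the one a real service should use:

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    weight = (count of term in document) * log(N / document frequency)
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n_docs / df[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]


# Toy, pre-tokenized documents (made up for illustration).
docs = [
    ["great", "book", "by", "a", "wonderful", "author"],
    ["a", "book", "about", "software", "design"],
    ["a", "wonderful", "day"],
]
weights = tf_idf(docs)
# Terms of the first document, highest weight first.
print(sorted(weights[0], key=weights[0].get, reverse=True))
```

A term that occurs in every document (here "a") gets weight zero, while terms confined to one document score highest. That is the property that makes this kind of weighting useful for ranking candidate terms from a POS-based filter.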