Stop tokenizers from splitting possessives


tom.pr...@gmail.com

Feb 13, 2015, 4:09:18 PM
to nltk-...@googlegroups.com
I am working on tagging sentences and am having a problem with sentence/word tokenization. Possessive nouns like Bob's are split into Bob/NP and 's/NN.
What I am trying to get is Bob's/NP, or whatever tag corresponds to a possessive noun. How do I get the n-gram taggers and the word/sentence tokenizers to honor possessives?

Thanks.


Fred Mailhot

Feb 13, 2015, 4:56:17 PM
to nltk-...@googlegroups.com
The different tokenizers will split those differently.

 $ python
>>> from nltk import WordPunctTokenizer
>>> from nltk import TreebankWordTokenizer
>>> from nltk.tag import pos_tag
>>> wpt = WordPunctTokenizer()
>>> tbt = TreebankWordTokenizer()
>>> wpt.tokenize("This is Bob's sandwich.")
['This', 'is', 'Bob', "'", 's', 'sandwich', '.']
>>> tbt.tokenize("This is Bob's sandwich.")
['This', 'is', 'Bob', "'s", 'sandwich', '.']
>>> pos_tag(wpt.tokenize("This is Bob's sandwich."))
[('This', 'DT'), ('is', 'VBZ'), ('Bob', 'NNP'), ("'", 'POS'), ('s', 'NNS'), ('sandwich', 'VBP'), ('.', '.')]
>>> pos_tag(tbt.tokenize("This is Bob's sandwich."))
[('This', 'DT'), ('is', 'VBZ'), ('Bob', 'NNP'), ("'s", 'POS'), ('sandwich', 'NN'), ('.', '.')]
>>>

--
You received this message because you are subscribed to the Google Groups "nltk-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nltk-users+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Alexis Dimitriadis

Feb 13, 2015, 5:51:48 PM
to nltk-...@googlegroups.com
> What I am trying to get is Bob's/NP or whatever tag corresponds to noun possessive.  How do I get the n-gram taggers and the word/sentence tokenizers to honor possessives.

On 13 Feb 2015, at 22:56, Fred Mailhot <fred.m...@gmail.com> wrote:

> The different tokenizers will split those differently.

Whatever tokenizer you select, you'll need to use a part-of-speech tagger that's been trained with the same tokenization style.

The Brown corpus is tokenized in the style you want, so maybe you can use it as a starting point: "Bob's" would be tagged NP$ (noun, proper, genitive).
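
To make the tokenization side of this concrete, here is a minimal sketch of a tokenizer that leaves possessives attached, so that its output lines up with a tagger trained on Brown-style tokens. The pattern and the function name `tokenize_keep_possessives` are illustrative assumptions, not part of NLTK, and the sketch uses only the standard `re` module (Python 3).

```python
import re

# Hypothetical pattern, not taken from NLTK: a word may contain an internal
# apostrophe (straight ' or curly U+2019), so "Bob's" stays a single token.
# Any other non-space character becomes a token of its own.
TOKEN_PATTERN = re.compile(r"\w+(?:['\u2019]\w+)*|[^\w\s]")


def tokenize_keep_possessives(text):
    """Split text into tokens without detaching a possessive 's."""
    return TOKEN_PATTERN.findall(text)


print(tokenize_keep_possessives("This is Bob's sandwich."))
# prints ['This', 'is', "Bob's", 'sandwich', '.']
```

The same pattern could probably be passed to `nltk.tokenize.RegexpTokenizer`, which is why it avoids capturing groups. Note that it also keeps contractions such as "don't" whole, and that a trailing apostrophe (as in "dogs'") is still split off, so it is worth checking the output against the training corpus before relying on it.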

Alexis

Dr. Alexis Dimitriadis | Assistant Professor and Senior Research Fellow | Utrecht Institute of Linguistics OTS | Utrecht University | Trans 10, 3512 JK Utrecht, room 2.33 | +31 30 253 65 68 | a.dimi...@uu.nl | www.hum.uu.nl/medewerkers/a.dimitriadis

Tom Merriewether

Feb 17, 2015, 9:03:18 AM
to nltk-...@googlegroups.com
Thanks.  My problem seems to be that the possessive apostrophe causes a Python problem. I am extracting corpus text from an XML document, replacing the Unicode sequence with a real apostrophe, and writing the text out as UTF-8.  For some reason the TnT tagger I am using fails on these inputs and throws an exception.  I am struggling to find a way around this, since I am a relative newcomer to Python and the whole ASCII/Unicode mash-up is confusing.
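
For reference, the clean-up step described above might look roughly like the sketch below. It assumes Python 3, and it assumes the "unicode sequence" is the right single quotation mark (U+2019), possibly still encoded as the character reference `&#8217;`; the post does not say which character is involved or what the exception is. The names `normalize_apostrophes` and `write_utf8` are hypothetical.

```python
import html


def normalize_apostrophes(text):
    """Decode character references, then map curly apostrophes to ASCII."""
    decoded = html.unescape(text)          # "&#8217;" becomes "\u2019"
    return decoded.replace("\u2019", "'")  # "\u2019" becomes "'"


def write_utf8(path, text):
    """Write text to path, stating the encoding explicitly."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


cleaned = normalize_apostrophes("This is Bob&#8217;s sandwich.")
print(cleaned)
# prints This is Bob's sandwich.
```

If the text comes out of an XML parser, the character references are normally decoded already and only the `replace` call matters. Keeping everything as text strings until the final write, and naming the encoding there, avoids most of the ASCII/Unicode mixing.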



Tom Merriewether

Feb 17, 2015, 9:06:17 AM
to nltk-...@googlegroups.com
Thanks.

I am actually training on the Brown corpus (the "learned" category).  I'll look at this again, since it should properly identify the genitive case.  Thanks again.

