Stop tokenizers from splitting possessives

tom.pr...@gmail.com

Feb 13, 2015, 4:09:18 PM

to nltk-...@googlegroups.com

I am working on tagging sentences and am having a problem with sentence/word tokenization. The problem is that possessive nouns like Bob's become Bob/NP, 's/NN.

What I am trying to get is Bob's/NP, or whatever tag corresponds to a possessive noun. How do I get the n-gram taggers and the word/sentence tokenizers to honor possessives?

Thanks.

Fred Mailhot

Feb 13, 2015, 4:56:17 PM

to nltk-...@googlegroups.com

The different tokenizers will split those differently.

$ python
>>> from nltk import WordPunctTokenizer
>>> from nltk import TreebankWordTokenizer
>>> from nltk.tag import pos_tag
>>> wpt = WordPunctTokenizer()
>>> tbt = TreebankWordTokenizer()
>>> wpt.tokenize("This is Bob's sandwich.")
['This', 'is', 'Bob', "'", 's', 'sandwich', '.']
>>> tbt.tokenize("This is Bob's sandwich.")
['This', 'is', 'Bob', "'s", 'sandwich', '.']
>>> pos_tag(wpt.tokenize("This is Bob's sandwich."))
[('This', 'DT'), ('is', 'VBZ'), ('Bob', 'NNP'), ("'", 'POS'), ('s', 'NNS'), ('sandwich', 'VBP'), ('.', '.')]
>>> pos_tag(tbt.tokenize("This is Bob's sandwich."))
[('This', 'DT'), ('is', 'VBZ'), ('Bob', 'NNP'), ("'s", 'POS'), ('sandwich', 'NN'), ('.', '.')]
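If neither of those splits is what you want, one option is a regular-expression tokenizer that leaves the apostrophe inside the word. The following is only a sketch using the standard-library re module; the pattern and the name tokenize_keep_possessives are made up for illustration and are not part of NLTK. The same pattern could probably be handed to NLTK's RegexpTokenizer, but I have not shown that here.

```python
import re

# Hypothetical pattern (not from NLTK): a run of word characters with
# optional internal apostrophes (Bob's, don't), or else any single
# character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def tokenize_keep_possessives(text):
    """Split text into tokens, leaving forms like "Bob's" in one piece."""
    return TOKEN_RE.findall(text)


tokens = tokenize_keep_possessives("This is Bob's sandwich.")
print(tokens)  # ['This', 'is', "Bob's", 'sandwich', '.']
```

Note that a trailing plural possessive such as dogs' is still split into dogs and ', because the pattern only keeps an apostrophe that is followed by more word characters.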

--
You received this message because you are subscribed to the Google Groups "nltk-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nltk-users+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Alexis Dimitriadis

Feb 13, 2015, 5:51:48 PM

to nltk-...@googlegroups.com

What I am trying to get is Bob's/NP, or whatever tag corresponds to a possessive noun. How do I get the n-gram taggers and the word/sentence tokenizers to honor possessives?

On 13 Feb 2015, at 22:56, Fred Mailhot <fred.m...@gmail.com> wrote:

The different tokenizers will split those differently.

Whatever tokenizer you select, you'll need to use a part of speech tagger that's been trained with the same tokenization style.

The Brown corpus is tokenized in the style you want, so maybe you can use it as a starting point: "Bob's" would be tagged NP$ (noun, proper, genitive).
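To illustrate why the tokenizer and the tagger's training data have to agree, here is a toy lookup tagger in plain Python. It is a sketch only: the two training sentences are hand-written in Brown style for illustration (they are not taken from the corpus), and train_unigram and tag are hypothetical helpers, not NLTK functions.

```python
from collections import Counter, defaultdict

# Hand-written, Brown-style training data (illustrative only, not taken
# from the corpus): the possessive stays attached and is tagged NP$.
TRAIN = [
    [("This", "DT"), ("is", "BEZ"), ("Bob's", "NP$"), ("sandwich", "NN"), (".", ".")],
    [("Bob's", "NP$"), ("sandwich", "NN"), ("is", "BEZ"), ("cold", "JJ"), (".", ".")],
]


def train_unigram(tagged_sents):
    """Map each token to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for token, tag_name in sent:
            counts[token][tag_name] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


def tag(model, tokens):
    """Tag tokens by lookup; tokens never seen in training get None."""
    return [(t, model.get(t)) for t in tokens]


model = train_unigram(TRAIN)

# Same tokenization style as the training data: "Bob's" is found.
same_style = tag(model, ["This", "is", "Bob's", "sandwich", "."])
# Treebank-style split: "Bob" and "'s" were never seen as tokens.
split_style = tag(model, ["This", "is", "Bob", "'s", "sandwich", "."])

print(same_style)
print(split_style)
```

With matching tokenization the possessive comes back as NP$; with the split tokens, both Bob and 's are unknown to the model. A real tagger has a backoff for unknown words, but it still never gets the chance to assign NP$ to a token it is not given.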

Alexis

Tom Merriewether

Feb 17, 2015, 9:03:18 AM

to nltk-...@googlegroups.com

Thanks. My problem seems to be that the possessive apostrophe causes a Python problem. I am extracting corpus text from an XML document, replacing the Unicode sequence with a real apostrophe, and writing the text out as UTF-8. For some reason the TnT tagger I am using fails on these inputs and throws an exception. I am struggling to find a way around this, since I am a relative newcomer to Python and the whole ASCII/Unicode mash-up is confusing.
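For reference, the replacement step described above might look roughly like this. It is a sketch under several assumptions: that the "unicode sequence" is a curly quote such as U+2019, that the text is already a decoded string, and that the code runs on Python 3 (Python 2 treats str and unicode differently). The name normalize_apostrophes is made up for illustration, and this does not by itself explain the tagger exception.

```python
# Assumption: the problematic characters are the curly single quotes
# U+2018 and U+2019; both are mapped to the plain ASCII apostrophe.
CURLY_TO_ASCII = {0x2018: "'", 0x2019: "'"}


def normalize_apostrophes(text):
    """Replace curly single quotes in a decoded string with "'"."""
    return text.translate(CURLY_TO_ASCII)


raw = "This is Bob\u2019s sandwich."
clean = normalize_apostrophes(raw)

# Encode explicitly when writing out, and decode explicitly when
# reading back, so text and bytes are never mixed implicitly.
data = clean.encode("utf-8")
round_trip = data.decode("utf-8")

print(clean)  # This is Bob's sandwich.
```

The point of the explicit encode and decode is to keep the text as Unicode strings inside the program and only deal with bytes at the file boundary.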


Tom Merriewether

Feb 17, 2015, 9:06:17 AM

to nltk-...@googlegroups.com

Thanks.

I am actually training on the Brown corpus (the "learned" category of the corpus). I'll look at this again, since it should properly identify the genitive case. Thanks again.

