Forcing a word to have a particular part of speech (POS) tag and running the standard tagger otherwise?

Andrew S.

Dec 13, 2009, 2:09:43 PM

to nltk-users

My apologies if this has already been answered, but I searched the
list discussions and could not find it.

In brief, I have been using the "standard" NLTK POS-tagger to tag
sentences, i.e., running code like the following:

import nltk
tokenized_text = nltk.word_tokenize(raw_text)
pos_tagged_text = nltk.pos_tag(tokenized_text)

where "raw_text" is an English-text sentence. The results I obtain
are generally fine; however, I would like to have the default POS
tags for certain words overridden, e.g., I would like to tag the
word "said" as a determiner (/DT).

I can force such changes after I generate the POS-tagged text, by
iterating through each of the tagged word/tag tuples, identifying
tuples where the word is the one I want to change the tag for, and
then changing the tag. But this seems crude and ad-hoc, so my
question is, is there any way to force certain tag definitions for the
standard tagger? Is there a dictionary that I can selectively
override?
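
For reference, the post-processing step described above can be sketched
roughly as follows. This is only a minimal sketch: OVERRIDES and
apply_overrides are hypothetical names, and the tagged sentence is
hand-written here rather than real nltk.pos_tag output.

```python
# Hypothetical override table: lowercase word -> forced tag.
OVERRIDES = {"said": "DT"}

def apply_overrides(tagged, overrides=OVERRIDES):
    """Return a new list of (word, tag) pairs with forced tags applied."""
    return [(word, overrides.get(word.lower(), tag)) for word, tag in tagged]

# Hand-written stand-in for the output of nltk.pos_tag(tokenized_text).
pos_tagged_text = [("The", "DT"), ("judge", "NN"), ("said", "VBD"), ("so", "RB")]
print(apply_overrides(pos_tagged_text))
```

Matching on word.lower() is an assumption; drop it if the overrides
should be case-sensitive.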

Thanks in advance for any help!

Best,

Andrew

Steven Bird

Dec 13, 2009, 5:59:29 PM

to nltk-...@googlegroups.com

2009/12/14 Andrew S. <andrews...@gmail.com>:
> ... I would like to have certain default POS
> tags for certain words overridden, e.g., I would like to assign the
> word "said" to be a determiner (/DT) type.

Your suggested solution seems ok to me, though it risks creating tag
sequences that aren't attested in the corpus on which the tagger was
originally trained, which might be a problem.

Any solution that overrode the tag during the tagging process, as you
suggest, risks causing more serious problems. For instance, a simple
bigram tagger would be led astray if you changed a tag during
processing, creating a sequence of tags it hadn't seen during
training.

A better approach would be to create a tagged corpus using your
favourite tagger, correct it in whatever way you like, then train a
new tagger on that corpus.
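
A rough sketch of that workflow, under some loud assumptions: the
"automatically tagged" sentences below are hand-written stand-ins for
the output of an existing tagger, OVERRIDES is a hypothetical correction
table, and a real corpus would need to be far larger than two sentences.

```python
from nltk.tag import BigramTagger, DefaultTagger, UnigramTagger

# Stand-in for a corpus tagged by an existing tagger (assumed data).
auto_tagged = [
    [("the", "DT"), ("witness", "NN"), ("said", "VBD"), ("nothing", "NN")],
    [("said", "VBD"), ("witness", "NN"), ("left", "VBD")],
]

# Hypothetical corrections to apply before retraining.
OVERRIDES = {"said": "DT"}
corrected = [
    [(word, OVERRIDES.get(word.lower(), tag)) for word, tag in sent]
    for sent in auto_tagged
]

# Train a fresh backoff chain on the corrected corpus, so the tag
# sequences the taggers learn already contain the forced tags.
t0 = DefaultTagger("NN")
t1 = UnigramTagger(corrected, backoff=t0)
t2 = BigramTagger(corrected, backoff=t1)

retagged = t2.tag(["the", "said", "witness"])
print(retagged)
```

Because the corrections are made before training, the bigram tagger
only ever sees contexts in which "said" is tagged DT.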

-Steven Bird

Jacob Perkins

Dec 16, 2009, 2:25:43 PM

to nltk-users

I agree with Steven. However, there are cases when you want to force
an override. I've been doing that by creating a very simple corpus
file, one word/tag per line, then feeding it to a UnigramTagger that
goes at the beginning of the sequential backoff chain. You do have to
be careful with this because it can throw off the rest of the taggers
in the chain.
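
Roughly, the idea looks like the sketch below. It is a minimal sketch
with assumptions: the override "file" is an in-memory list of word/tag
lines, and DefaultTagger("NN") is only a stand-in for whatever backoff
chain you already have.

```python
from nltk.tag import DefaultTagger, UnigramTagger, str2tuple

# Stand-in for the override corpus file: one word/tag entry per line.
override_lines = ["said/DT"]

# Treat each entry as a one-token tagged sentence for training.
override_sents = [[str2tuple(line)] for line in override_lines]

# Stand-in for the rest of the chain (assumption: yours will differ).
fallback = DefaultTagger("NN")

# The override tagger is the head of the chain; it defers to the
# backoff only for words it has no entry for.
override_tagger = UnigramTagger(override_sents, backoff=fallback)

result = override_tagger.tag(["he", "said", "hello"])
print(result)
```

The backoff has to be another sequential backoff tagger, so a tagger of
a different kind cannot be plugged in directly as the fallback.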

Jacob

On Dec 13, 2:59 pm, Steven Bird <s...@csse.unimelb.edu.au> wrote:
> 2009/12/14 Andrew S. <andrewschein...@gmail.com>:
