The wordnet lemmatizer problem


Raymond

Apr 14, 2010, 2:29:12 AM
to nltk-users
Dear all,


The WordNet lemmatizer turns 'was' into 'wa', which is rather absurd, isn't it?
It also mis-lemmatizes many other words.


Anyone know how to fix this?


Raymond

Sandra Derbring

Apr 14, 2010, 3:07:06 AM
to nltk-...@googlegroups.com
Hi Raymond,

When I write morphy('was', 'verb') in my Python interpreter, I get 'be', so that doesn't seem to be a problem with the lemmatizer itself. If I have understood correctly, WordNet first checks an exception list to see whether the word has an irregular form, and if the word is not found there, it goes on stripping suffixes from the end until it finds a word, following the rules of detachment for each syntactic category. You can read more about it here: http://wordnet.princeton.edu/wordnet/man/morphy.7WN.html

It sounds like 'was' would be turned into 'wa' if you applied the verb detachment rule for an 's' ending. But I'm pretty sure 'was' is in the exception file for verbs; check the file 'verb.exc'. Unless you happened to erase that line from the file (or didn't specify the right syntactic category), I can't really think of an explanation for your result. With what other words do you have problems? WordNet's morphological function does have bugs that allow transformations of non-words, or produce unintended forms. However, it's my understanding that the word you get as a result must be a real word (one found in WordNet).

I'm guessing there are people who know more about this than I do. I might also be wrong about any of this, so I'd be interested in hearing more about the reasoning behind this problem.

Cheers,
Sandra

2010/4/14 Raymond <gunbus...@gmail.com>

--
You received this message because you are subscribed to the Google Groups "nltk-users" group.
To post to this group, send email to nltk-...@googlegroups.com.
To unsubscribe from this group, send email to nltk-users+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/nltk-users?hl=en.


Raymond

Apr 14, 2010, 3:25:19 AM
to nltk-users
Hi Sandra,
Thanks for your reply. Does the WordNet library come with NLTK?
I haven't manually installed WordNet on my machine; could that cause the
problem? And where should 'verb.exc' be located?

Regards,
Raymond

Sandra Derbring

Apr 14, 2010, 4:16:31 AM
to nltk-...@googlegroups.com
My spontaneous answer would be yes, WordNet comes with NLTK, but thinking about it, I'm not really sure. What system do you have? How did you start WordNet and get these results? Or how do you start any other part of NLTK?

Anyway, you can download WordNet from this page: http://wordnet.princeton.edu/wordnet/download/
The .exc files should be in there, so you can at least have a look at them manually. And here is the NLTK download page:
http://www.nltk.org/download


Raymond

Apr 14, 2010, 4:36:10 AM
to nltk-users
I use openSUSE (a Linux distribution) with Python 2.6.4 and NLTK 2.0b.

Here is my Python code:
------------------------------------------------------
import nltk

raw = """DENNIS: Listen, strange women lying in ponds distributing
swords
was no basis for a system of government. Supreme executive power
derives from
a mandate from the masses, not from some farcical aquatic ceremony.
They were so stupid."""

tokens = nltk.word_tokenize(raw)

wnl = nltk.WordNetLemmatizer()
print [wnl.lemmatize(t) for t in tokens]
------------------------------------------------------------

and result:
---------------------------------------------------------
lamwaiman@cs6201:~/python/data/extract> python stem.py
['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in',
'pond', 'distributing', 'sword', 'wa', 'no', 'basis', 'for', 'a',
'system', 'of', 'government.', 'Supreme', 'executive', 'power',
'derives', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not',
'from', 'some', 'farcical', 'aquatic', 'ceremony.', 'They', 'were',
'so', 'stupid', '.']
lamwaiman@cs6201:~/python/data/extract>

-------------------------------------------------------------

so 'was' -> 'wa'
and
'were' is still 'were'



Sandra Derbring

Apr 14, 2010, 5:12:54 AM
to nltk-...@googlegroups.com
Well, I see that we use different methods: I don't use the stemmer's version of the lemmatizer, only the morphy function by itself. I tried to test your code, but I can't get it to work. I don't know whether it's because I have a different version of Python (2.6), Linux (Ubuntu) or NLTK (3.0)? I can import as far as the wordnet module of the stemmer, but it refuses to find WordNetLemmatizer for me.

If you try:

from nltk.wordnet import *


morphy('was', 'verb')

What do you get as a result? The disadvantage for you here is that you have to specify the POS for each word, and that's not really applicable if you want to put in a whole text with no tags...

Anyone else who knows why this doesn't work for me, or why the stemmer generates this result for Raymond?


Raymond

Apr 14, 2010, 5:18:34 AM
to nltk-users
I found out the reason.
Here is the source of the lemmatize() function of the WordNet lemmatizer in NLTK:
------------------------------------------------------------
def lemmatize(self, word, pos=NOUN):
    lemma = _wordnet.morphy(word, pos)
    if not lemma:
        lemma = word
    return lemma
------------------------------------------------------------

It always assumes the word is a noun and only does noun suffix removal,
which is why my code above gives that result.


I tried lemmatize() with pos='v', and then 'was' and 'were' were correctly
converted to 'be', but this time the nouns were left out.

code:
------------------------------------------------------------------------------
print [wnl.lemmatize(t, pos='v') for t in tokens]

------------------------------------------------------------------------------
result:
------------------------------------------------------------------------------
['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'lie', 'in',
'ponds', 'distribute', 'swords', 'be', 'no', 'basis', 'for', 'a',
'system', 'of', 'government.', 'Supreme', 'executive', 'power',
'derive', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not',
'from', 'some', 'farcical', 'aquatic', 'ceremony.', 'They', 'be',
'so', 'stupid', '.']

---------------------------------------------------------
I am trying to:
1) find out all the possible POS values this lemmatizer accepts, because
the documentation doesn't say anywhere;
2) loop through the available POS values until a lemmatized word
appears.

Raymond

Apr 14, 2010, 5:26:45 AM
to nltk-users
Dear Sandra,

if the POS must be provided, it seems I have fallen into a catch-22.
I wanted to do suffix removal before POS tagging (my text is not
tagged) to achieve better accuracy, but I end up needing to tag each
word before I can use the lemmatizer.
What a contradiction.

Regards,
raymond

Sandra Derbring

Apr 14, 2010, 5:53:13 AM
to nltk-...@googlegroups.com
Ah, so the lemmatizer uses noun as the default value. Well, then at least we know why. But I do see your problem. I have to do this too in my application, looping through all the POS values when I lemmatize, which is why I use morphy. But I do want all the possible POS outcomes, so for me it's not an unnecessary code snippet.

But I use the lemmatizer as a step in my tagging process, which is how you could see it too. If you loop through the POS values and lemmatize the word, then you have both the lemma and the POS tag at the end of it, and you don't need to tag it again.

The problem would be knowing whether you should just take the first POS that matches, or go further and somehow select among all the possible POS readings you get for one word. But that would be a problem when you tag anyway, wouldn't it? You'd need context to decide.

Anyway, the pos that can be used are the following:

1    NOUN
2    VERB
3    ADJECTIVE
4    ADVERB
5    ADJECTIVE_SATELLITE

However, adverb is not really applicable, since adverbs can't really be inflected. And if you use adj_sat, it will just be treated like an ordinary adjective. So the usable POS values are really noun, verb and adjective, and I shorten them as 'noun', 'verb' and 'adj'.




Raymond

Apr 14, 2010, 6:20:08 AM
to nltk-users
Hi Sandra,
You mention that "I have to do this too in my application, loop
through all poses when I lemmatize, which is why I use morphy." How do
you determine which lemmatization is the right one?

Raymond


Sandra Derbring

Apr 14, 2010, 7:07:37 AM
to nltk-...@googlegroups.com
Well, I don't, really. For each word, I get all the possible POS readings and then I compare the frequencies of those words. Within each POS I pick the candidate that is most likely, and then I compare the candidates across the POS categories. The POS whose candidate has the highest frequency wins, although I store all the other possibilities too, so that you can change the choice if that wasn't the intended reading of the word in question. But I don't process text chunks, just word lists where the words aren't connected to each other.

I have to say, the frequency approach gets me pretty decent results. Most of the words get the right part of speech, although some adjectives/nouns and adverbs/adjectives are mixed up. That might of course have something to do with the fact that we have a verb indicator in that word list, so the verbs don't need to be compared against any other POS; otherwise there might have been a lot of mix-ups between them and nouns too.


Pedro Marcal

Apr 14, 2010, 7:57:11 AM
to nltk-...@googlegroups.com
Hi Sandra,
Nice application of Zipf's law. I use this in my Japanese-to-English translator, where I don't have a tagged Japanese corpus. When you do have a corpus, and at the cost of more computing, you can do much better by introducing Design of Experiments methods.
Regards,
Pedro