The WordNet lemmatizer lemmatized 'was' into 'wa'. It is so hilarious,
isn't it?
It also has many other mis-lemmatization problems.
Does anyone know how to fix this?
Raymond
--
You received this message because you are subscribed to the Google Groups "nltk-users" group.
To post to this group, send email to nltk-...@googlegroups.com.
To unsubscribe from this group, send email to nltk-users+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/nltk-users?hl=en.
Regards,
Raymond
On Apr 14, 3:07 pm, Sandra Derbring <sandra.derbr...@gmail.com> wrote:
> Hi Raymond,
>
> When I type morphy('was', 'verb') into my Python interpreter, I get 'be',
> so that doesn't seem to be a problem for the lemmatizer. If I have
> understood correctly, WordNet first checks an exception list to see if the
> word has an irregular form. If the word is not found there, it strips
> suffixes from the end of the word until it finds a known word, following the
> rules of detachment for each syntactic category. You can read more about it
> here: http://wordnet.princeton.edu/wordnet/man/morphy.7WN.html
>
> It sounds like 'was' would be turned into 'wa' if you applied the verb
> detachment rules for an 's' ending. But I'm pretty sure 'was' is in the
> exception file for verbs. Check the file 'verb.exc'. Unless you happened to
> erase that line from the file (or didn't specify the right syntactic
> category), I can't really think of an explanation for your result. Which
> other words do you have problems with? WordNet's morphological function does
> have bugs that allow non-words to be transformed, or that produce unintended
> forms. However, it's my understanding that the word you get as a result
> must be a real word (found in WordNet).
>
> I'm guessing there are people who know more about this than me. I also might
> be wrong about any of this, so I'd be interested in hearing more about the
> reasoning behind this problem.
>
> Cheers,
> Sandra
>
> 2010/4/14 Raymond <gunbuster...@gmail.com>
>
>
>
> > Dear all,
>
> > The WordNet lemmatizer lemmatized 'was' into 'wa'. It is so hilarious,
> > isn't it?
> > It also has many other mis-lemmatization problems.
>
> > Does anyone know how to fix this?
>
> > Raymond
>
Here is my Python code:
------------------------------------------------------
import nltk

# Sample text to tokenize and lemmatize
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
was no basis for a system of government. Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony.
They were so stupid."""
tokens = nltk.word_tokenize(raw)
wnl = nltk.WordNetLemmatizer()
# lemmatize() is called without a pos argument here
print([wnl.lemmatize(t) for t in tokens])
------------------------------------------------------------
and result:
---------------------------------------------------------
lamwaiman@cs6201:~/python/data/extract> python stem.py
['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in',
'pond', 'distributing', 'sword', 'wa', 'no', 'basis', 'for', 'a',
'system', 'of', 'government.', 'Supreme', 'executive', 'power',
'derives', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not',
'from', 'some', 'farcical', 'aquatic', 'ceremony.', 'They', 'were',
'so', 'stupid', '.']
lamwaiman@cs6201:~/python/data/extract>
-------------------------------------------------------------
so 'was' -> 'wa'
and
'were' is still 'were'
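For anyone following along, here is a toy sketch of the lookup order Sandra describes: exception list first, then suffix detachment, keeping only candidates that exist in the dictionary. The word lists are tiny hand-made stand-ins and toy_morphy is a hypothetical name, not NLTK's or WordNet's actual code. The suffix rules are a subset of those on the morphy man page. It also assumes 'wa' is a noun entry in the dictionary, which would explain the output above.

```python
# Toy model of a morphy-style lookup. All word lists are hypothetical
# stand-ins chosen for this example; they are not real WordNet data.

NOUN_EXCEPTIONS = {"women": "woman"}
VERB_EXCEPTIONS = {"was": "be", "were": "be"}

# (suffix, replacement) pairs: a subset of the documented detachment rules
NOUN_RULES = [("s", ""), ("ses", "s"), ("men", "man"), ("ies", "y")]
VERB_RULES = [("s", ""), ("ies", "y"), ("es", "e"), ("es", ""),
              ("ed", "e"), ("ed", ""), ("ing", "e"), ("ing", "")]

# Assumption: 'wa' is listed as a noun, 'be' as a verb
NOUN_LEXICON = {"woman", "pond", "sword", "wa", "mass"}
VERB_LEXICON = {"be", "derive", "distribute"}


def toy_morphy(word, exceptions, rules, lexicon):
    """Return a base form for `word`, or None if no rule produces a known word."""
    # 1) Irregular forms are looked up in the exception list first.
    if word in exceptions:
        return exceptions[word]
    # 2) A word that is already a dictionary entry is returned as-is.
    if word in lexicon:
        return word
    # 3) Otherwise try each detachment rule and keep the first known result.
    for suffix, replacement in rules:
        if word.endswith(suffix):
            candidate = word[:len(word) - len(suffix)] + replacement
            if candidate in lexicon:
                return candidate
    return None


print(toy_morphy("was", NOUN_EXCEPTIONS, NOUN_RULES, NOUN_LEXICON))
print(toy_morphy("was", VERB_EXCEPTIONS, VERB_RULES, VERB_LEXICON))
print(toy_morphy("were", NOUN_EXCEPTIONS, NOUN_RULES, NOUN_LEXICON))
```

Under the noun rules 'was' loses its 's' and 'wa' is accepted, while 'were' matches no noun rule at all, which is presumably why the lemmatizer hands it back unchanged.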
On Apr 14, 4:16 pm, Sandra Derbring <sandra.derbr...@gmail.com> wrote:
> My spontaneous reply would be that yes, WordNet comes with NLTK, but
> thinking about it, I'm not really sure. What system do you have? How did you
> start WordNet and get these results? Or how do you start any other NLTK
> section?
>
> Anyway, you can download WordNet from this page: http://wordnet.princeton.edu/wordnet/download/
> The .exc files should be in there, so you can at least have a look at
> them manually. And here is the NLTK download page: http://www.nltk.org/download
It always assumes the word is a noun and only applies the noun suffix-removal
rules, which explains the result of my code above.
I tried lemmatize(t, pos='v'), and then 'was' and 'were' were correctly
converted to 'be',
but this time the nouns were left unlemmatized.
Code:
------------------------------------------------------------------------------
print([wnl.lemmatize(t, pos='v') for t in tokens])
------------------------------------------------------------------------------
result:
------------------------------------------------------------------------------
['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'lie', 'in',
'ponds', 'distribute', 'swords', 'be', 'no', 'basis', 'for', 'a',
'system', 'of', 'government.', 'Supreme', 'executive', 'power',
'derive', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not',
'from', 'some', 'farcical', 'aquatic', 'ceremony.', 'They', 'be',
'so', 'stupid', '.']
---------------------------------------------------------
I am trying to:
1) find out all the possible pos values accepted by this lemmatizer, because
the documentation doesn't list them anywhere;
2) loop through the available pos values until a lemmatized word
appears.
If pos must be provided, it seems I have fallen into a catch-22:
I want to remove suffixes before POS-tagging the words (my
text is not tagged) to achieve better accuracy, but I end up needing to tag
each word before I can use the lemmatizer.
What a contradiction.
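A rough sketch of what step 2 could look like, for illustration only. The names lemmas_by_pos, first_lemma and toy_lookup are hypothetical, and toy_lookup is a hand-made table standing in for a morphy-style function that returns a lemma or None. The single-letter codes 'n', 'v', 'a' are assumed to be the pos values the lemmatizer accepts.

```python
# Sketch: try each part of speech in turn. `lookup` is any function
# taking (word, pos) and returning a lemma string, or None if nothing
# is found for that part of speech.

POS_ORDER = ("n", "v", "a")  # assumed codes for noun, verb, adjective


def lemmas_by_pos(word, lookup, pos_order=POS_ORDER):
    """Return {pos: lemma} for every part of speech that yields a lemma."""
    found = {}
    for pos in pos_order:
        lemma = lookup(word, pos)
        if lemma is not None:
            found[pos] = lemma
    return found


def first_lemma(word, lookup, pos_order=POS_ORDER):
    """Return (pos, lemma) for the first hit, or (None, word) if none."""
    for pos in pos_order:
        lemma = lookup(word, pos)
        if lemma is not None:
            return pos, lemma
    return None, word


# Hand-made stand-in for a real dictionary lookup (demonstration only).
_TOY_TABLE = {
    ("was", "n"): "wa",
    ("was", "v"): "be",
    ("were", "v"): "be",
    ("ponds", "n"): "pond",
}


def toy_lookup(word, pos):
    return _TOY_TABLE.get((word, pos))


print(lemmas_by_pos("was", toy_lookup))
print(first_lemma("was", toy_lookup))
print(first_lemma("was", toy_lookup, pos_order=("v", "n", "a")))
```

Stopping at the first hit makes the answer depend on the order of the list ('was' gives 'wa' when nouns are tried first), so collecting every candidate and choosing by context may be safer. As far as I can tell, NLTK's wordnet.morphy takes (word, pos) and returns None on failure, so it could probably be passed in as lookup, but please check that against your installed version.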
Regards,
raymond
1 NOUN
2 VERB
3 ADJECTIVE
4 ADVERB
5 ADJECTIVE_SATELLITE
However, adverbs are not applicable, since they can't really be
inflected. And if you use adj_sat, it will just be treated like a
regular adjective. So the usable parts of speech are really noun, verb and
adjective, and I shorten them as 'noun', 'verb' and 'adj'.
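If tagging first turns out to be acceptable, the tags have to be translated into these categories before lemmatizing. Below is a minimal sketch, assuming the tagger emits Penn Treebank tags (NN, VBD, JJ, RB, ...) and that the lemmatizer accepts the single-letter codes 'n', 'v', 'a', 'r'. The function name penn_to_wordnet is made up for this example.

```python
# Map Penn Treebank tags to single-letter part-of-speech codes.
# Assumption: 'n', 'v', 'a', 'r' are the codes for noun, verb,
# adjective and adverb expected by the lemmatizer.

def penn_to_wordnet(tag, default="n"):
    """Return the pos code for a Penn Treebank tag, or `default`."""
    if tag.startswith("JJ"):   # JJ, JJR, JJS
        return "a"
    if tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ
        return "v"
    if tag.startswith("NN"):   # NN, NNS, NNP, NNPS
        return "n"
    if tag.startswith("RB"):   # RB, RBR, RBS
        return "r"
    return default             # everything else falls back to the default


print(penn_to_wordnet("VBD"))
print(penn_to_wordnet("NNS"))
print(penn_to_wordnet("IN"))
```

With (token, tag) pairs from a tagger, each token could then be passed to lemmatize together with penn_to_wordnet(tag). Falling back to noun for other tags is just a choice made here, not something the library requires.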
Raymond
On Apr 14, 5:53 pm, Sandra Derbring <sandra.derbr...@gmail.com> wrote:
> Ah, so the lemmatizer uses noun as the default value. Well, then at
> least we know why. But I do see your problem. I have to do this too in my
> application: loop through all the parts of speech when I lemmatize, which is
> why I use morphy. But I do want all the possible pos outcomes, so for me it's
> not an unnecessary code snippet.
>
> But I use the lemmatizer as a step in my tagging, which is how you could
> see it too. If you loop through the parts of speech and lemmatize the word,
> then you have both the lemma and the pos tag at the end of it, and you don't
> need to tag it again.
>
> The problem would be to know whether you should just take the first pos that
> yields a result or whether you should go further and somehow select from all
> the possible parts of speech you get for one word. But this would be a problem
> when you tag anyway, wouldn't it? You'd need context to decide.
>
> Anyway, the pos values that can be used are the following:
>
> 1 NOUN
> 2 VERB
> 3 ADJECTIVE
> 4 ADVERB
> 5 ADJECTIVE_SATELLITE
>
> However, adverbs are not applicable, since they can't really be inflected.
> And if you use adj_sat, it will just be treated like a regular adjective. So
> the usable parts of speech are really noun, verb and adjective, and I shorten
> them as 'noun', 'verb' and 'adj'.