I'm working on looking up single words in a dictionary as part of a
project. To accept input in non-canonical (inflected) form, I'm using
nltk.corpus.wordnet.morphy, which works very well for English.
For other languages, the stemmers I've found in NLTK return forms
unsuitable for dictionary lookup, e.g. in Spanish:
    >>> nltk.stem.snowball.SpanishStemmer().stem('logicamente')
    u'logic'
Does anyone have a good idea for how to handle this? In the worst case
I could use lemmatizers not written in Python, but I'd like to avoid
that.
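
To illustrate the kind of lookup being asked for, here is a minimal
sketch: strip a suffix, then accept the result only if it is an actual
dictionary entry. That check is roughly what separates morphy from a
plain stemmer. The word list, the suffix rules and the name lookup_form
are all made up for illustration, and the rules are nowhere near
complete for Spanish.

```python
# Toy dictionary (unaccented, to match the example input). Hypothetical data.
DICTIONARY = {"logico", "casa", "rapido"}

# (suffix, replacement) pairs, tried in order. Hypothetical, incomplete rules.
SUFFIX_RULES = [("amente", "o"), ("mente", ""), ("as", "a"), ("s", "")]


def lookup_form(word, dictionary=DICTIONARY, rules=SUFFIX_RULES):
    """Return a dictionary headword for `word`, or None if nothing matches."""
    if word in dictionary:
        return word
    for suffix, replacement in rules:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            # Unlike a stemmer, only accept forms that really are entries.
            if candidate in dictionary:
                return candidate
    return None


print(lookup_form("logicamente"))
print(lookup_form("casas"))
```

A stemmer would stop at the bare stem ('logic' above); this sketch
returns None rather than a form the dictionary doesn't contain.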
Cheers,
--
Morten Minde Neergaard
Morphological analysis is kind of a hard problem for many languages!
You may have to find a language-specific tool in a lot of cases, and
many of them may not be in Python.
But if you want to do Spanish, Mike Gasser (my advisor) has some
Python 3 software that works pretty well for Spanish verbs. In many
cases (don't know the precision/recall), it will find the infinitive
form of a verb, given the conjugated form. There are also morphological
analyzers for a few other languages here:
http://www.cs.indiana.edu/~gasser/Research/software.html
Hope this helps!
--
-- alexr
Hi! Sorry that I didn't remember to say “thank you” right away =)
> Morphological analysis is kind of a hard problem for many languages!
> You may have to find a language-specific tool in a lot of cases, and
> many of them may not be in Python.
Indeed, and lots of the tools are closed source and/or have rotten code
bases.
> But if you want to do Spanish, Mike Gasser (my advisor) has some
> Python 3 software that works pretty well for Spanish verbs. In many
> cases (don't know the precision/recall), it will find the infinitive
> form of a verb, given the conjugated form. There are also morphological
> analyzers for a few other languages here:
In case anyone cares, I ended up writing a small wrapper around
TreeTagger. It's giving me good results.
http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
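
A minimal sketch of what such a wrapper might look like (this is not
the actual wrapper mentioned above). It assumes the tree-tagger binary
is called with the -token and -lemma options, reads one token per line
on stdin, and prints token, tag and lemma separated by tabs, with
"<unknown>" where it has no lemma. The function names, the binary path,
the parameter file and the UTF-8 encoding are assumptions to adjust for
your installation.

```python
import subprocess


def parse_treetagger_output(text):
    """Parse 'token<TAB>tag<TAB>lemma' lines into (token, tag, lemma) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank lines or anything not in the expected shape
        token, tag, lemma = parts
        # Assumed convention: "<unknown>" means no lemma; fall back to the token.
        if lemma == "<unknown>":
            lemma = token
        rows.append((token, tag, lemma))
    return rows


def lemmatize(tokens, tagger_bin, param_file):
    """Run TreeTagger on one-token-per-line input and return the parsed rows.

    `tagger_bin` and `param_file` are paths supplied by the caller
    (hypothetical; they depend on where TreeTagger is installed).
    """
    result = subprocess.run(
        [tagger_bin, "-token", "-lemma", param_file],
        input="\n".join(tokens) + "\n",
        capture_output=True,
        encoding="utf-8",  # assumption: a UTF-8 parameter file
        check=True,
    )
    return parse_treetagger_output(result.stdout)
```

For dictionary lookup, only the third field of each row (the lemma) is
needed; the tag can help disambiguate when a word has several entries.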
--
Morten Minde Neergaard