The HFST (Helsinki Finite-State Technology) research group at the
University of Helsinki has developed a Python package for doing lookup on
weighted finite-state transducers, and we believe it could prove useful as a
morphology component for NLTK.
Our main software package is a C++ library that interfaces between a number
of finite-state libraries and formalisms (OpenFst, SFST, foma, xfst), and we
also develop specific finite-state applications on top of these. Often the
end product is used via transducer lookup ("apply down" or similar in xfst),
where an input (say, an inflected word form) is transformed into an output
(say, a morphological analysis, a base form or a valence class). Classic
tasks like morphological analysis, morphological generation, stemming and
hyphenation are available for a number of languages, and there have been
efforts toward things like synonym retrieval and inflecting translation
dictionaries.
For lookup, there is a native Python implementation (which is not that fast)
and a C++ backend with SWIG bindings, which is quite fast. Take a look at the
Python package at https://sourceforge.net/projects/hfst/files/optimized-lookup/
and see if you can run the code in hfst-optimized-lookup-python. The /swig
directory contains the fast C++ backend.
>>> import hfst_lookup
>>> t = hfst_lookup.Transducer("french.hfst.ol")
>>> results = t.lookup("passant")
>>> # the results come back as a tuple of (output, weight) pairs;
>>> # the weight is zero here because the transducer is unweighted
>>> # or let's get the possible analyses for each word in a sentence
>>> sentence = "ma tante est fou"
>>> [(word, t.lookup(word)) for word in sentence.split()]
[('ma', (('mon+functionWord', 0.0), ('ma+functionWord', 0.0))),
('tante', (('tante+commonNoun+feminine+singular', 0.0),)), ('est',
('fou', (('fou+adjective+singular+masculine', 0.0),
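As a rough illustration of how the lookup results shown above could be consumed downstream, here is a minimal sketch that splits each output string into a lemma and tags and keeps the lowest-weight analyses. It does not import hfst_lookup; it only operates on data in the (output, weight) shape shown above. The names Analysis, parse_analysis and best_analyses are hypothetical helpers, not part of the package. The "+" separator is taken from the French example and may differ for other transducers, and treating a lower weight as better is an assumption.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Analysis(NamedTuple):
    """One analysis split into parts (hypothetical helper type)."""
    lemma: str
    tags: Tuple[str, ...]
    weight: float


def parse_analysis(output: str, weight: float) -> Analysis:
    # Assumption: the first "+"-separated field is the lemma and the rest
    # are tags, as in 'tante+commonNoun+feminine+singular'.
    lemma, *tags = output.split("+")
    return Analysis(lemma, tuple(tags), weight)


def best_analyses(results: Sequence[Tuple[str, float]]) -> List[Analysis]:
    # Assumption: a lower weight means a preferred analysis.
    parsed = [parse_analysis(output, weight) for output, weight in results]
    if not parsed:
        return []
    best = min(a.weight for a in parsed)
    return [a for a in parsed if a.weight == best]


# Data copied from the 'ma' entry in the session output above.
sample = (("mon+functionWord", 0.0), ("ma+functionWord", 0.0))
for analysis in best_analyses(sample):
    print(analysis.lemma, analysis.tags, analysis.weight)
```

With an unweighted transducer every weight is zero, so all analyses are kept; the filtering only matters once weights differ.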
The package is ready for playing around with alongside NLTK, but in principle
it could also be included in NLTK proper if it proves useful. The relevant
code is currently dual-licensed under Apache and GPL3.
We are aware that inclusion in NLTK itself might require looking into
compatibility and coding standards, and we would be able to help with that.