HFST as a morphology/FST component in NLTK

527 views
Skip to first unread message

Sam Hardwick

unread,
Apr 26, 2012, 3:35:45 PM4/26/12
to nltk...@googlegroups.com
HFST (Helsinki Finite-State Technology, a research group at the University of Helsinki) has developed a Python package for doing lookup on weighted finite-state transducers, and we believe this might prove useful as a morphology component for NLTK.

Our main software package is a C++ library that interfaces between a lot of finite-state libraries and formalisms (OpenFst, SFST, foma, xfst), and we also develop specific finite-state applications on top of these. Often the end product is used via transducer lookup (or "apply down" or similar in xfst), where an input (say, an inflected word form) is transformed (into, say, a morphological analysis, base form, valence class, what have you). Classic tasks like morphological analysis, morphological generation, stemming and hyphenation are available for a number of languages, and there have been efforts into things like synonym retrieval and inflecting translation dictionaries.

For lookup, there's native Python code (which is not that fast), and a C++ backend with SWIG bindings, which is quite fast. Take a look at the Python package here https://sourceforge.net/projects/hfst/files/optimized-lookup/ and see if you can run the stuff in hfst-optimized-lookup-python. /swig contains the fast C++ stuff.

There are a number of prebuilt morphological transducers at https://sourceforge.net/projects/hfst/files/morphological-transducers/ . For instance, if you get hfst-french.tar.gz and successfully install the lookup package, a sample session might be

>>> import hfst_lookup
>>> t = hfst_lookup.Transducer("french.hfst.ol") 
>>> results = t.lookup("passant")
>>> print(results)

(('passer+verb+participle+present', 0.0), ('passant+adjective+singular+masculine', 0.0), ('passant+commonNoun+masculine+singular', 0.0)) 

>>> # the results come back as a tuple of pairs of the output and a weight, zero here for an unweighted transducer

>>> # or let's get the possible analyses for each word in the sentence

>>> sentence = "ma tante est fou"
>>> map(lambda x: (x, t.lookup(x)), sentence.split()) 

[('ma', (('mon+functionWord', 0.0), ('ma+functionWord', 0.0))), ('tante', (('tante+commonNoun+feminine+singular', 0.0),)), ('est', (('est+commonNoun+masculine+singular', 0.0), ('être+verb+singular+indicative+present+thirdPerson', 0.0))), ('fou', (('fou+adjective+singular+masculine', 0.0), ('fou+commonNoun+masculine+singular', 0.0)))]

The package is ready for playing around NLTK with, but in principle it could also be included in NLTK proper if it proves useful. The relevant code is currently dually licensed under Apache and GPL3.

We are aware that inclusion in NLTK itself might require looking into compatibility and coding standards, and could be able to help with that.

Alex Rudnick

unread,
Apr 27, 2012, 12:30:19 AM4/27/12
to nltk...@googlegroups.com
Hey Sam,

This sounds really neat! People occasionally ask on the nltk-users
list how to best do FSTs with Python; I think pyopenfst is not
well-maintained, last I heard.

This package could be really useful for my group's work on MT for
morphologically rich languages: it would be cool if our transduction
was done by some fast C++ code underneath.

If you'd like to get this integrated into NLTK, maybe a good first
step would be making a package in nltk-contrib? I would be happy to
help with this, if you need help. Then maybe we could get discussion
going about integrating it into core NLTK, if the rest of the group
thinks it's a good idea. (but putting things into nltk-contrib can be
done basically unilaterally, as I understand it.)
Thanks, and cheers!

--
-- alexr

Steven Bird

unread,
Apr 28, 2012, 1:32:43 AM4/28/12
to nltk...@googlegroups.com
Hi Sam,

Let me echo Alex's enthusiasm for this! Morphology is a major hole in
NLTK. The ideal solution would include a pure Python implementation,
and wrappers for state-of-the-art implementations in other languages.

It looks like this is what you're offering, at least for the lookup
problem (morphological analysis). I've tested the Python and SWIG
interfaces and they work nicely. But where do things stand with
synthesis, and with compiling efficient FSTs from some high-level
language (ideally the whole FS algebra)?

I see that the swig interface uses GPLv3, which isn't compatible with
Apache. Are you able to change it to Apache?

As for NLTK's coding standards, here's the starting point:
http://code.google.com/p/nltk/wiki/DevelopersGuide
(let's go offline with any low-level issus here)

As for the process, another option to the one Alex suggested would be
for you to clone NLTK on github and add your package there.

There's a more general issue, that we probably cannot fully resolve
here, namely the design of a generic FST API for NLTK. Your
contribution would then be one implementation of this API, and it
would be possible to add other implementations later. I'm happy for
us to make a "best effort" and let it evolve as the need arises, but I
would encourage you to think along these lines from the start. Most
of NLTK's subpackages are examples of this approach.

Thanks,
-Steven Bird

Sam Hardwick

unread,
Apr 30, 2012, 4:59:16 AM4/30/12
to nltk...@googlegroups.com
If you'd like to get this integrated into NLTK, maybe a good first
step would be making a package in nltk-contrib?

Sounds like a good idea - I don't have the time to do this right now,  but I'll see if I can figure it out when I get back from travels in a couple of weeks.

But where do things stand with synthesis, and with compiling efficient FSTs from some high-level 
language (ideally the whole FS algebra)?

We have a whole grab-bag of stuff for that, but it isn't very nice to deal with in Python for the time being. We have implementations of lexc (continuation classes approach) and twolc (two-level morphophonemic rule approach), the SFST programming language (a regexp-like system), xfst (another regexp-like system), an API and a bunch of command-line tools for FST algebraic operations, conversion functions between various formats etc. We also handle some finicky things like flag diacritics (a way to handle long-distance dependencies in morphology) & identity and unknown transitions. Default transitions are understood by some tools as well.

There's an experimental SWIG binding of the most top-level part of our API, but it needs a lot of polishing for being able to do things like building transducers arc-by-arc in Python. You do get things like the basic FST algebra, and reading & writing transducers. Also, nothing builds in Windows currently.
 
I see that the swig interface uses GPLv3, which isn't compatible with 
Apache.  Are you able to change it to Apache?

For the lookup stuff, yeah, it's really dual-licensed, I should really add the Apache license there. For the rest of our code, no, it's GPL3.

There's a more general issue, that we probably cannot fully resolve here, namely the design of a generic FST API for NLTK.  Your contribution would then be one implementation of this API, and it would be possible to add other implementations later.

 Well, such an api would presumably support at least reading transducer binaries from disk and performing transducer lookup, and for the rest we don't have anything Python-ready. So the starting path is clear. For the remainder, we do have a public API of our top-level transducer class http://hfst.sourceforge.net/hfst3/classhfst_1_1HfstTransducer.html which may be a guide.
Reply all
Reply to author
Forward
0 new messages