HFST toolkit to NLTK?

Erik Axelson

Sep 25, 2017, 6:33:39 AM

to nltk-dev

Hi all,

I work on the HFST (Helsinki Finite-State Technology) project (main page: http://hfst.github.io/). We have talked about getting HFST included as part of the NLTK toolkit. There is a Python interface for HFST that uses C and C++ extensions. It is available on PyPI at https://pypi.python.org/pypi/hfst.

Is it possible to add Python modules that use C++ extensions under NLTK? HFST uses three backend finite-state libraries, two of which are written in C++ (SFST and OpenFST) and one in C (foma). It is possible to configure HFST without some of these libraries, but support for weighted finite-state operations requires OpenFST, which is written in C++. We have also released finite-state morphologies in a binary format that can be processed with a standalone fast-lookup tool, which is also written in C++.

Best regards,

Erik

liling tan

Sep 27, 2017, 9:26:06 PM

to nltk-dev

Would like to see more FST integration in NLTK too!

Steven Bird

Sep 27, 2017, 10:12:18 PM

to nltk-dev

Hi Erik,

Thanks for the suggestion! NLTK has long been without support for morphological processing.

Can you please tell us how you think the integration would go? E.g., considering a simple NLP pipeline such as [1], what would the HFST functions do? Suppose we have read from a text corpus and tokenized it; the next step might be morphological analysis. What then, parsing?
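To make the question concrete, here is a minimal sketch of the "tokenize, then analyse" stage of such a pipeline. Everything in it is an assumption for illustration: the toy lexicon, the tag format, and the function names `tokenize`, `toy_analyse`, and `analyse_text` are hypothetical, and the dictionary lookup merely stands in for whatever a real finite-state analyser would provide. It is not HFST's or NLTK's API.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical toy lexicon standing in for a real finite-state analyser.
# The "lemma+Tag" strings are an illustrative format, not HFST output.
TOY_LEXICON: Dict[str, List[str]] = {
    "the": ["the+Det"],
    "cats": ["cat+N+Pl"],
    "walked": ["walk+V+Past"],
}


def tokenize(text: str) -> List[str]:
    """Split text into lowercase word tokens (stand-in for a real tokenizer)."""
    return re.findall(r"\w+", text.lower())


def toy_analyse(token: str) -> List[str]:
    """Return all analyses for a token, or a single '+?' guess if unknown."""
    return TOY_LEXICON.get(token, [token + "+?"])


def analyse_text(
    text: str, analyse: Callable[[str], List[str]] = toy_analyse
) -> List[Tuple[str, List[str]]]:
    """Tokenize the text, then pair each token with its analyses."""
    return [(tok, analyse(tok)) for tok in tokenize(text)]


if __name__ == "__main__":
    for token, analyses in analyse_text("The cats walked."):
        print(token, analyses)
```

Because the analyser is passed in as a plain callable, a wrapper around a real morphological analyser could be swapped in for `toy_analyse` without changing the rest of the pipeline; a later parsing step would consume the (token, analyses) pairs.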

Also, we could think about any corpora of morphologically-analysed forms, and then see how well a particular morphological analyser performs against it. Or else, something like [2].
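As a rough sketch of what such an evaluation could look like, the code below scores an analyser against a small gold standard by exact match on the set of analyses per word form. The gold data, the tag format, and the names `toy_analyse` and `accuracy` are hypothetical; a real evaluation would read an existing annotated corpus and might well use a different metric.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Hypothetical toy analyser: a lookup table standing in for a real tool.
TOY_LEXICON: Dict[str, List[str]] = {
    "cats": ["cat+N+Pl"],
    "walked": ["walk+V+Past"],
}


def toy_analyse(form: str) -> List[str]:
    """Return the analyses for a word form (empty list if unknown)."""
    return TOY_LEXICON.get(form, [])


def accuracy(
    analyse: Callable[[str], List[str]],
    gold: Iterable[Tuple[str, Set[str]]],
) -> float:
    """Fraction of forms whose predicted analysis set equals the gold set."""
    gold = list(gold)
    if not gold:
        return 0.0
    correct = sum(1 for form, expected in gold if set(analyse(form)) == expected)
    return correct / len(gold)


if __name__ == "__main__":
    # Hypothetical gold standard: (word form, set of correct analyses).
    gold_data = [
        ("cats", {"cat+N+Pl"}),
        ("walked", {"walk+V+Past"}),
        ("dogs", {"dog+N+Pl"}),
    ]
    print(f"accuracy: {accuracy(toy_analyse, gold_data):.2f}")
```

Exact set match is the strictest choice; for ambiguous forms one could instead report precision and recall over individual analyses.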

I'm asking these questions because we need to be clear about what value is added to NLTK by including more functionality, which we then need to maintain. Would it work just as well as a separate package? The best outcome would be a tightly integrated and well-documented component that makes it easier to process morphologically complex languages using Python.

As it happens, this topic is close to my heart as I've been living and working in an Indigenous community where a polysynthetic language is spoken. Kunwinjku verbs have up to 16 slots, including positions for adverbs and nouns [3].

-Steven Bird

[1] http://www.nltk.org/book/ch03.html#fig-pipeline1

[2] Cotterell et al 2016 The SIGMORPHON 2016 Shared Task—Morphological Reinflection

http://www.aclweb.org/anthology/W/W16/W16-2002.pdf

[3] http://stevenbird.net/kunwok
