Hi,
I am a graduate student studying the theory and application of Weighted Finite-State Transducers to Speech Recognition and Natural Language Processing.
Recently I open-sourced a small collection of stand-alone Python implementations of several popular language model smoothing techniques. These include interpolated versions of Absolute Discounting, Kneser-Ney, and Modified Kneser-Ney smoothing. There are also basic utilities for building a maximum-likelihood (ML) model, counting n-grams, and evaluating/exploring the models. The project also includes a very simple, linear implementation for converting an n-gram model in standard ARPA format to an equivalent Weighted Finite-State Acceptor (WFSA). The project can be found on GitHub, and is released under the BSD license:
The project is written in pure Python, but unfortunately I did not design it with NLTK in mind; it is completely standalone. The implementations follow the seminal Chen & Goodman '98 paper on statistical language modeling, and are actually pretty efficient, even for medium-sized corpora. Nevertheless, my main purpose with this project was to draw a transparent link between the theory and functional, complete implementations. These are basically educational implementations, and will certainly never outperform heavy-duty toolkits like SRILM or MITLM, but I think (hope) they might be easier for beginners to grok.
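To give a flavour of what "interpolated" means here, below is a minimal sketch of interpolated Absolute Discounting for a bigram model. It is not taken from the project; the function names (train_bigram_counts, abs_discount_prob) and the fixed discount of 0.75 are assumptions made purely for illustration, and the lower-order distribution is a plain ML unigram model rather than anything more refined.

```python
from collections import Counter, defaultdict


def train_bigram_counts(sentences):
    """Collect the counts needed for a bigram model with <s>/</s> padding."""
    bigram_counts = Counter()     # c(h, w)
    history_counts = Counter()    # c(h) = sum over w of c(h, w)
    word_counts = Counter()       # c(w), counted at predicted positions only
    followers = defaultdict(set)  # distinct word types seen after h
    for sentence in sentences:
        tokens = ["<s>"] + list(sentence) + ["</s>"]
        for h, w in zip(tokens, tokens[1:]):
            bigram_counts[(h, w)] += 1
            history_counts[h] += 1
            word_counts[w] += 1
            followers[h].add(w)
    return bigram_counts, history_counts, word_counts, followers


def abs_discount_prob(w, h, counts, discount=0.75):
    """P(w | h) under interpolated absolute discounting (illustrative sketch).

    P(w | h) = max(c(h, w) - D, 0) / c(h) + (D * N1+(h) / c(h)) * P_ml(w)
    where N1+(h) is the number of distinct word types observed after h.
    """
    bigram_counts, history_counts, word_counts, followers = counts
    p_unigram = word_counts[w] / sum(word_counts.values())
    if history_counts[h] == 0:
        # Unseen history: fall back entirely to the unigram distribution.
        return p_unigram
    discounted = max(bigram_counts[(h, w)] - discount, 0.0) / history_counts[h]
    # Mass removed by discounting, redistributed via the unigram model.
    interpolation_weight = discount * len(followers[h]) / history_counts[h]
    return discounted + interpolation_weight * p_unigram
```

With counts = train_bigram_counts([["a", "b"], ["a", "c"], ["b", "a"]]), calling abs_discount_prob("a", "a", counts) gives a non-zero probability to a bigram that never occurs in the data, and the probabilities for a given history still sum to one over the vocabulary. Kneser-Ney keeps the same interpolated form but replaces the ML unigram with a continuation-count distribution.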
In any case, I noticed the other day that NLTK contains support for a variety of different n-gram smoothing methods, but does not contain implementations of any of the three approaches in my little library, or of the ARPA-to-WFSA conversion algorithm. Modified Kneser-Ney smoothing is still pretty much the best option out there, and there is some FST-related stuff on the NLTK projects page, so I thought there might be interest in incorporating the project into NLTK, where it might be of use to some other people.
Finally, the project is pretty well documented, and I have also provided a script, leveraging the Google OpenGRM NGramLibrary tools, that can be used to verify that the parameter estimation and the models it produces are correct. I'm sure there are still plenty of bugs, and that it can/should be made more robust, but I thought I'd throw it out here just in case there is some interest.
Thanks for your time and apologies for the rather long-winded e-mail.
Best regards, Joe