Proposed contribution: Python implementations of Absolute Discounting, Kneser-Ney and Modified Kneser-Ney smoothing


Joe

Dec 4, 2012, 1:27:20 AM
to nltk...@googlegroups.com
Hi,

  I am a graduate student studying the theory and application of Weighted Finite-State Transducers to Speech Recognition and Natural Language Processing.

  Recently I open-sourced a small collection of stand-alone Python implementations of several popular language model smoothing techniques. These include interpolated versions of Absolute Discounting, Kneser-Ney and Modified Kneser-Ney smoothing. There are also basic utilities for a maximum-likelihood model, counting n-grams, and evaluating/exploring the models. The project also includes a simple, linear-time implementation for converting an n-gram model in standard ARPA format to an equivalent Weighted Finite-State Acceptor. The project can be found on github, and is released under the BSD license:


  The project is written in pure Python, but unfortunately I did not design it with NLTK in mind; it is completely standalone. The implementations follow the seminal Chen & Goodman '98 paper on statistical language modeling, and are actually pretty efficient, even for medium-sized corpora. Nevertheless, my main purpose with this project was to draw a transparent link between the theory and functional, complete implementations. These are basically educational implementations, and will certainly never outperform heavy-duty toolkits like SRILM or MITLM, but I think (hope) they might be easier for beginners to grok.
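  [Editor's note: to give a flavor of the educational angle, here is a minimal sketch of interpolated absolute discounting for bigrams. This is illustrative only, not code from the project; the function name, the default discount d=0.75, and the ML-unigram lower-order model are my own choices.]

```python
from collections import Counter, defaultdict

def absolute_discount_bigram(tokens, d=0.75):
    """Interpolated absolute discounting for bigrams (illustrative sketch)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])          # c(u): times u occurs as a bigram context
    unigrams = Counter(tokens)
    total = len(tokens)
    followers = defaultdict(set)             # distinct continuations of each context u
    for u, w in bigrams:
        followers[u].add(w)

    def prob(u, w):
        p_uni = unigrams[w] / total          # lower-order (ML unigram) estimate
        c_u = contexts[u]
        if c_u == 0:
            return p_uni                     # unseen context: back off entirely
        discounted = max(bigrams[(u, w)] - d, 0) / c_u
        lam = d * len(followers[u]) / c_u    # mass freed up by the discount
        return discounted + lam * p_uni

    return prob
```

  Because the discounted mass is exactly redistributed through the interpolation weight, the conditional distribution for any seen context sums to one over the vocabulary.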

  In any case, I noticed the other day that NLTK contains support for a variety of different n-gram smoothing methods, but does not contain implementations for any of the three approaches in my little library, or the ARPA to WFSA conversion algorithm.  Modified Kneser-Ney smoothing is still pretty much the best option out there, and there is some FST-related stuff on the NLTK projects page, so I thought there might be interest in incorporating the project into NLTK - where it might be of use to some other people.
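  [Editor's note: the ARPA-to-WFSA idea can be sketched in a few lines. The conventional construction, which the sketch below follows for the bigram-only case, is: each seen history becomes a state, each n-gram becomes an arc weighted by its negative log probability, and each backoff weight becomes an epsilon arc into a shared backoff state. This is a toy under my own conventions, not the project's actual conversion code.]

```python
import math

# A tiny hand-made bigram model in ARPA format (log10 probabilities, as standard).
ARPA = """\\
\\data\\
ngram 1=3
ngram 2=2

\\1-grams:
-0.5  a  -0.3
-0.7  b  -0.2
-0.9  </s>

\\2-grams:
-0.3  a b
-0.4  b </s>

\\end\\
"""

def arpa_bigram_to_wfsa(text):
    """Convert a bigram ARPA model into WFSA arcs (src, label, weight, dst).
    States: one per unigram history, plus a shared backoff state 'eps'.
    Weights are negative natural-log probabilities (tropical-semiring style)."""
    section, uni, bi = None, {}, {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("\\"):
            section = line                   # e.g. \\1-grams: or \\2-grams:
            continue
        if not line or section in (None, "\\data\\"):
            continue
        parts = line.split()
        if section == "\\1-grams:":
            logp, word = float(parts[0]), parts[1]
            bow = float(parts[2]) if len(parts) > 2 else 0.0
            uni[word] = (logp, bow)
        elif section == "\\2-grams:":
            bi[(parts[1], parts[2])] = float(parts[0])
    ln10 = math.log(10)
    arcs = []
    for w, (logp, bow) in uni.items():
        arcs.append(("eps", w, -logp * ln10, w))       # unigram arc out of backoff state
        arcs.append((w, "<eps>", -bow * ln10, "eps"))  # backoff (failure) arc
    for (u, w), logp in bi.items():
        arcs.append((u, w, -logp * ln10, w))           # seen bigram arc
    return arcs
```

  The resulting arc list is close to what a toolkit like OpenFst would ingest as a text-format acceptor.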

  Finally, the project is pretty well-documented, and I have also provided a script leveraging the Google OpenGRM NGramLibrary tools that can be used to verify that the parameter estimation and the models it produces are correct. I'm sure there are still plenty of bugs, and that it can/should be made more robust, but I thought I'd throw it out here just in case there is some interest.

  Thanks for your time and apologies for the rather long-winded e-mail.

Best regards,  Joe

Joe

Dec 4, 2012, 1:30:01 AM
to nltk...@googlegroups.com
Oops, I just realized that there are some other, much older projects out there on the web named SimpleLM.  My project has absolutely no relation to them, and maybe I need to change the name.

Steven Bird

Dec 4, 2012, 6:13:33 PM
to nltk...@googlegroups.com
Thanks Joe. I'm keen to have your work incorporated into NLTK, since it would fill a lot of gaps in our existing support for language modelling. Would you be willing for us to do this?

Is anyone on the nltk-dev list interested in taking on the task of integrating Joe's work into NLTK?

-Steven

Josef Novak

Dec 4, 2012, 7:45:23 PM
to nltk...@googlegroups.com
Hi Steven,

  I'd be more than happy if you think it would be a useful addition.  You can use it / modify it however you like or think would be best for NLTK.
  This week I don't really have any time to work on it myself, but come the end of next week I will have a good deal of free time to help with integration, if so desired.

  I guess the main issue would be the method of integration. Since I did not design it with NLTK in mind, the classes probably don't exactly fit the other LM code in NLTK, or its coding conventions. I tried to be very thorough with the documentation, but my perspective is definitely biased by my own experience.

  In any case, both the Kneser-Ney and Modified Kneser-Ney classes have two implementations:  
  • One for training the model directly from raw text
  • One for training the model from raw counts
The raw-counts implementation should be reasonably simple to integrate with NLTK, but it is a bit slower and, in my opinion, more opaque than the implementation that works directly from raw text. The raw-text version is based on the 'store only what you need' approach proposed by Chen and Goodman, and counts contexts directly. The absolute discounting implementation only works from raw text, but it would be pretty trivial to write a from-counts version if needed.
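[Editor's note: as a rough illustration of the 'count contexts directly' idea, an interpolated bigram Kneser-Ney estimator can be sketched as below. This is my own toy sketch, not the project's code; the distinctive point is that the lower-order model uses continuation counts (in how many distinct contexts a word appears) rather than raw unigram frequencies.]

```python
from collections import Counter

def kn_bigram_prob(tokens, d=0.75):
    """Interpolated Kneser-Ney for bigrams (illustrative sketch)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(u for u, _ in bigrams.elements())  # c(u) as a context
    cont = Counter(w for _, w in bigrams)      # N1+(.w): distinct contexts preceding w
    followers = Counter(u for u, _ in bigrams)  # N1+(u.): distinct continuations of u
    n_types = len(bigrams)                      # N1+(..): total number of bigram types

    def prob(u, w):
        p_cont = cont[w] / n_types              # Kneser-Ney continuation probability
        c_u = contexts[u]
        if c_u == 0:
            return p_cont                       # unseen context: continuation prob only
        lam = d * followers[u] / c_u            # interpolation weight from the discount
        return max(bigrams[(u, w)] - d, 0) / c_u + lam * p_cont

    return prob
```

[Here too the discounted mass is exactly redistributed, so each seen context's distribution sums to one over the vocabulary.]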

The other issue, I guess, is that there are no real tests, other than the ad hoc verification scripts I wrote against the Google NGramLibrary to make sure I'd got the parameter computation right.

Best regards,
  Joe


Steven Bird

Dec 6, 2012, 2:49:46 PM
to nltk-dev
Thanks for this Joe. I'd like to propose the following first step: would you please take a look at nltk.model.ModelI (the language model interface) and tell us how you would need to modify or extend it?

Given the minimality of our existing nltk.model subpackage, I'm not averse to replacing it entirely if it comes to that.



Dayvid Welles

Mar 24, 2016, 11:08:10 PM
to nltk-dev
Hi,

Out of curiosity, have these methods been integrated into NLTK?

Hans B.

Jul 10, 2017, 5:43:29 PM
to nltk-dev
Hi all,

Thanks for all the tremendous work in these packages. I was wondering what the status of this particular Kneser-Ney implementation is, since https://stackoverflow.com/questions/35242155/kneser-ney-smoothing-of-trigrams-using-python-nltk seems to show that the resulting probabilities sum to strange values. Is that a misunderstanding, or could there be a bug here?

Cheers
Hans