Improving/Extending the Language Modeling module

Ilia Kurenkov

Dec 30, 2018, 10:03:33 AM
to nltk-dev
Hi folks!

As was announced recently, starting with version 3.4 the language modeling module is back in NLTK.

To this I wanted to add that I did not have time to implement everything I wanted for this package. I focused my efforts on making the core objects and data structures robust and easy to use. I was aiming to create a sort of toolkit that would make it fun for people to write their own models. If you are looking for small projects for yourself or to give to students, this could be a good opportunity!

Some specific areas that need improvement:

1. One notable model class that is currently missing is Simple Good-Turing. I took a quick stab at it, but implementing it correctly was a bit too involved for me at the time. So that's a great candidate for new contributions!
2. Backoff models are also missing right now. The smoothing classes are written in such a way as to support both interpolation and backoff, but when I tried creating backoff models, they violated certain statistical properties (namely, that the probabilities for a given context sum to 1) that are really a must with these models. Help figuring that out would also be appreciated!
3. Documentation fixes and improvements are always welcome, of course! I've been staring at this stuff for long enough to have lost track of what it looks like from a user's perspective.
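
For anyone picking up item 2, here is a minimal sketch of the sum-to-1 check in plain Python. It deliberately does not use the nltk.lm classes; the toy corpus and the names naive_backoff and total_mass are made up for illustration. It shows one well-known way a backoff model can fail the check (backing off without discounting any mass from seen n-grams), which may or may not be the issue described above.

```python
from collections import Counter, defaultdict

# Toy corpus; every name in this sketch is hypothetical, not part of nltk.lm.
tokens = "a b a c a b b c".split()
vocab = sorted(set(tokens))

unigrams = Counter(tokens)
bigrams = defaultdict(Counter)
for prev, word in zip(tokens, tokens[1:]):
    bigrams[prev][word] += 1


def mle_unigram(word):
    """Maximum-likelihood unigram probability."""
    return unigrams[word] / len(tokens)


def naive_backoff(word, prev):
    """Use the bigram MLE if the bigram was seen, else the unigram MLE.

    Nothing is discounted from the seen bigrams, so the backed-off
    estimates are added on top of a distribution that already sums to 1.
    """
    seen = bigrams[prev]
    if seen[word] > 0:
        return seen[word] / sum(seen.values())
    return mle_unigram(word)


def total_mass(score, prev):
    """Sum a conditional score over the whole vocabulary for one context."""
    return sum(score(word, prev) for word in vocab)


# A proper conditional distribution should print 1.0 for every context.
for prev in vocab:
    print(prev, round(total_mass(naive_backoff, prev), 3))
```

In this toy corpus only the context "b" sums to 1, because every word was seen after it; the other contexts come out above 1. A correct backoff scheme has to discount the seen estimates and scale the backed-off ones so that the total returns to 1, and a check like total_mass is an easy way to test any candidate.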

Cheers,
Ilia