Hello everybody,
They use a bigram language model with Katz backoff and, for the unigram step, Laplace smoothing with a factor of 0.2. They build a language model for every month and compare user posts to the corresponding "snapshot language models" by computing the cross-entropy of the posts' bigrams under each snapshot model.
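To make the approach concrete, here is a minimal pure-Python sketch of the cross-entropy computation I have in mind. It uses simple add-gamma (Lidstone) smoothing with gamma = 0.2 everywhere and omits the Katz backoff step for brevity, so it is a simplification of the paper's method, not a faithful reimplementation:

```python
import math
from collections import Counter

def train_bigram_lm(tokens):
    """Count unigrams and bigrams from one 'snapshot' corpus (e.g. one month)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(w1, w2, unigrams, bigrams, vocab_size, gamma=0.2):
    """P(w2 | w1) with add-gamma (Lidstone) smoothing.
    NOTE: the paper uses Katz backoff; that step is omitted here."""
    return (bigrams[(w1, w2)] + gamma) / (unigrams[w1] + gamma * vocab_size)

def cross_entropy(post_tokens, unigrams, bigrams, vocab_size, gamma=0.2):
    """Average negative log2 probability of the post's bigrams under the model."""
    pairs = list(zip(post_tokens, post_tokens[1:]))
    logp = sum(
        math.log2(bigram_prob(w1, w2, unigrams, bigrams, vocab_size, gamma))
        for w1, w2 in pairs
    )
    return -logp / len(pairs)

# Toy example: one snapshot model, one user post
snapshot = "the cat sat on the mat the cat ran".split()
uni, bi = train_bigram_lm(snapshot)
post = "the cat sat".split()
print(cross_entropy(post, uni, bi, vocab_size=len(uni)))
```

A lower cross-entropy would mean the post's language is closer to that month's snapshot. I would like to do essentially this, but with proper Katz backoff, inside NLTK.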
Is this currently possible in NLTK? From what I have read, the n-gram model package has been under construction since 2013. Alternatively, could I use an older version of NLTK where this was still possible?
I would love an in-Python solution; I have looked at KenLM and SRILM, but neither is quite as handy as NLTK would be.
I am grateful for any push in the right direction,
Thanks in advance!
Clem