-Infinity log probability


Roman Prokofyev

Jul 15, 2014, 11:33:05 AM
to berkeleyl...@googlegroups.com
Hello,

I'm trying to compute log probabilities on a test collection and noticed that around 15 sentences out of 57k (Brown corpus) yield a log probability of -Infinity.
So I'm wondering why that is. Isn't the stupid backoff model smoothed as well, or is only Kneser-Ney?

Thanks.

Adam Pauls

Jul 15, 2014, 9:51:12 PM
to berkeleyl...@googlegroups.com
This happens with out of vocabulary words. You need to include an <UNK> token in your data to handle this properly.


--
You received this message because you are subscribed to the Google Groups "berkeleylm-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to berkeleylm-disc...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Roman Prokofyev

Jul 16, 2014, 5:02:11 AM
to berkeleyl...@googlegroups.com
I just did more debugging and figured out that there were no OOV words.

It turned out that the problem is in StupidBackoffLm.java, line 72:
pow(alpha, i - startPos)

alpha is set to 0.4, and if i > 120 or so, pow returns 0, so the logarithm becomes -Infinity.

Probably it is better to expand the log calculation to ensure this situation never happens:

log(x/y*pow(a,b)) = log(x/y) + log(pow(a,b)) = log(x/y) + b*log(a),

This way the backoff term becomes i*log(0.4); for i = 140 it is just about -128.

I could create a patch, but don't know how it's done on Google Code.
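
The expansion proposed above can be sketched as follows. This is only an illustration of the arithmetic, not the actual StupidBackoffLm.java code: the class and method names are made up, and it assumes the value is held in single precision (float), which is what the reported threshold of roughly 120 suggests.

```java
// Sketch only: mirrors the arithmetic discussed in the thread, not the BerkeleyLM source.
// Assumption: the penalty is stored as a float, so alpha^k underflows to 0 for large k.
final class BackoffPenaltyDemo {
    static final float ALPHA = 0.4f; // backoff factor quoted in the thread

    // Direct form: log(ratio * alpha^k). Becomes log(0) = -Infinity once alpha^k underflows.
    static float directLog(float ratio, int k) {
        float penalty = (float) Math.pow(ALPHA, k);
        return (float) Math.log(ratio * penalty);
    }

    // Expanded form: log(ratio) + k * log(alpha). No power is computed, so nothing underflows.
    static float expandedLog(float ratio, int k) {
        return (float) (Math.log(ratio) + k * Math.log(ALPHA));
    }

    public static void main(String[] args) {
        // Compare both forms for a small, a large, and a very large exponent.
        for (int k : new int[] {5, 100, 140}) {
            System.out.println("k=" + k
                    + " direct=" + directLog(0.5f, k)
                    + " expanded=" + expandedLog(0.5f, k));
        }
    }
}
```

For small k both forms agree up to rounding; only the expanded form stays finite once the power underflows.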



Adam Pauls

Jul 16, 2014, 11:16:16 AM
to berkeleyl...@googlegroups.com
Why would i ever be that large?


Roman Prokofyev

Jul 16, 2014, 2:06:58 PM
to berkeleyl...@googlegroups.com
In the Brown corpus (http://en.wikipedia.org/wiki/Brown_Corpus), which is often used to test language models, there are sentences that long.

But the question is not about why "i" is that large, the question is about the flaw in the library :)



Adam Pauls

Jul 16, 2014, 8:10:56 PM
to berkeleyl...@googlegroups.com
Right, but you should never have i-startPos > lm-order (so something like 4 or 5).

Are you by chance trying to call getLogProb on the whole sentence? Or do you really allow 100-grams in your model?
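
To illustrate the point about i-startPos never exceeding the model order, here is a minimal sketch of scoring a sentence one word at a time with a bounded context window. The NgramScorer interface and the scoreSentence helper are hypothetical stand-ins written for this example, not BerkeleyLM classes, and sentence boundary symbols are left out for brevity.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: NgramScorer is a hypothetical stand-in, not a BerkeleyLM interface.
final class WindowedSentenceScorer {

    /** Hypothetical scorer: log P(last word of ngram | preceding words of ngram). */
    interface NgramScorer {
        double logProb(List<String> ngram);
    }

    // Sums per-word log-probabilities, never passing more than `order` words at once.
    static double scoreSentence(List<String> words, int order, NgramScorer scorer) {
        double total = 0.0;
        for (int end = 1; end <= words.size(); end++) {
            int start = Math.max(0, end - order); // clamp the window to the model order
            total += scorer.logProb(words.subList(start, end));
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList("a", "very", "long", "test", "sentence");
        // Toy scorer: charges -1 per word and rejects windows longer than a trigram.
        NgramScorer toy = ngram -> {
            if (ngram.size() > 3) {
                throw new IllegalStateException("window longer than model order");
            }
            return -1.0;
        };
        System.out.println(scoreSentence(sentence, 3, toy));
    }
}
```

With this structure the backoff exponent is bounded by the model order regardless of sentence length, so a very long sentence cannot by itself push the penalty into underflow.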



Roman Prokofyev

Jul 17, 2014, 3:31:18 AM
to berkeleyl...@googlegroups.com
Ok, there might be some other problem then.

It seems that getLogProb is called on the whole sentence, but I'm just using the ComputeLogProbabilityOfTextStream class from the .io package.
As far as I can see, this class executes the "computeProb" method, in which it calls getLogProb on the list of all words in the sentence.
I don't know if this is right or wrong, but that's what I see in the code.

The same thing happens if I build the model from the test data provided with the code.
I'm attaching the test sentence that produces this result.
large_sentence.txt

iesus.c...@gmail.com

Jul 17, 2014, 10:08:48 AM
to berkeleyl...@googlegroups.com
I'm having the same problem. For some sentences the system tries to find n-grams that go beyond the maximum n-gram order; I'm not exactly sure why or when this happens. For some sentences it works, but for others it doesn't, even with fairly common words.

What I did in the same method was add the following. I'm not sure the resulting computations are correct, but at least it no longer crashes (for me it crashed with a null pointer exception):

// Jesus's addition: stop before trying to look up n-grams beyond the model order (e.g. 6-grams)
if (probContextOrder >= map.getMaxNgramOrder() - 1) {
    System.out.println("Max ngramorder:" + map.getMaxNgramOrder());
    System.out.println("I broke!");
    break;
}

This is also using the ComputeLogProbabilityOfTextStream class. So, what exactly does this class do? I'm interested in computing prefix probabilities in order to calculate surprisal scores, do you think I should use this class?

Thank you very much!
Jesús
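
For the surprisal use case mentioned above, a minimal sketch of the conversion is shown below. It assumes you already have one conditional log-probability per word and that these are natural logs; both the class name and that base are assumptions of this example, so check which base your language model actually returns.

```java
// Sketch only: helper names are hypothetical; input is assumed to be natural-log
// conditional probabilities, one per word, in sentence order.
final class SurprisalDemo {

    // Surprisal in bits: -log2 P(w_i | context) = -ln P(w_i | context) / ln 2.
    static double[] surprisalBits(double[] lnCondProbs) {
        double[] bits = new double[lnCondProbs.length];
        for (int i = 0; i < bits.length; i++) {
            bits[i] = -lnCondProbs[i] / Math.log(2.0);
        }
        return bits;
    }

    // Prefix log-probabilities: running sum of the conditional log-probabilities.
    static double[] prefixLogProbs(double[] lnCondProbs) {
        double[] prefix = new double[lnCondProbs.length];
        double running = 0.0;
        for (int i = 0; i < prefix.length; i++) {
            running += lnCondProbs[i];
            prefix[i] = running;
        }
        return prefix;
    }

    public static void main(String[] args) {
        double[] lnProbs = {Math.log(0.5), Math.log(0.25), Math.log(0.125)};
        System.out.println(java.util.Arrays.toString(surprisalBits(lnProbs)));
        System.out.println(java.util.Arrays.toString(prefixLogProbs(lnProbs)));
    }
}
```

The surprisal of word i is also the difference between consecutive prefix log-probabilities, so either helper can serve as the starting point.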
 

Adam Pauls

Jul 17, 2014, 9:42:31 PM
to berkeleyl...@googlegroups.com
Well that's embarrassing -- that's a glaring bug in ComputeLogProbabilityOfTextStream. It should be calling this method: https://code.google.com/p/berkeleylm/source/browse/trunk/src/edu/berkeley/nlp/lm/ArrayEncodedNgramLanguageModel.java#46 on the List<String> of words.

I'm not near a computer where I can patch that just now. 



iesus.c...@gmail.com

Jul 18, 2014, 6:07:14 AM
to berkeleyl...@googlegroups.com
Thank you! Now the output looks much better, and no exceptions :D 

Roman Prokofyev

Jul 19, 2014, 5:55:09 PM
to berkeleyl...@googlegroups.com
Thanks, looking forward to the fix!

Adam Pauls

Sep 7, 2014, 2:59:39 PM
to berkeleyl...@googlegroups.com
Please update from the latest SVN and see if your problems are fixed.

