NaN probability


Roman Prokofyev

Aug 29, 2014, 5:15:35 AM
to berkeleyl...@googlegroups.com
Hello,

I've started to receive NaN probabilities for some of my sentences and don't really know why this is happening.

I have run a debugger for problematic sentences and found that NaN is coming from "StupidBackoffLm.java" line 59:

    probContext = localMap.getValueAndOffset(probContext, probContextOrder, ngram[i], scratch);


For some reason the scratch value is never set, so it stays at -1, and taking the log of that yields NaN.
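
To illustrate the failure mode described above: in Java, the logarithm of a negative number is NaN rather than an exception, and that NaN then survives every later addition. The following is a minimal sketch and not BerkeleyLM code. The class name `NanCheck` and the helper `safeLog` are hypothetical, and the -1 merely stands in for the unset scratch value.

```java
public class NanCheck {
    /**
     * Hypothetical guard: returns log(count), but fails loudly on a
     * non-positive count instead of silently returning NaN or -Infinity.
     */
    public static double safeLog(long count) {
        if (count <= 0) {
            throw new IllegalArgumentException("count must be positive, got " + count);
        }
        return Math.log(count);
    }

    public static void main(String[] args) {
        double bad = Math.log(-1);   // no exception is thrown here
        System.out.println("log(-1) is NaN: " + Double.isNaN(bad));
        // Adding a finite log probability to NaN still gives NaN.
        System.out.println("sum is NaN: " + Double.isNaN(-3.5 + bad));
    }
}
```

A guard of this kind would turn a silent NaN into an error that points at the n-gram whose count was never filled in.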

I don't know what other information might help solve the problem, except perhaps that probContextOrder=1 in this case.
Thanks.

Adam Pauls

Sep 7, 2014, 2:25:23 PM
to berkeleyl...@googlegroups.com
Can you post the input and command line you used to get this error?


--
You received this message because you are subscribed to the Google Groups "berkeleylm-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to berkeleylm-disc...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Roman Prokofyev

Sep 22, 2014, 4:38:49 AM
to berkeleyl...@googlegroups.com
Sorry for the delay; for some reason I am not receiving notifications about updates.
It seems that my data was corrupted while being saved from HDFS.
Now everything is fine.

Gena Kukartsev

Mar 3, 2017, 2:56:19 PM
to berkeleylm-discuss
I am still getting NaNs when trying to compute probabilities using the Google Books binary and vocab from http://tomato.banatao.berkeley.edu:8080/berkeleylm_binaries/. I am using English. Can you (or anyone) post MD5 sums for the English binary and vocab files, so that I can validate the download?
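
For anyone who wants to compare downloads, here is a minimal sketch of computing an MD5 sum with the JDK's `MessageDigest`. The class name `Md5Check` and the method `md5Hex` are hypothetical. No reference checksums appear in this thread, so the output is only useful for comparing against a value someone else posts or against a second download.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Check {
    /** Streams the input through MD5 and returns the digest as lowercase hex. */
    public static String md5Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[8192];
        int n;
        // Read in chunks so multi-gigabyte binaries never have to fit in memory.
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java Md5Check eng.blm.gz vocab_cs.gz
        for (String name : args) {
            try (InputStream in = Files.newInputStream(Paths.get(name))) {
                System.out.println(md5Hex(in) + "  " + name);
            }
        }
    }
}
```

The same check would also catch the kind of corruption reported earlier in the thread, provided a known-good checksum is available.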

I am doing this:

$ echo "This is a sample sentence ." | java -ea -mx16g -server -cp ../src edu.berkeley.nlp.lm.io.ComputeLogProbabilityOfTextStream -g vocab_cs.gz eng.blm.gz
Reading Google Binary eng.blm.gz with vocab vocab_cs.gz {
} [58s]
Scoring file -; current log probability is 0.0 {
} [0s]
Log probability of text is: NaN

Gena Kukartsev

Mar 6, 2017, 6:07:34 PM
to berkeleylm-discuss
I suspect that this NaN problem is due to beginning-of-sentence, end-of-sentence, and <unk> issues. The Google Books vocabulary from http://tomato.banatao.berkeley.edu:8080/berkeleylm_binaries/ does not contain <S>, </s>, or <UNK>. I downloaded the n-grams and rebuilt the model (up to 3-grams), but I get NaNs with it too. My guess is that the cause is out-of-vocabulary words, specifically these three. The toy vocabulary in the repo, in berkeleylm/test/edu/berkeley/nlp/lm/io/googledir/1gms/, does have all three. So all tests with the toy googledir work, but neither the downloaded model nor a model built from the downloaded n-grams works.
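
The vocabulary claim above can be checked directly. The sketch below scans a vocabulary file for the three tokens, spelled exactly as in this post. It assumes the file is gzip-compressed text with one entry per line and the word as the first tab-separated field; that layout is an assumption, not something confirmed in this thread. The class name `VocabCheck` and the method `missingTokens` are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.zip.GZIPInputStream;

public class VocabCheck {
    /** Returns the tokens from `wanted` that never occur as the first field of any line. */
    public static Set<String> missingTokens(BufferedReader vocab, Set<String> wanted) throws IOException {
        Set<String> missing = new LinkedHashSet<>(wanted);
        String line;
        // Stop early once every wanted token has been seen.
        while (!missing.isEmpty() && (line = vocab.readLine()) != null) {
            int tab = line.indexOf('\t');
            String word = tab >= 0 ? line.substring(0, tab) : line;
            missing.remove(word);   // comparison is case-sensitive
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java VocabCheck vocab_cs.gz
        Set<String> wanted = new LinkedHashSet<>(Arrays.asList("<S>", "</s>", "<UNK>"));
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(Paths.get(args[0]))),
                StandardCharsets.UTF_8))) {
            System.out.println("Missing tokens: " + missingTokens(r, wanted));
        }
    }
}
```

Because the match is case-sensitive, it may be worth trying other spellings of the boundary tokens before concluding that they are absent.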

I am now trying to figure out how to deal with it. I am new to LMs and would appreciate any advice. My current guess is that the default scoreSentence requires bounded sentences, which simply do not exist in the Google Books n-grams. Am I in the right ballpark?

Thanks!

Gena

Gena Kukartsev

Mar 6, 2017, 8:26:17 PM
to berkeleylm-discuss
These NaNs are definitely caused by OOV (out-of-vocabulary) words: the beginning-of-sentence and end-of-sentence symbols. The Google n-grams do not contain them, but the default lm.scoreSentence() method adds them before and after the provided text via BoundedList. As a quick fix, I implemented scorePhrase(), which does not bound the sentence, but I wonder what the "proper" way to handle this is.

In case anyone needs it in the future, use this instead of scoreSentence:

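// Needs: import java.util.List; import edu.berkeley.nlp.lm.NgramLanguageModel;

// At the call site, swap the bounded scorer for the unbounded one:
//logProb += lm.scoreSentence(words);
logProb += scorePhrase(words, lm);

/** Scores a phrase without adding sentence-boundary symbols. */
public static <T> float scorePhrase(final List<T> sentence, final NgramLanguageModel<T> lm) {
    final int lmOrder = lm.getLmOrder();
    System.out.println("LM order: " + lmOrder);
    System.out.println("Phrase length: " + sentence.size());
    float sentenceScore = 0.0f;
    // Leading n-grams shorter than the model order: [w0], [w0 w1], ...
    // The bound is i < sentence.size() (not <=) so that subList never
    // runs past the end of a phrase shorter than lmOrder - 1.
    for (int i = 0; i < lmOrder - 1 && i < sentence.size(); ++i) {
        final List<T> ngram = sentence.subList(0, i + 1);
        System.out.println("first loop i=" + i + ", ngram: " + ngram);
        final float scoreNgram = lm.getLogProb(ngram);
        sentenceScore += scoreNgram;
    }
    // Full-order n-grams ending at each remaining position.
    for (int i = lmOrder - 1; i < sentence.size(); ++i) {
        final List<T> ngram = sentence.subList(i - lmOrder + 1, i + 1);
        System.out.println("second loop i=" + i + ", ngram: " + ngram);
        final float scoreNgram = lm.getLogProb(ngram);
        sentenceScore += scoreNgram;
    }
    return sentenceScore;
}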
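
To make it easier to see which n-grams a phrase-level scorer such as scorePhrase above is meant to visit, here is a library-free sketch that only enumerates the windows and does not score anything. It assumes the intended behavior is one n-gram per word, truncated on the left at the start of the phrase. The class name `PhraseWindows` and the method `windows` are hypothetical and are not part of BerkeleyLM.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PhraseWindows {
    /**
     * Returns, for each position i, the n-gram ending at i with at most
     * lmOrder words and no boundary symbols added on either side.
     */
    public static <T> List<List<T>> windows(List<T> phrase, int lmOrder) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < phrase.size(); ++i) {
            // Clamp the start at 0 so early positions get shorter n-grams.
            int start = Math.max(0, i - lmOrder + 1);
            out.add(new ArrayList<>(phrase.subList(start, i + 1)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("This", "is", "a", "sample", "sentence", ".");
        for (List<String> ngram : windows(words, 3)) {
            System.out.println(ngram);
        }
    }
}
```

Folding both loops into one with a clamped start index avoids the separate bounds check for phrases shorter than the model order.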