NaN-probabilities

miluette

Jan 19, 2016, 5:43:06 AM

to berkeleylm-discuss

Hello,

I ran into the following problem:

When running ComputeLogProbabilityOfTextStream from the terminal with the Google Books binary and vocabulary file, I get NaN scores.

>> echo "Das ist ein Satz" | java -ea -mx1000m -server -cp ./src edu.berkeley.nlp.lm.io.ComputeLogProbabilityOfTextStream -g vocab_cs.gz ./src/ger.blm.gz

>>Reading Google Binary ./src/ger.blm.gz with vocab vocab_cs.gz {

} [12s]

Scoring file -; current log probability is 0.0 {

} [0s]

Log probability of text is: NaN

However, when running the same command with the Google 1T binary (but the Google Books vocabulary), I get normal output (a negative double).

Does somebody have an idea what mistake I am making?

jeremy

Nov 19, 2016, 5:45:24 PM

to berkeleylm-discuss

How can this be solved?

On Tuesday, January 19, 2016 at 5:43:06 AM UTC-5, miluette wrote:

Gena Kukartsev

Mar 7, 2017, 7:26:23 PM

to berkeleylm-discuss

Sentence boundary symbols are not included in the vocabulary (I am not sure why), and that causes the NaNs: the scoreSentence method (rightfully) wraps the sentence in boundary symbols, which are not in the dictionary. I am not sure how to solve this correctly. You can call getLogProb on short word sequences (no longer than the model order) and compute the probability of your sentence from those. However, LM experts assure me that without boundary symbols in the vocabulary the LM is not properly normalized, so you cannot get a correct sentence probability this way. Another approach is to train a model on your own corpus.
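For anyone who wants to try the getLogProb workaround, here is a rough sketch of the sliding-window sum. It does not call BerkeleyLM: the class name SlidingWindowScorer and the ngramLogProb callback are hypothetical stand-ins for whatever per-n-gram scoring call you use. It assumes that call returns the log probability of the last word given the preceding words in the window. No boundary symbols are added, so the caveat about normalization still applies.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical helper, not part of BerkeleyLM.
final class SlidingWindowScorer {

    // Sums per-n-gram log probabilities over windows of at most `order` words.
    // The window ending at each word holds that word plus up to (order - 1)
    // words of left context. No sentence boundary symbols are inserted.
    static double score(List<String> words, int order,
                        ToDoubleFunction<List<String>> ngramLogProb) {
        if (order < 1) {
            throw new IllegalArgumentException("order must be >= 1");
        }
        double total = 0.0;
        for (int end = 1; end <= words.size(); end++) {
            int start = Math.max(0, end - order);
            total += ngramLogProb.applyAsDouble(words.subList(start, end));
        }
        return total;
    }

    public static void main(String[] args) {
        // Toy scorer for illustration only: -1.0 per word in the window.
        // Replace with a call into your language model.
        ToDoubleFunction<List<String>> toy = ngram -> -1.0 * ngram.size();
        List<String> sentence = Arrays.asList("Das", "ist", "ein", "Satz");
        System.out.println(score(sentence, 3, toy));
    }
}
```

If any window returns NaN, the sum is NaN as well, so it may help to log each window's score while debugging to see which words the vocabulary is missing.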
