-Infinity logprobs and unnormalized probs - KneserNey GoogleNgram


iesus.c...@gmail.com

Oct 1, 2014, 7:40:14 AM
to berkeleyl...@googlegroups.com
Hi,

I am trying to compute the entropy of the possible continuations of a given context, something like this:

H(context) = -SUM_X( P(x|context) * log P(x|context) )

where SUM_X is the sum over all words x in the vocabulary X.
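
The definition above (taken with the usual leading minus sign) can be sketched in plain Java. This is only a minimal sketch and not BerkeleyLM code: the class `EntropySketch` and the method `entropyNats` are hypothetical names, and the input is assumed to be a map from each candidate word to its base-10 log probability. Entries with log probability -Infinity (probability 0) are skipped, following the convention 0 * log 0 = 0.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class EntropySketch {
    /** Entropy in nats of a distribution given as word -> log10 probability (assumed input format). */
    static double entropyNats(Map<String, Double> log10Probs) {
        double entropy = 0.0;
        for (double log10p : log10Probs.values()) {
            if (Double.isInfinite(log10p) || Double.isNaN(log10p)) {
                continue; // p = 0 (or an unusable entry): contributes nothing to the sum
            }
            double lnP = log10p * Math.log(10.0); // convert log10 to natural log
            double p = Math.exp(lnP);
            entropy -= p * lnP;                   // accumulate -p * ln p
        }
        return entropy;
    }

    public static void main(String[] args) {
        // Toy distribution, made up for illustration only
        Map<String, Double> dist = new LinkedHashMap<>();
        dist.put("a", Math.log10(0.5));
        dist.put("b", Math.log10(0.5));
        dist.put("rare", Double.NEGATIVE_INFINITY);
        System.out.println("H = " + entropyNats(dist) + " nats");
    }
}
```

For a uniform distribution over two words the result should be close to ln 2; dividing by Math.log(2.0) would give bits instead of nats.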

For this I'm using the google Ngram corpus for English and a KneserNey language model of order 4. 

I get a distribution of possible next words given a context using the method getDistributionOverNextWords in NgramLanguageModel.java in edu.berkeley.nlp.lm

The vocabulary of the corpus has more than 13 million tokens, most of them very unlikely, so to speed things up I only use the first 600 000 entries, ranked by frequency.


The output looks fine most of the time, but I noticed a couple of things. I haven't tested thoroughly, just with a couple of sentences.

- One thing is that it sometimes returns -Infinity as the log probability of an n-gram. Since I'm only trying words listed in the vocabulary, I don't think this is an OOV issue, although it may be related. Using only the 600 000 most frequent words, it occurs with the following words as n-gram endings:

EX000347
M'sheath
M'sphere
-3.4796
seanews2
000143423
3:00:00.0
0.221488

Of course I can live without these words; I just wonder whether this points to some other underlying problem. The corpus contains the <UNK> symbol, so I guess this shouldn't happen.

- The other thing is that the distribution obtained from getDistributionOverNextWords is sometimes not normalized. Most of the time the total probability is close to 1.0, which is fine, but sometimes it goes well above 1.0, which I guess shouldn't happen either.

More concretely, it is not normalized for these ngrams:

"<S> two small"
which gives a total prob of 1.9999999708688196. I checked and it happened because the ngram "<S> two small cylinders" has a prob of 0.9999998807907066

"demonstrators were loudly"
which gives a total prob of 3.0955045791104068

"were loudly demanding"
which gives a total prob of 1.9124293472478409

I only tested literally two sentences, so I guess this actually occurs quite frequently. Please let me know if you have any insight into it.
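
A check along the following lines could flag such contexts automatically. It is a minimal sketch under assumptions, not part of BerkeleyLM: `NormalizationCheck`, `totalProbability` and `heaviestWord` are hypothetical names, and the distribution is assumed to be available as a map from word to base-10 log probability. In the example, only the value for "cylinders" comes from the figures above; the other two entries are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class NormalizationCheck {
    /** Sums 10^log10p over all words; NaN entries are ignored. */
    static double totalProbability(Map<String, Double> log10Probs) {
        double total = 0.0;
        for (double log10p : log10Probs.values()) {
            if (!Double.isNaN(log10p)) {
                total += Math.pow(10.0, log10p); // 10^(-Infinity) evaluates to 0.0
            }
        }
        return total;
    }

    /** Word with the largest log probability, or null if no word has nonzero probability. */
    static String heaviestWord(Map<String, Double> log10Probs) {
        String best = null;
        double bestLog = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : log10Probs.entrySet()) {
            double v = e.getValue();
            if (!Double.isNaN(v) && v > bestLog) {
                bestLog = v;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> dist = new LinkedHashMap<>();
        dist.put("cylinders", Math.log10(0.9999998807907066)); // value reported in this post
        dist.put("children", Math.log10(0.6));                 // made-up value
        dist.put("things", Math.log10(0.4));                   // made-up value
        double total = totalProbability(dist);
        if (total > 1.0 + 1e-6) {
            System.out.println("Not normalized: total=" + total + ", heaviest=" + heaviestWord(dist));
        }
    }
}
```

Reporting the heaviest word next to the total makes it easier to tell a single suspicious suffix (as with "cylinders") from mass that is spread over many words.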

Thanks in advance!
Jesús

iesus.c...@gmail.com

Dec 5, 2014, 12:29:01 PM
to berkeleyl...@googlegroups.com
It's been 2 months already with no answer. Could I at least get some guidance on where I should look? I'm trying to use this language model because the huge corpus gives it a sort of domain independence, but I find the code difficult to follow. Could you check what is going on? I noticed that in the KneserNeyLmReaderCallback class, the methods

public void handleNgramOrderFinished(int order)

public void handleNgramOrderStarted(int order)

public void cleanup()

are empty. Could that have something to do with it? It seems that some words are outliers and are not handled properly, which introduces normalisation problems. Could you give me a little bit of advice?

Thank you very much!

Jesús

Adam Pauls

Dec 5, 2014, 4:24:50 PM
to berkeleyl...@googlegroups.com
Apologies, but I have a day job now and rarely find time to actively support BerkeleyLM. Unfortunately, your problem doesn't have an immediate and obvious answer to me. If you can give me some (minimal) inputs and a command that produces the problem, I can try debugging myself and see what is going on. I suggest you also try debugging with good old-fashioned print statements to see where the -Infs are coming from.

iesus.c...@gmail.com

Dec 8, 2014, 2:14:48 PM
to berkeleyl...@googlegroups.com
Thanks for the answer. Ok. So I'll try to be precise and succinct. First I got a KneserNey LM of order 4 using this:
final String googleDir = argv[0];
final String vocabFile = argv[0] + "//1gms//vocab_cs.gz";
StringWordIndexer wordIndexer = new StringWordIndexer();
GoogleLmReader.addToIndexer(wordIndexer, vocabFile);
final NgramLanguageModel<String> lm = LmReaders.readLmFromGoogleNgramDir(googleDir, true, true, wordIndexer, new ConfigOptions());
final String outFile = argv[1];
Logger.startTrack("Writing to file " + outFile + " . . . ");
LmReaders.writeLmBinary(lm, outFile);

Then, having loaded the LM, I calculate entropies using the following method. It calculates the entropy of the possible continuations of a trigram. Given a trigram such as "this is a", it generates a probability distribution over all possible 4-grams and computes the total probability by summing over all possible continuations. In theory the total probability should never exceed 1.0, but in practice it is quite often close to 2.0 or even more. THIS is my main issue. I don't really mind the tokens with -inf probability, since they are very strange words and it makes sense that they are very improbable. What I really need is for the probabilities to be properly normalised; otherwise I cannot rely on the results.

private double computeEntPerContext(NgramLanguageModel<String> lm, List<String> context) {
    // log10 probability of every candidate next word (entVoc holds the bounded vocabulary)
    Counter<String> contDistribution = NgramLanguageModel.StaticMethods.getDistributionOverNextWordsBounded(lm, context, entVoc);
    double entropy = 0.0;
    double regProof = 0.0; // running total probability

    for (String word : contDistribution.keySet()) {
        double singleLog10Prob = contDistribution.getCount(word); // log10 prob of this continuation
        if (Double.isInfinite(singleLog10Prob)) {
            System.out.println("\nInfinite number:" + singleLog10Prob);
            System.out.println("Word:" + word + "\n");
        } else {
            double singleLnProb = singleLog10Prob * Math.log(10); // convert to ln
            double singleProb = Math.exp(singleLnProb);
            entropy += singleProb * singleLnProb;
            regProof += singleProb;
        }
    }
    System.out.println("\nTotalP:" + regProof + "\n"); // should be as close to 1.0 as possible
    return entropy * -1.0;
}


The method to generate the probability distribution is a slightly modified version of the method that is in the public interface NgramLanguageModel<W>. The difference is that instead of calculating the distribution using the complete vocabulary, it only uses the first 600 000 tokens which are contained in "entVocab". 

public static <W> Counter<W> getDistributionOverNextWordsBounded(final NgramLanguageModel<W> lm, List<W> context, ArrayList<W> entVocab) {
    // Keep at most the last (order - 1) context words, in their original order
    List<W> ngram = new ArrayList<W>();
    for (int i = 0; i < lm.getLmOrder() - 1 && i < context.size(); ++i) {
        ngram.add(context.get(context.size() - i - 1));
    }
    Collections.reverse(ngram);
    ngram.add(null); // placeholder slot for the candidate next word
    Counter<W> c = new Counter<W>();
    for (int index = 0; index < entVocab.size(); ++index) {
        W word = entVocab.get(index);
        if (word.equals(lm.getWordIndexer().getStartSymbol())) continue;
        ngram.set(ngram.size() - 1, word);
        //c.setCount(word, Math.exp(lm.getLogProb(ngram) * Math.log(10))); // the output of getLogProb is base-10 logprobs
        c.setCount(word, lm.getLogProb(ngram)); // we can live with log10 probs
    }
    return c; // note: -p*log p = e^(log p + log(-log p))
}
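
The bounded vocabulary passed in as entVocab could be built roughly as follows. This is a minimal sketch with assumptions: `BoundedVocabSketch` and `readTopWords` are hypothetical names, and the input is assumed to be sorted by descending frequency with one word per line, optionally followed by a tab and a count. That layout is an assumption about the vocabulary file and has not been checked against the actual vocab_cs.gz.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;

final class BoundedVocabSketch {
    /** Reads at most `limit` words from a frequency-sorted "word[TAB]count" listing (assumed format). */
    static ArrayList<String> readTopWords(BufferedReader in, int limit) {
        ArrayList<String> words = new ArrayList<>();
        try {
            String line;
            while (words.size() < limit && (line = in.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                int tab = line.indexOf('\t');
                words.add(tab >= 0 ? line.substring(0, tab) : line); // keep only the word column
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // Tiny in-memory stand-in for the real vocabulary file; counts are made up
        String sample = "the\t100\nof\t80\nM'sheath\t1\n";
        ArrayList<String> entVocab = readTopWords(new BufferedReader(new StringReader(sample)), 2);
        System.out.println(entVocab);
    }
}
```

For a gzipped file, the reader could be wrapped around a java.util.zip.GZIPInputStream; the limit would then be 600 000 instead of 2.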


So I guess the main issue lies in "lm.getLogProb(ngram)", which is a direct call to the language model.

I hope the main issue was clear.

Regarding the input and output, the tokens within the first 600 000 of the vocabulary that result in -inf probabilities are the following:

EX000347
M'sheath
M'sphere
-3.4796
seanews2
000143423
3:00:00.0
0.221488

That is, given an n-gram such as "This is a", if we append any of these tokens as a suffix ("This is a M'sheath"), the log probability will be -inf.

About the unnormalised probabilities, here are some trigrams that show the error:

", it was" ---> This happens because the word "arms" has a probability of 0.9999999403953557, given the prefix.
"the New York" ---> "138" has a probability of 1.0
"Exchange did n't" ---> TotalP:1.512985473173851, no specific suffix had a probability higher than 0.2
"apart Friday as" ---> TotalP:1.1713151107775326, no specific suffix
"190.58 points -- " ---> TotalP:1.7277926300832143, no specific suffix
"final hour --" ---> TotalP:1.252381507023826, no specific suffix
"-- it barely" ---> TotalP:1.6572820059720237, no specific suffix
" '' installed after" ---> TotalP:1.3595556547646706, no specific suffix (only "the", but it makes sense)
"<S> The" (bigram) ---> TotalP:2.111923838483038, no specific suffix

So it seems to be a pretty serious normalisation problem, because it occurs in almost every sentence (I'm trying to analyse the Penn Treebank). Maybe something goes wrong at some point during the Kneser-Ney computations. Please let me know if you need more information, or which part of the code I should look at.

Thank you so much, I know that this is not your normal job, I really appreciate your help.

Kind Regards
Jesús

Adam Pauls

Dec 8, 2014, 8:32:04 PM
to berkeleyl...@googlegroups.com
I believe the problem might be that you're not telling the word indexer what the start, end, and unk symbols are. See an example of that here:

...

iesus.c...@gmail.com

Dec 9, 2014, 6:25:27 AM
to berkeleyl...@googlegroups.com
I checked, but at the end of "GoogleLmReader.addToIndexer(wordIndexer, vocabFile);" the method calls
"addSpecialSymbols(wordIndexer);", which contains the following:

private static <W> void addSpecialSymbols(final WordIndexer<W> wordIndexer) {
    wordIndexer.setStartSymbol(wordIndexer.getWord(wordIndexer.getOrAddIndexFromString(START_SYMBOL)));
    wordIndexer.setEndSymbol(wordIndexer.getWord(wordIndexer.getOrAddIndexFromString(END_SYMBOL)));
    wordIndexer.setUnkSymbol(wordIndexer.getWord(wordIndexer.getOrAddIndexFromString(UNK_SYMBOL)));
}


and the special symbols are defined in GoogleLmReader:

private static final String START_SYMBOL = "<S>";
private static final String END_SYMBOL = "</S>";
private static final String UNK_SYMBOL = "<UNK>";


So I think that does the trick but let me know otherwise. 

Kind Regards

Jesús

Adam Pauls

Dec 16, 2014, 11:56:24 PM
to berkeleyl...@googlegroups.com
Okay, seems like that may not be the issue. 

As I said, I probably won't be able to help you unless you can give me some data and a command I can run that reproduces the problem. If you can try debugging yourself (with print statements or a graphical debugger), and can find the exact point where the -Inf comes from, I may also be able to help then.

iesus.c...@gmail.com

Jan 28, 2015, 8:20:15 AM
to berkeleyl...@googlegroups.com
Thanks again for your time and help. I've been trying to debug this myself. The issue also appears with the tiny googledir folder that comes with the test set, so reproducing it is at least feasible. Basically, I created a Kneser-Ney language model using the following:

public static void main(final String[] argv) {
    final String googleDir = "EclipseWorkspace//BerkeleyLM_Surprent//test//edu/berkeley//nlp//lm//io//googledir";
    System.out.println("Reading Lm File " + googleDir + " . . . ");
    String vocabFile = googleDir + "//1gms//vocab_cs.gz";
    StringWordIndexer wordIndexer = new StringWordIndexer();
    GoogleLmReader.addToIndexer(wordIndexer, vocabFile);
    final NgramLanguageModel<String> lm = LmReaders.readLmFromGoogleNgramDir(googleDir, true, true, wordIndexer, new ConfigOptions());
    final String outFile = "testNgramGoogleKN";
    System.out.println("Writing to file " + outFile + " . . . ");
    LmReaders.writeLmBinary(lm, outFile);
}
 
and I load it using the following:

private static NgramLanguageModel<String> readBinary(String binaryFile) {
    NgramLanguageModel<String> lm = null;
    Logger.startTrack("Reading LM Binary " + binaryFile);
    lm = LmReaders.readLmBinary(binaryFile);
    Logger.endTrack();
    return lm;
}

Then I wanted to analyse prefixes of the form: <S> This is a

The lookup is done by the method
public float getLogProb(final int[] ngram, final int startPos, final int endPos)

Depending on the n-gram, if the word is not in the vocabulary it returns -Infinity.

During debugging I inspected the "values" object in ArrayEncodedProbBackoffLm (the class the language model is instantiated as), and noticed that both -Infinity and NaN are present in backoffsForRank and probsForRank. So I guess the problem lies in the computation of the language model, but I'm not sure. I guess you have tested the program more with smoothing methods other than Kneser-Ney, so maybe the architecture for querying the model is fine and the issue lies only in the Kneser-Ney computation.
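
To quantify how many stored values are affected, a tally like the one below could be run over each table. This is a minimal sketch: `NonFiniteScan` and its methods are hypothetical names, and it assumes the stored probabilities or backoffs can be copied into a plain float[], which may not match how BerkeleyLM actually holds them internally.

```java
final class NonFiniteScan {
    /** Returns {count of -Infinity, count of +Infinity, count of NaN} in the given values. */
    static int[] countNonFinite(float[] values) {
        int negInf = 0;
        int posInf = 0;
        int nan = 0;
        for (float v : values) {
            if (Float.isNaN(v)) {
                nan++;
            } else if (v == Float.NEGATIVE_INFINITY) {
                negInf++;
            } else if (v == Float.POSITIVE_INFINITY) {
                posInf++;
            }
        }
        return new int[] {negInf, posInf, nan};
    }

    /** Index of the first NaN or infinite entry, or -1 if every value is finite. */
    static int firstNonFinite(float[] values) {
        for (int i = 0; i < values.length; i++) {
            if (Float.isNaN(values[i]) || Float.isInfinite(values[i])) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Made-up values standing in for one table of stored log probabilities
        float[] probs = {-1.2f, Float.NEGATIVE_INFINITY, -0.3f, Float.NaN};
        int[] counts = countNonFinite(probs);
        System.out.println("-Inf: " + counts[0] + ", +Inf: " + counts[1] + ", NaN: " + counts[2]
                + ", first bad index: " + firstNonFinite(probs));
    }
}
```

Knowing the index of the first bad entry gives a concrete value to trace back to the n-gram that produced it.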

In general, -Infinity appears every time a word that is not in the vocabulary appears, or when the model has to back off to lower-order n-grams.

During the language model computation I also checked some parts of the process, but it is very complicated and, without much guidance, very hard to follow. I noticed that counts are only saved for the highest and second-highest order n-grams, so unigram counts are always lost. Is that normal? (It appears they are kept only if the n-gram begins with <S>.) That would explain why the model gets a 0 probability (-Infinity) when it has to back off to lower orders.
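
To illustrate why a missing unigram would produce -Infinity, here is a toy backoff lookup. It is a generic sketch of the standard backoff recursion and is not BerkeleyLM's implementation: `ToyBackoffLm`, `put` and `logProb` are hypothetical names, all numbers are made up, and a missing backoff weight is assumed to be 0 in log space.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ToyBackoffLm {
    private final Map<List<String>, Double> logProbs = new HashMap<>();
    private final Map<List<String>, Double> backoffs = new HashMap<>();

    /** Stores a log10 probability and a log10 backoff weight for one n-gram. */
    void put(double log10Prob, double log10Backoff, String... ngram) {
        List<String> key = Arrays.asList(ngram);
        logProbs.put(key, log10Prob);
        backoffs.put(key, log10Backoff);
    }

    /** Returns the stored value if present; otherwise backoff(context) + logProb(shorter n-gram). */
    double logProb(List<String> ngram) {
        Double stored = logProbs.get(ngram);
        if (stored != null) {
            return stored;
        }
        if (ngram.size() == 1) {
            return Double.NEGATIVE_INFINITY; // no unigram entry: nothing left to back off to
        }
        List<String> context = ngram.subList(0, ngram.size() - 1);
        List<String> shorter = ngram.subList(1, ngram.size());
        double backoff = backoffs.getOrDefault(context, 0.0); // assumed default: log10(1) = 0
        return backoff + logProb(shorter);
    }

    public static void main(String[] args) {
        ToyBackoffLm lm = new ToyBackoffLm();
        lm.put(-1.0, -0.5, "is");
        lm.put(-2.0, -0.3, "a");
        lm.put(-0.7, 0.0, "is", "a");
        System.out.println(lm.logProb(Arrays.asList("is", "a")));        // stored bigram
        System.out.println(lm.logProb(Arrays.asList("a", "is")));        // backs off to the unigram "is"
        System.out.println(lm.logProb(Arrays.asList("is", "M'sheath"))); // unigram missing
    }
}
```

In this toy model every unseen higher-order n-gram whose final word has no unigram entry ends up at -Infinity, which matches the symptom described above if unigram values really are lost.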

I don't really have a data set of my own; my data is the Google n-gram corpus for English. I don't know if I'm allowed to redistribute it, but I guess you have a copy. As for my code, I can send you my project if you want.

Thanks again!
Jesús