public void handleNgramOrderFinished(int order)
public void handleNgramOrderStarted(int order)
public void cleanup()
are empty. Could that have something to do with it? It seems that some words are outliers and are not handled properly, which introduces normalisation problems. Could you give me some advice?
Thank you very much!
Jesús
final String googleDir = argv[0];
final String vocabFile = argv[0]+"//1gms//vocab_cs.gz";
StringWordIndexer wordIndexer=new StringWordIndexer();
GoogleLmReader.addToIndexer(wordIndexer, vocabFile);
final NgramLanguageModel<String> lm = LmReaders.readLmFromGoogleNgramDir(googleDir, true, true, wordIndexer, new ConfigOptions());
final String outFile = argv[1];
Logger.startTrack("Writing to file " + outFile + " . . . ");
LmReaders.writeLmBinary(lm, outFile);
private double computeEntPerContext( NgramLanguageModel<String> lm, List<String> context){
Counter<String> contDistribution= NgramLanguageModel.StaticMethods.getDistributionOverNextWordsBounded(lm, context,entVoc);
double entropy=0.0;
double regProof=0.0;
Double singleLog10Prob=0.0;
Double singleLnProb=0.0;
Double singleProb=0.0;
for(String word:contDistribution.keySet()){
singleLog10Prob=contDistribution.getCount(word);//This gives log10probs for each word continuation
if(singleLog10Prob.isInfinite()){
System.out.println("\nInfinite number:"+singleLog10Prob);
System.out.println("Word:"+word+"\n");
}else{
singleLnProb=singleLog10Prob*Math.log(10); //We convert it to ln
singleProb=Math.exp(singleLnProb);
//System.out.println("Word: "+word+" prob:"+singleCount);
Double smallNumber=singleProb*singleLnProb;
entropy+=smallNumber;
regProof+=singleProb;
}//endElse
}//endFor
System.out.println("\nTotalP:"+regProof+"\n");//Should be as close to 1.0 as possible
return entropy*-1.0;
}
The method that generates the probability distribution is a slightly modified version of the one in the public interface NgramLanguageModel<W>. The difference is that, instead of computing the distribution over the complete vocabulary, it only uses the first 600 000 tokens, which are contained in "entVocab".
public static <W> Counter<W> getDistributionOverNextWordsBounded(final NgramLanguageModel<W> lm, List<W> context, ArrayList<W> entVocab) {
List<W> ngram = new ArrayList<W>();
for (int i = 0; i < lm.getLmOrder() - 1 && i < context.size(); ++i) {
ngram.add(context.get(context.size() - i - 1));
}
Collections.reverse(ngram);
ngram.add(null);
Counter<W> c = new Counter<W>();
for (int index = 0; index < entVocab.size(); ++index) {
W word = entVocab.get(index);
if (word.equals(lm.getWordIndexer().getStartSymbol())) continue;
ngram.set(ngram.size() - 1, word);
//c.setCount(word, Math.exp(lm.getLogProb(ngram) * Math.log(10))); //The output of getLogProb is base-10 logprobs
c.setCount(word, lm.getLogProb(ngram)); //we can live with log10probs
}
return c;// p*logp = e^(log p + log log p)
}
So I guess the main issue lies in "lm.getLogProb(ngram)", which is a direct call to the language model.
I hope the main issue was clear.
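To make the calculation easier to check in isolation, here is a minimal, self-contained sketch of the entropy step that does not depend on BerkeleyLM at all. It takes a plain map from word to log10 probability, skips -inf (and NaN) entries, and reports the total probability mass separately. The class name EntropySketch and both method names are hypothetical, and renormalising over the finite entries is an assumption of this sketch; computeEntPerContext above does not renormalise.

```java
import java.util.Map;

/** Hypothetical helper for illustration only; not part of BerkeleyLM. */
final class EntropySketch {

    /** Sums the probabilities of all entries whose log10 probability is finite. */
    static double totalMass(Map<String, Double> log10Probs) {
        double total = 0.0;
        for (double log10p : log10Probs.values()) {
            if (Double.isInfinite(log10p) || Double.isNaN(log10p)) {
                continue; // -inf means probability zero; it adds no mass
            }
            total += Math.pow(10.0, log10p);
        }
        return total;
    }

    /**
     * Entropy in nats, computed after renormalising over the finite entries
     * (an assumption of this sketch, so that the probabilities sum to 1).
     */
    static double entropyNats(Map<String, Double> log10Probs) {
        double total = totalMass(log10Probs);
        if (total <= 0.0) {
            return 0.0; // no usable mass, nothing to measure
        }
        double entropy = 0.0;
        for (double log10p : log10Probs.values()) {
            if (Double.isInfinite(log10p) || Double.isNaN(log10p)) {
                continue; // skipped words contribute p * ln(p) = 0 in the limit
            }
            double p = Math.pow(10.0, log10p) / total;
            if (p > 0.0) {
                entropy -= p * Math.log(p);
            }
        }
        return entropy;
    }
}
```

Calling EntropySketch.totalMass on the same values that go into the Counter should give the number printed as "TotalP", so a large gap from 1.0 there would point at the probabilities returned by the model rather than at the entropy arithmetic.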
Regarding the input and output, the tokens among the first 600 000 in the vocabulary that result in -inf probabilities are the following:
...
private static <W> void addSpecialSymbols(final WordIndexer<W> wordIndexer) {
wordIndexer.setStartSymbol(wordIndexer.getWord(wordIndexer.getOrAddIndexFromString(START_SYMBOL)));
wordIndexer.setEndSymbol(wordIndexer.getWord(wordIndexer.getOrAddIndexFromString(END_SYMBOL)));
wordIndexer.setUnkSymbol(wordIndexer.getWord(wordIndexer.getOrAddIndexFromString(UNK_SYMBOL)));
}
and the special symbols are defined in GoogleLmReader:
private static final String START_SYMBOL = "<S>";
private static final String END_SYMBOL = "</S>";
private static final String UNK_SYMBOL = "<UNK>";
So I think that does the trick, but let me know otherwise.
Kind Regards
Jesús
public static void main(final String[] argv) {
final String googleDir = "EclipseWorkspace//BerkeleyLM_Surprent//test//edu/berkeley//nlp//lm//io//googledir";
System.out.println("Reading Lm File " + googleDir + " . . . ");
String vocabFile = googleDir+"//1gms//vocab_cs.gz";
StringWordIndexer wordIndexer=new StringWordIndexer();
GoogleLmReader.addToIndexer(wordIndexer, vocabFile);
final NgramLanguageModel<String> lm = LmReaders.readLmFromGoogleNgramDir(googleDir, true, true, wordIndexer, new ConfigOptions());
final String outFile = "testNgramGoogleKN";
System.out.println("Writing to file " + outFile + " . . . ");
LmReaders.writeLmBinary(lm, outFile);
}
private static NgramLanguageModel<String> readBinary(String binaryFile) {
NgramLanguageModel<String> lm = null;
Logger.startTrack("Reading LM Binary " + binaryFile);
lm = LmReaders.readLmBinary(binaryFile);
Logger.endTrack();
return lm;
}