Sample code for Google N-gram with Kneser-Ney language model

Dina Kayumova

Jun 21, 2013, 4:36:34 AM

to berkeleyl...@googlegroups.com

Hello, Adam!

First of all, I'd like to thank you for developing and sharing your toolkit. It is really very useful.

I have some questions about the correct use of some methods, and I would appreciate your help.

As I understand it, the getLogProb() method in the StupidBackOff language model calculates a language model score for an n-gram, which is not the same as a probability. So if I want genuine n-gram probabilities, I have to use a Kneser-Ney language model. Is it possible to generate a Kneser-Ney LM from Google N-gram?

I tried to do this by passing 'kneserNey = true' to the LmReaders.readLmFromGoogleNgramDir(final String dir, final boolean compress, final boolean kneserNey, final WordIndexer<W> wordIndexer, final ConfigOptions opts) method. After that, KneserNeyLmReaderCallback failed with an exception while trying to get the START_SYMBOL hash code.

To prevent this error, I tried to create a WordIndexer object like this:
final StringWordIndexer wordIndexer = new StringWordIndexer();
wordIndexer.setStartSymbol(GoogleLmReader.START_SYMBOL);
wordIndexer.setEndSymbol(GoogleLmReader.END_SYMBOL);
wordIndexer.setUnkSymbol(GoogleLmReader.UNK_SYMBOL);

and pass it to the readLmFromGoogleNgramDir method. However, GoogleLmReader.START_SYMBOL is a private member. Was it made private on purpose? Or can I simply make it public without causing any problems?

Would you be so kind as to provide an example of creating a Kneser-Ney language model from Google N-gram?
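
A note on the distinction drawn above between a Stupid Backoff score and a probability. The following is a minimal, self-contained sketch on a toy corpus, not BerkeleyLM's implementation. It assumes the usual Stupid Backoff recipe for a bigram model (relative frequency when the bigram was seen, otherwise the unigram relative frequency scaled by a fixed factor, here 0.4). The class name, method names, and corpus are made up for illustration. Summing the scores over the whole vocabulary for one context gives more than 1, which is why such scores cannot be read as probabilities.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy bigram Stupid Backoff scorer (illustrative only, not the BerkeleyLM code). */
final class StupidBackoffSketch {
    /** Fixed backoff factor; 0.4 is the commonly used value. */
    static final double ALPHA = 0.4;

    final Map<String, Integer> unigrams = new HashMap<>();
    final Map<String, Integer> bigrams = new HashMap<>();
    int totalTokens = 0;

    StupidBackoffSketch(List<String> tokens) {
        for (int i = 0; i < tokens.size(); i++) {
            unigrams.merge(tokens.get(i), 1, Integer::sum);
            totalTokens++;
            if (i + 1 < tokens.size()) {
                bigrams.merge(tokens.get(i) + " " + tokens.get(i + 1), 1, Integer::sum);
            }
        }
    }

    /** Score of `word` following `context`; `context` must occur in the corpus. */
    double score(String context, String word) {
        int pair = bigrams.getOrDefault(context + " " + word, 0);
        if (pair > 0) {
            // Seen bigram: plain relative frequency, with no discounting.
            return (double) pair / unigrams.get(context);
        }
        // Unseen bigram: back off to the unigram relative frequency, scaled by ALPHA.
        return ALPHA * unigrams.getOrDefault(word, 0) / (double) totalTokens;
    }

    /** Sum of the scores of every vocabulary word after `context`. */
    double totalScore(String context) {
        double sum = 0.0;
        for (String word : unigrams.keySet()) {
            sum += score(context, word);
        }
        return sum;
    }

    public static void main(String[] args) {
        StupidBackoffSketch lm = new StupidBackoffSketch(Arrays.asList(
                "the", "cat", "sat", "the", "dog", "sat", "the", "cat", "ran"));
        // The seen bigrams after "the" already sum to 1; the backed-off
        // scores come on top of that, so the total exceeds 1.
        System.out.println("Sum of scores after 'the': " + lm.totalScore("the"));
    }
}
```

A smoothed model such as Kneser-Ney discounts the seen counts so that the backed-off mass fits inside a total of 1, which is what makes its outputs usable as probabilities.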

Adam Pauls

Jun 21, 2013, 12:50:27 PM

to berkeleyl...@googlegroups.com

Hmm. I'm not entirely sure I tested the functionality with kneserNey = true, and KneserNey estimation takes far more memory than stupid backoff. But if you think you have enough RAM, then by all means. (Note: I believe it takes something like 80G of RAM to build the Google N-gram binaries, so I really hope you have lots of RAM!)

I think the call you're looking for is at

https://code.google.com/p/berkeleylm/source/browse/trunk/src/edu/berkeley/nlp/lm/io/GoogleLmReader.java#135

You can create a wordIndexer and pass it into that function to have it properly initialized. You will need to give it the path to the vocab_cs.gz file. Let me know if that doesn't work.

Adam

--
You received this message because you are subscribed to the Google Groups "berkeleylm-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to berkeleylm-disc...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Dina Kayumova

Jun 24, 2013, 10:17:49 AM

to berkeleyl...@googlegroups.com

I created a Kneser-Ney binary file with LM order = 2 without any problem. Then I got access to a computer with lots of RAM and tried to extend the LM order to 4.
For that I used the following code in MakeLmBinaryFromGoogle.main:

final StringWordIndexer wordIndexer = new StringWordIndexer();

GoogleLmReader.addToIndexer(wordIndexer, googleDir+"/1gms/vocab_cs.gz");
final ArrayEncodedNgramLanguageModel<String> lm = LmReaders.readLmFromGoogleNgramDir(googleDir, false, true, wordIndexer, new ConfigOptions());


During computation I got the following error:
Exception in thread "main" java.lang.AssertionError
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.getLowerOrderBackoff(KneserNeyLmReaderCallback.java:211)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.getProbBackoff(KneserNeyLmReaderCallback.java:342)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.parse(KneserNeyLmReaderCallback.java:306)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.parse(KneserNeyLmReaderCallback.java:37)
    at edu.berkeley.nlp.lm.io.LmReaders.firstPassCommon(LmReaders.java:553)
    at edu.berkeley.nlp.lm.io.LmReaders.firstPassArpa(LmReaders.java:530)
    at edu.berkeley.nlp.lm.io.LmReaders.readArrayEncodedLmFromArpa(LmReaders.java:171)
    at edu.berkeley.nlp.lm.io.LmReaders.readLmFromGoogleNgramDir(LmReaders.java:224)
    at edu.berkeley.nlp.lm.io.MakeLmBinaryFromGoogle.main(MakeLmBinaryFromGoogle.java:99)

In the output stream I found the following:

...
Counting values {
                Writing Kneser-Ney probabilities {
                        Counting counts for order 0 {
                        } [0s]
                        Counting counts for order 1 {
                        } [7s]
                        Counting counts for order 2 {
                        } [8s]
                        Counting counts for order 3 {
                        } [4s]
                        On order 1
                        Writting line 1
                        ...
                        On order 2
                        Writting line ...
   ...

However, I don't have any problems while generating a Stupid Backoff binary file (LM order = 4) from the same <n>gms folders.
Do you have any idea what causes the problem?

Adam Pauls

Jun 24, 2013, 11:22:02 AM

to berkeleyl...@googlegroups.com

Did order = 3 work? This might take a while to debug, so I'll have to get back to you. If you're interested, you should try printing out the n-grams as they are read in, so you know which n-gram is problematic.

Dina Kayumova

Jun 25, 2013, 11:00:36 AM

to berkeleyl...@googlegroups.com

Hi, Adam!
Thank you for your reply.

Order = 3 worked. Order = 4 failed on a bigram that seems quite normal. Both of its words are in the dictionary. I checked the debug output and learned that both counts.tokenCounts and counts.dotdotTypeCounts from KneserNeyLmReaderCallback.java are equal to '-1'. Unfortunately, that doesn't mean anything to me yet.

Please let me know if you learn anything about this.

Adam Pauls

Jun 25, 2013, 11:19:34 AM

to berkeleyl...@googlegroups.com

Are you using a small enough data set that you can run it on 5? I'm wondering if this is specific to order = 4, or to your data set.

Dina Kayumova

Jun 27, 2013, 3:19:48 AM

to berkeleyl...@googlegroups.com

Hi, Adam!
Thank you for your reply.

I tried to run order = 5, but it failed. I've done some research and got some results that may help in solving my issue.

I put some debug output into KneserNeyLmReaderCallback.java.

In 'public void parse(ArpaLmReaderCallback<ProbBackoffPair> callback)' I added output to show the n-gram:
System.out.println("Gram:\t"+phrase); //debug output
final int endPos = ngram.length;
final int startPos = 0;
ProbBackoffPair value = getProbBackoff(ngram, startPos, endPos);

Then I added a print of the offset in the getCounts call:

final long offset = ngrams.getOffsetForNgramInModel(key, startPos, endPos);
System.out.println("offset = " + String.valueOf(offset)); //debug output
if (offset < 0)
        return value;

Before the assert in 'protected float getLowerOrderBackoff(final int[] ngram, final int startPos, final int endPos)' I also added output:

final long backoffDenom = (endPos - startPos == lmOrder - 1 || ngram[startPos] == startIndex) ? counts.tokenCounts : counts.dotdotTypeCounts;
System.out.println(String.valueOf(startPos) + "\t" + String.valueOf(endPos) + "\t"
                + String.valueOf(counts.tokenCounts) + "\t"
                + String.valueOf(counts.dotdotTypeCounts) + "\t" + String.valueOf(backoffDenom)); // debug output
assert backoffDenom >= 0;

Then I reran orders 2 to 5 and got the following output:

Order = 2

Gram:    молодых    деятелей
offset = 30365332
offset = 2311642
offset = 2311642
0    1    50687265    -1    50687265
offset = 768327

Gram:    восторге    у
offset = 30365333
offset = 3723488
offset = 3723488
0    1    3444542    -1    3444542
offset = 723461
            Load factor for 1: 1.0
            Load factor for 2: 0.6831758701120106
        } [13m3s]
    } [13m3s]
} [29m52s]

Order = 4

Gram:    молодых    деятелей
offset = 30365332
offset = 2311642
offset = 2311642
0    1    -1    1123    1123
offset = 768327
offset = 30365332
0    2    -1    1    1
Gram:    восторге    у
offset = 30365333
offset = 3723488
offset = 3723488
0    1    -1    42    42
offset = 723461
offset = 30365333
0    2    -1    -1    -1
<ERROR!!!>

Order = 5

Gram:    молодых    деятелей
offset = 30365332
offset = 2311642
offset = 2311642
0    1    -1    1123    1123
offset = 768327
offset = 30365332
0    2    -1    1    1
Gram:    восторге    у
offset = 30365333
offset = 3723488
offset = 3723488
0    1    -1    42    42
offset = 723461
offset = 30365333
0    2    -1    -1    -1
<ERROR!!!>

Both order 4 and order 5 reported an assertion error on the n-gram 'Gram:    восторге    у'. I searched the output from order = 3 but couldn't find 'Gram:    восторге    у'.
Is it normal for different orders to process different n-grams? Or maybe I did something wrong that caused order = 3 not to include this n-gram in processing?

As I understand from the code, and as we can see from the debug output, the error is raised when counts.dotdotTypeCounts == -1 and counts.tokenCounts == -1.

The 'counts' object is initialized as follows:
final KneserNeyCounts counts = getCounts(ngram, startPos, endPos, true);

So let's examine the 'getCounts' call.
As we can see from the debug output, startPos = 0 and endPos = 2, so the first condition (startPos == endPos) doesn't hold.

Then, in the 'getCounts' call, we get the offset, which appears to be positive:

final long offset = ngrams.getOffsetForNgramInModel(key, startPos, endPos);
System.out.println("offset = " + String.valueOf(offset));

The result of the 'getCounts' call is the 'value' variable, which is initialized in the following line:

ngrams.getValues().getFromOffset(offset, endPos - startPos - 1, value);

The 'value' variable contains the 'dotdotTypeCounts' and 'tokenCounts' fields. The values of these fields cause the assertion error. Let's examine their initialization in the 'getFromOffset' call.

In the 'getFromOffset' call we can see the following:
outputVal.tokenCounts = isHighestOrder ? tokenCounts.get(offset) : (isSecondHighestOrder ? getSafe(offset, prefixTokenCounts) : -1);
outputVal.dotdotTypeCounts = (int) ((isHighestOrder || isSecondHighestOrder || (offset >= dotdotTypeCounts[ngramOrder].size())) ? -1
            : dotdotTypeCounts[ngramOrder].get(offset));

Thus we can conclude that isSecondHighestOrder == false, isHighestOrder == false, and offset >= dotdotTypeCounts[ngramOrder].size().

At that point my investigation stalled. What is dotdotTypeCounts, and how is it linked to the offset value? I am trying to move forward and would appreciate any ideas and thoughts from your side.

Thank you very much for helping me.
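
The reasoning in this post can be replayed in a small, self-contained sketch. It copies the two conditional assignments quoted from getFromOffset and the backoffDenom selection quoted from getLowerOrderBackoff, but replaces the library's storage with plain arrays. The two boolean flags are passed in directly because the thread does not show how the library computes them, getSafe is a hypothetical bounds-checked read, and the array contents are toy values. With both flags false and an offset past the end of the type-count array, both fields stay at -1 and the condition checked by the assert fails.

```java
/** Simplified stand-in for the count lookup discussed above (not the BerkeleyLM code). */
final class CountLookupSketch {
    static final long MISSING = -1;

    /** Holder for the two fields examined in the thread. */
    static final class Counts {
        long tokenCounts = MISSING;
        long dotdotTypeCounts = MISSING;
    }

    /** Hypothetical bounds-checked read; the library's own getSafe may behave differently. */
    static long getSafe(int offset, long[] values) {
        return offset < values.length ? values[offset] : MISSING;
    }

    /** Mirrors the two assignments quoted from getFromOffset, using plain arrays as storage. */
    static Counts getFromOffset(int offset, boolean isHighestOrder, boolean isSecondHighestOrder,
            long[] tokenCounts, long[] prefixTokenCounts, long[] dotdotTypeCountsForOrder) {
        Counts out = new Counts();
        out.tokenCounts = isHighestOrder ? tokenCounts[offset]
                : (isSecondHighestOrder ? getSafe(offset, prefixTokenCounts) : MISSING);
        out.dotdotTypeCounts = (isHighestOrder || isSecondHighestOrder
                || offset >= dotdotTypeCountsForOrder.length) ? MISSING : dotdotTypeCountsForOrder[offset];
        return out;
    }

    /** Mirrors the backoffDenom selection quoted from getLowerOrderBackoff. */
    static long backoffDenom(Counts counts, int ngramLength, int lmOrder, boolean startsWithStartSymbol) {
        return (ngramLength == lmOrder - 1 || startsWithStartSymbol)
                ? counts.tokenCounts : counts.dotdotTypeCounts;
    }

    public static void main(String[] args) {
        long[] none = new long[0];
        long[] dotdot = {1123, 42}; // toy type counts, stored for offsets 0 and 1 only

        // Offset inside the array: a usable denominator comes back.
        Counts ok = getFromOffset(1, false, false, none, none, dotdot);
        System.out.println(backoffDenom(ok, 2, 4, false)); // prints 42

        // Offset past the end of the array: both fields stay -1,
        // so "assert backoffDenom >= 0" would fail.
        Counts bad = getFromOffset(5, false, false, none, none, dotdot);
        System.out.println(backoffDenom(bad, 2, 4, false)); // prints -1
    }
}
```

This only shows which branch produces the -1 values. It does not explain why the offset would exceed the size of the stored type counts in the real run.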

Adam Pauls

Jun 27, 2013, 11:50:37 AM

to berkeleyl...@googlegroups.com

dotdotTypeCounts is the number of unique words that surround an n-gram (think [dot] word1 word2 [dot], and count the number of unique ways of filling in the dots). I'm really confused as to what's going on here. I'll probably be able to debug it myself next week.
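
The description above can be made concrete with a small, self-contained sketch. It is not BerkeleyLM's code: the class and method names are made up, it scans a flat token list, and it ignores sentence boundaries and however the library actually stores these counts. It counts the distinct (left word, right word) pairs that surround occurrences of an n-gram, that is, the number of unique ways of filling in the two dots.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy illustration of a "dot-dot" type count (not the BerkeleyLM code). */
final class DotDotTypeCountSketch {

    /** Counts distinct (left, right) word pairs around occurrences of `ngram` in `tokens`. */
    static int dotDotTypeCount(List<String> tokens, List<String> ngram) {
        Set<String> contexts = new HashSet<>();
        int n = ngram.size();
        // Start at 1 and stop early so that both a left and a right neighbour exist.
        for (int i = 1; i + n < tokens.size(); i++) {
            if (tokens.subList(i, i + n).equals(ngram)) {
                contexts.add(tokens.get(i - 1) + "\t" + tokens.get(i + n));
            }
        }
        return contexts.size();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "we", "saw", "the", "cat", "today", "and", "they", "saw", "the", "cat",
                "today", "but", "i", "fed", "the", "cat", "once");
        // "the cat" occurs three times, but only two distinct contexts surround it:
        // (saw, today) and (fed, once).
        System.out.println(dotDotTypeCount(tokens, Arrays.asList("the", "cat"))); // prints 2
    }
}
```

The point of the example is the difference between token counts and type counts: "the cat" has a token count of 3 here, but a type count of 2 because one context repeats.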

Dina Kayumova

Jun 27, 2013, 4:54:31 PM

to berkeleyl...@googlegroups.com

Adam, my dataset doesn't contain any <s> or </s> symbols. I wonder whether it may be the cause of the 'bad' gram stuff.

Adam Pauls

Jun 27, 2013, 7:30:33 PM

to berkeleyl...@googlegroups.com

The estimation code should insert them for you. I don't think that's the issue.

Adam Pauls

Jul 4, 2013, 5:25:08 PM

to berkeleyl...@googlegroups.com

I'm finally back to debugging. On second thought, the lack of <S> and </S> may actually cause a problem (because the Google n-gram corpus already has them). Can you try inserting some and seeing if that fixes the problem? In the meantime, I'm writing some tests.

Adam

Adam Pauls

Jul 5, 2013, 2:52:18 PM

to berkeleyl...@googlegroups.com

I ran some tests and haven't been able to replicate the crash. Any chance you can share your data set (and the command you're running) so I can debug myself?

Adam

Dina Kayumova

Jul 8, 2013, 7:00:43 AM

to berkeleyl...@googlegroups.com

Hi, Adam!

I haven't had a chance to add <s> and </s> to my data yet.

Here is a link to my dataset: https://www.dropbox.com/sh/ocn4mq9z4qbk342/QXdU5-xLzb

The code I use to create the Kneser-Ney binary file:

final String googleDir = argv[0];

final StringWordIndexer wordIndexer = new StringWordIndexer();

GoogleLmReader.addToIndexer(wordIndexer, googleDir + "/1gms/vocab_cs.gz");
final ArrayEncodedNgramLanguageModel<String> lm = LmReaders.readLmFromGoogleNgramDir(googleDir, false, true, wordIndexer, new ConfigOptions());

Logger.endTrack();
final String outFile = argv[1];
IOUtils.writeObjFileHard(outFile, lm);

Wish you good luck.

Adam Pauls

Jul 9, 2013, 8:21:53 AM

to berkeleyl...@googlegroups.com

How come the data set you gave me only has up to 4 grams, even though you ran to order 5?

Dina Kayumova

Jul 9, 2013, 8:28:37 AM

to berkeleyl...@googlegroups.com

I didn't upload order 5 because I thought order 4 would be enough for debugging. I'll upload order 5 now.

iesus.c...@gmail.com

Jul 17, 2014, 9:03:29 AM

to berkeleyl...@googlegroups.com

Hi,

I'm also trying to build a language model with probabilities from the Google n-gram corpus, in my case for English. Since I have to share the server I use, I was wondering how many resources you actually needed to compute the Kneser-Ney language model.

You mentioned around 80GB of memory. Is that for processing or for storage? That is, do you need the 80GB only once, after which the language model has a normal size (around 25GB) during use? Also, how long does it take to compute the language model?

I would be really grateful if you could help me :)

Kind Regards

Jesús

Adam Pauls

Sep 7, 2014, 2:41:53 PM

to berkeleyl...@googlegroups.com

Sorry, getting back to debugging. I've lost context here. Can you give me a dataset and command line that will reproduce the error?
