Google N-gram language model of order 4

iesus.c...@gmail.com

Aug 4, 2014, 7:40:18 AM

to berkeleyl...@googlegroups.com

Hi,

I'm trying to build a language model with n-grams of order 4 using the Google N-gram corpus for English and Kneser-Ney smoothing. I think this has been asked before, but at some point the threads stop.

So basically, I set the Kneser-Ney flag to true, create a word indexer, and initialize it as you describe in another thread. Everything runs smoothly until it starts processing the 4-gram files; then the Java runtime reports a "fatal error". I don't think memory is the issue, since I'm using a server with 500 GB of RAM.

Given that this has been asked before, maybe someone has figured it out already. I haven't started debugging myself because each run takes a long time, so I was really hoping someone could give me some pointers.

Kind Regards

Jesús

Adam Pauls

Aug 4, 2014, 3:01:48 PM

to berkeleyl...@googlegroups.com

The code that builds a KN-smoothed model is much less memory-efficient than the code that builds stupid backoff models. It may well use more than 500GB of RAM :(

Can you show the exact error you received? Did you make sure to give the VM enough memory (-mx500g)?
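
A quick way to confirm that the heap flag actually reached the VM is to print the limit from inside Java. This is only a sketch: the class name HeapCheck and the helper toGiB are made up for illustration and are not part of BerkeleyLM; Runtime.maxMemory() is the standard JDK call.

```java
// Minimal sketch: report the heap ceiling the running JVM actually received.
final class HeapCheck {

    /** Converts a byte count to whole gibibytes, rounding down. */
    static long toGiB(long bytes) {
        return bytes / (1024L * 1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() is the most the VM will try to use for the heap;
        // it returns Long.MAX_VALUE when there is no inherent limit.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + toGiB(maxBytes) + " GiB");
    }
}
```

Run it with the same memory flag as the real job. The reported figure can come out somewhat below the requested size, depending on how the selected collector lays out the heap.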

--
You received this message because you are subscribed to the Google Groups "berkeleylm-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to berkeleylm-disc...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

iesus.c...@gmail.com

Aug 5, 2014, 2:54:21 PM

to berkeleyl...@googlegroups.com

Here is the terminal output. I had set the heap to 450 GB (-Xmx450g); I'm going to try 500 GB now. I hadn't wanted to disturb the other users:

Reading ngrams of order 4 {
Reading ngrams from file /scratch/common/nobackup/calvillo/google5grams/4gms/4gm-0000.gz {
Line 0
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fa73e8fc5b4, pid=48266, tid=139871353304832
#
# JRE version: Java(TM) SE Runtime Environment (7.0_51-b13) (build 1.7.0_51-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.51-b03 mixed mode linux-amd64 )
# Problematic frame:
# V [libjvm.so+0x3ba5b4] CardTableExtension::scavenge_contents_parallel(ObjectStartArray*, MutableSpace*, HeapWord*, PSPromotionManager*, unsigned int, unsigned int)+0x204
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
[thread 139871324882688 also had an error]
# An error report file with more information is saved as:
# [thread 139871324882688 also had an error]
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#

I've also attached the complete output and the log file that Java generated, just in case it helps :P

Thanks!

Jesús

hs_err_pid41125.log

outTerminal.txt

Adam Pauls

Aug 5, 2014, 3:12:55 PM

to berkeleyl...@googlegroups.com

Hmm, yeah, you might need to do some Googling there. That's a JVM bug, not an out-of-memory error.

You might try changing the garbage collection strategy, but really, your guess is as good as mine.
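
To see which collectors a given set of flags actually selects, the standard management beans can be listed from inside the VM. A sketch, with the class name ActiveCollectors invented for illustration. The crashing frame in the log above (scavenge_contents_parallel, PSPromotionManager) belongs to the parallel scavenge collector, so switching to a different collector would at least avoid that code path.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: list the garbage collectors active in the running JVM.
final class ActiveCollectors {

    /** Returns the names of the collectors registered with the platform MBean server. */
    static List<String> names() {
        List<String> result = new ArrayList<String>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            result.add(gc.getName());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String name : names()) {
            System.out.println(name);
        }
    }
}
```

Starting it once with the default flags and once with the candidate GC option shows whether the option changed anything before a multi-hour run is committed to it.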

iesus.c...@gmail.com

Aug 20, 2014, 12:35:15 PM

to berkeleyl...@googlegroups.com

I did as you said and tried several garbage collection options, and finally I found a collector that doesn't crash :D

java -server -ea -Xmx480G -XX:+UseConcMarkSweepGC -cp src/ edu.berkeley.nlp.lm.io.MakeKneserNeyLmBinaryFromGoogle /scratch/common/nobackup/calvillo/google5grams/ googleKN.binary

BUT, now I have a rather weird exception:

Exception in thread "main" java.lang.NegativeArraySizeException
    at edu.berkeley.nlp.lm.collections.LongToIntHashMap.rehash(LongToIntHashMap.java:138)
    at edu.berkeley.nlp.lm.collections.LongToIntHashMap.rehash(LongToIntHashMap.java:130)
    at edu.berkeley.nlp.lm.collections.LongToIntHashMap.put(LongToIntHashMap.java:100)
    at edu.berkeley.nlp.lm.collections.LongToIntHashMap.incrementCount(LongToIntHashMap.java:111)
    at edu.berkeley.nlp.lm.io.FirstPassCallback.call(FirstPassCallback.java:48)
    at edu.berkeley.nlp.lm.io.FirstPassCallback.call(FirstPassCallback.java:24)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.parse(KneserNeyLmReaderCallback.java:307)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.parse(KneserNeyLmReaderCallback.java:37)
    at edu.berkeley.nlp.lm.io.LmReaders.firstPassCommon(LmReaders.java:549)
    at edu.berkeley.nlp.lm.io.LmReaders.firstPassArpa(LmReaders.java:526)
    at edu.berkeley.nlp.lm.io.LmReaders.readArrayEncodedLmFromArpa(LmReaders.java:171)
    at edu.berkeley.nlp.lm.io.LmReaders.readLmFromGoogleNgramDir(LmReaders.java:224)
    at edu.berkeley.nlp.lm.io.MakeKneserNeyLmBinaryFromGoogle.main(MakeKneserNeyLmBinaryFromGoogle.java:43)

This happens in the part that handles the n-grams of order 4, while it is printing:

...
Writing line 1224670001
Writing line 1224680001
Writing line 1224690001
Writing line 1224700001
...

I checked the code and tried to figure out the reason but I have no idea. It seems that in:

private void rehash(final int length) {
    checkNotImmutable();
    long[] newKeys = new long[length]; // <-- HERE
    int[] newValues = new int[length];

it tries to build an array of negative size, but I checked and length is initialized based on the "keys" array, which of course couldn't have a negative size and is initialized in the constructor.

Do you have any idea of why this could happen?

Thank you very much!

Jesús

Adam Pauls

Aug 20, 2014, 1:17:42 PM

to berkeleyl...@googlegroups.com

That looks like it's trying to make an array with more than 2^31 - 1 elements. That's not something that's easy to fix, unfortunately. If only they had 64-bit array indexes in Java!
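
The negative size is consistent with 32-bit overflow: once a capacity computed in int arithmetic passes 2^31 - 1, it wraps around, and if the wrapped value is negative the allocation throws NegativeArraySizeException. The sketch below reproduces the effect in isolation. The growth expression capacity * 3 / 2 and the starting size are assumptions chosen for the demonstration, not code taken from LongToIntHashMap.

```java
// Minimal sketch: how a 1.5x growth step can wrap to a negative array size.
final class CapacityOverflow {

    /** Grows a capacity by 1.5x in plain int arithmetic (assumed growth rule). */
    static int naiveGrow(int capacity) {
        // capacity * 3 is evaluated as an int and silently wraps on overflow.
        return capacity * 3 / 2;
    }

    public static void main(String[] args) {
        int capacity = 1_000_000_000; // hypothetical table size, still a valid int
        int grown = naiveGrow(capacity);
        System.out.println("Grown capacity: " + grown); // negative after wrapping

        try {
            long[] keys = new long[grown];
            System.out.println("Allocated " + keys.length + " slots");
        } catch (NegativeArraySizeException e) {
            // The size check fails before any memory is requested.
            System.out.println("Caught: " + e);
        }
    }
}
```

Because the check happens before allocation, the exception says nothing about available memory; a machine with any amount of RAM fails the same way.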

iesus.c...@gmail.com

Aug 20, 2014, 3:33:36 PM

to berkeleyl...@googlegroups.com

Oh, I see. Yeah, it looks like I would need to make quite a few modifications... Alright, I guess I'll stick to order 3 or Stupid Backoff. It was worth trying anyway.

Thanks for your help! :D

Jesús

Adam Pauls

Aug 20, 2014, 4:02:36 PM

to berkeleyl...@googlegroups.com

Well, one hack you could do is find where that value is increased (I believe it grows by a factor of 1.5 whenever the table is full) and just make sure it maxes out at Integer.MAX_VALUE. You might still be able to cram everything in . . .
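
A sketch of that capping hack, assuming the table grows by a factor of 1.5. The class CappedGrowth, the method nextCapacity, and the constant MAX_CAPACITY are hypothetical names, not BerkeleyLM code; the real change would go wherever LongToIntHashMap computes its new length. The cap sits a few elements below Integer.MAX_VALUE because some VMs refuse to allocate an array of exactly Integer.MAX_VALUE elements.

```java
// Minimal sketch: grow a table capacity by 1.5x without wrapping past the int range.
final class CappedGrowth {

    // Assumption: leave a small margin below Integer.MAX_VALUE for VM array limits.
    static final int MAX_CAPACITY = Integer.MAX_VALUE - 8;

    /** Returns the next capacity: 1.5x the current one, clamped to MAX_CAPACITY. */
    static int nextCapacity(int current) {
        if (current >= MAX_CAPACITY) {
            throw new IllegalStateException("table cannot grow beyond " + MAX_CAPACITY + " slots");
        }
        // Do the multiplication in 64 bits so it cannot wrap.
        long grown = (long) current * 3L / 2L;
        if (grown <= current) {
            grown = current + 1L; // very small tables must still grow
        }
        return (int) Math.min(grown, (long) MAX_CAPACITY);
    }

    public static void main(String[] args) {
        System.out.println(nextCapacity(1_000_000_000));
        System.out.println(nextCapacity(2_000_000_000));
    }
}
```

At the cap, a long[] key array takes roughly 16 GiB and the matching int[] value array roughly 8 GiB. The table simply stops growing there, so the hack only helps if the data fits in that many slots.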

iesus.c...@gmail.com

Aug 25, 2014, 2:21:35 PM

to berkeleyl...@googlegroups.com

Thanks! I tried that and I think it should work: judging by how the array grows from one order to the next, setting the array size to the maximum should be enough for n-grams of order 4.

However, now there is another exception, which I think is the same as reported in the other thread. The error is the following:

Exception in thread "main" java.lang.AssertionError
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.getLowerOrderBackoff(KneserNeyLmReaderCallback.java:211)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.getProbBackoff(KneserNeyLmReaderCallback.java:342)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.parse(KneserNeyLmReaderCallback.java:306)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.parse(KneserNeyLmReaderCallback.java:37)
    at edu.berkeley.nlp.lm.io.LmReaders.firstPassCommon(LmReaders.java:549)
    at edu.berkeley.nlp.lm.io.LmReaders.firstPassArpa(LmReaders.java:526)
    at edu.berkeley.nlp.lm.io.LmReaders.readArrayEncodedLmFromArpa(LmReaders.java:171)
    at edu.berkeley.nlp.lm.io.LmReaders.readLmFromGoogleNgramDir(LmReaders.java:224)
    at edu.berkeley.nlp.lm.io.MakeKneserNeyLmBinaryFromGoogle.main(MakeKneserNeyLmBinaryFromGoogle.java:43)

It appears at the beginning of the "On order 3" part:

...
Writing line 314820001
Writing line 314830001
Writing line 314840001
On order 3
Writing line 1

The concrete code part is the following:

protected float getLowerOrderBackoff(final int[] ngram, final int startPos, final int endPos) {
    if (startPos == endPos) return 1.0f;
    final KneserNeyCounts counts = getCounts(ngram, startPos, endPos, true);
    final long backoffDenom = (endPos - startPos == lmOrder - 1 || ngram[startPos] == startIndex) ? counts.tokenCounts : counts.dotdotTypeCounts;
    assert backoffDenom >= 0; // <---- HERE

I haven't added the debugging code that the other user included, but it seems to be the same situation.

In my previous posts I had used the complete Google N-gram corpus and had left the variable lmOrder at 5, as it is in LmReaders.java, line 220:

if (kneserNey) {
    final int lmOrder = 5; // TODO make this not hard-coded
    KneserNeyLmReaderCallback<W> kneserNeyReader = new KneserNeyLmReaderCallback<W>(wordIndexer, lmOrder, opts);

I wanted to know at which point the system would run out of resources, which is why I tried order 5. But when I set lmOrder to 4, this error appeared. Previously the system got through the

...
Writing line 314830001
Writing line 314840001
...

part for order 3 without problems, and the array-size problem only appeared partway through the n-grams of order 4. So I guess it has something to do with how lmOrder is set and how n-grams of order n-2 and n-1 are handled. What would be the correct way to set lmOrder if I want a language model of order 4? I assume it should be 4, but now I'm not so sure: 3, or maybe 5? Did you find a solution for the problem in the other thread?

Thanks again! :D

Adam Pauls

Sep 7, 2014, 3:02:24 PM

to berkeleyl...@googlegroups.com

I'm afraid I don't totally follow. Any chance that you can find a way for me to reproduce the problem? I guess you'd need to give me lots of data?

iesus.c...@gmail.com

Sep 7, 2014, 6:39:21 PM

to berkeleyl...@googlegroups.com

Hi Adam,

Actually, there was a problem with my svn repository, so by mistake the n-gram order was not set to 4. I'm currently running one last trial to build the language model, and hopefully it will be OK.

More concretely, I think the very last problem happened because the system was set up for 5-grams, but the directory no longer contained any 5-gram files.
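
A mismatch like this can be caught before a long run by checking which order directories exist. The sketch assumes the layout implied by the path in the crash log earlier in the thread (a 4gms subdirectory holding the 4-gram files, and by extension 1gms, 2gms, and so on). NgramDirCheck and highestOrderPresent are hypothetical names, not part of BerkeleyLM.

```java
import java.io.File;

// Minimal sketch: find the highest n-gram order present in a corpus directory.
final class NgramDirCheck {

    /** Returns the largest N up to maxOrderToProbe for which root/<N>gms is a directory, or 0. */
    static int highestOrderPresent(File root, int maxOrderToProbe) {
        int highest = 0;
        for (int order = 1; order <= maxOrderToProbe; order++) {
            if (new File(root, order + "gms").isDirectory()) {
                highest = order;
            }
        }
        return highest;
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        // Hypothetical: the order the model builder is configured for (lmOrder).
        int configuredOrder = args.length > 1 ? Integer.parseInt(args[1]) : 4;
        int found = highestOrderPresent(root, 9);
        System.out.println("Highest order directory found: " + found);
        if (found != configuredOrder) {
            System.err.println("Warning: configured order " + configuredOrder
                    + " does not match the data in " + root);
        }
    }
}
```

The check only looks at directory names, so it says nothing about whether the files inside are complete.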

I will update you with the result of the last trial, sorry for not updating this before.

Kind Regards

Jesús

iesus.c...@gmail.com

Sep 16, 2014, 11:59:34 AM

to berkeleyl...@googlegroups.com

Hi,

So, just to let you know, I was able to get the language model after all! :D These were the settings I used:

java -server -ea -Xmx480G -XX:+UseConcMarkSweepGC -cp src/ edu.berkeley.nlp.lm.io.MakeKneserNeyLmBinaryFromGoogle googleNgramcorpusDir outputFile

Beyond that, it is only necessary to set the n-gram order to 4 and apply the trick of capping the maximum array size.