memory consumption

Oren Melamud

Jul 18, 2014, 2:45:49 PM

to berkeleyl...@googlegroups.com

Hi Adam,

I've used your LM toolkit in the past and it was very helpful. Thanks for sharing this!

This time I'm trying to train a 5-gram Kneser-Ney LM on a larger corpus of over 2 billion words.
I'm running this on a Linux machine with 48 GB of memory allocated for the task, as follows:
java -ea -Xmx48000m -server -cp berkeleylm.jar edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText 5 ukwac.5gram.arpa ukwac.txt

Unfortunately, I get an out-of-memory error after reading only about 10% of the lines in the corpus:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

Any ideas what I could be doing wrong?

Thanks,
Oren.

Oren Melamud

Jul 21, 2014, 8:39:05 AM

to berkeleyl...@googlegroups.com

Update:

I tried this with 3-grams and managed to build the LM using about 60 GB of RAM.
These are the n-gram counts I got, which seem pretty high relative to the counts you report for WMT2010 in your paper.
ngram 1=100004
ngram 2=50032413
ngram 3=363244566

Any ideas how to build this LM with 4-grams or 5-grams without increasing RAM requirements?
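
For a rough sense of scale, the figures above can be turned into a back-of-envelope estimate. This is only a sketch: it assumes the 60 GB is peak process memory, treats 1 GB as 10^9 bytes, and simply divides by the final n-gram counts, so it says nothing about how Berkeley LM actually lays out memory during training.

```python
# Back-of-envelope: peak training memory per stored n-gram, using the
# counts reported in this post. Assumes 1 GB = 10**9 bytes (an assumption).
counts = {1: 100_004, 2: 50_032_413, 3: 363_244_566}
peak_bytes = 60 * 10**9  # "about 60 GB of RAM"

total_ngrams = sum(counts.values())
bytes_per_ngram = peak_bytes / total_ngrams

print(f"total n-grams: {total_ngrams:,}")
print(f"approx. bytes per n-gram during training: {bytes_per_ngram:.0f}")
```

Under these assumptions the build used on the order of 145 bytes per n-gram. The number of distinct 4-grams and 5-grams in a corpus this size is likely to be larger still, which is consistent with the 5-gram run exhausting a 48 GB heap early on.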

Adam Pauls

Jul 21, 2014, 10:55:20 AM

to berkeleyl...@googlegroups.com

Unfortunately, building a KN LM is very memory intensive in Berkeley LM. The numbers you give seem high, but not all that high. However, the final model after training should be compact.

One option is to build with SRILM (which uses disk instead of memory) and see if the counts are similar.

Sorry, I'm sure that's not the most satisfactory answer.
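
The count comparison suggested above could be done by reading the `\data\` header that ARPA files start with. The sketch below is one way to do it; the function names `read_arpa_counts` and `compare_counts` are made up for illustration and are not part of Berkeley LM or SRILM.

```python
import re

def read_arpa_counts(lines):
    """Return {order: count} parsed from the data header of an ARPA file."""
    counts = {}
    in_header = False
    for line in lines:
        line = line.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if not in_header:
            continue
        m = re.fullmatch(r"ngram\s+(\d+)\s*=\s*(\d+)", line)
        if m:
            counts[int(m.group(1))] = int(m.group(2))
        elif line.startswith("\\"):
            break  # a section marker such as \1-grams: ends the header
    return counts

def compare_counts(a, b):
    """Return {order: (count_a, count_b, count_b / count_a)} for shared orders."""
    return {n: (a[n], b[n], b[n] / a[n]) for n in sorted(a.keys() & b.keys()) if a[n]}

# Header lines built from the counts reported earlier in this thread.
header = ["\\data\\", "ngram 1=100004", "ngram 2=50032413",
          "ngram 3=363244566", "", "\\1-grams:"]
print(read_arpa_counts(header))
# → {1: 100004, 2: 50032413, 3: 363244566}
```

For real files, an open file object can be passed in place of the list, since only the first few lines are read before the loop stops. Calling `compare_counts` on the two resulting dictionaries then gives the per-order ratio between the two toolkits.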

--
You received this message because you are subscribed to the Google Groups "berkeleylm-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to berkeleylm-disc...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Oren Melamud

Jul 21, 2014, 12:05:28 PM

to berkeleyl...@googlegroups.com

Ok. Thanks for the quick reply.

