Re: [joshua-support] About OutOfMemoryError during SAMT grammar extraction


Matt Post

Mar 6, 2013, 9:38:48 AM
to joshua_...@googlegroups.com
How are you increasing the memory for Hadoop? 

Also, are you using a proper distributed Hadoop installation, or the standalone version? 


On Mar 6, 2013, at 3:36 AM, "wsk...@yahoo.cn" <wsk...@yahoo.cn> wrote:

Hi, 

I am trying to build a SAMT system on training data of about 2 million sentence pairs. The average sentence length is about 13 words and 15 words for the two languages, respectively. With this corpus, I run into an OutOfMemoryError during grammar extraction. The error output is as follows:

13/03/07 04:13:48 INFO mapred.MapTask: record buffer = 262144/327680
13/03/07 04:13:48 INFO compress.CodecPool: Got brand-new decompressor
13/03/07 04:13:48 INFO mapred.MapTask: io.sort.mb = 100
13/03/07 04:13:52 WARN mapred.LocalJobRunner: job_local_0005
java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:781)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:524)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
13/03/07 04:13:55 INFO mapred.LocalJobRunner:
13/03/07 04:13:55 INFO mapred.LocalJobRunner:
13/03/07 04:13:55 INFO mapred.LocalJobRunner:


After this error the pipeline keeps running, but then hits another OutOfMemoryError:

13/03/07 04:14:44 INFO mapred.LocalJobRunner:
13/03/07 04:14:46 WARN mapred.LocalJobRunner: job_local_0004
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at edu.jhu.thrax.hadoop.datatypes.AlignmentArray.readFields(AlignmentArray.java:47)
    at edu.jhu.thrax.hadoop.datatypes.RuleWritable.readFields(RuleWritable.java:99)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.io.SequenceFile$Reader.deserializeKey(SequenceFile.java:2102)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2068)
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:68)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
13/03/07 04:14:50 INFO mapred.LocalJobRunner:
13/03/07 04:14:50 INFO mapred.LocalJobRunner:


I tried to solve this by increasing the memory available to Hadoop from 2g to 70g and then to 130g, but the pipeline still fails to produce the SAMT grammar because of this error.

Is memory really the cause of this error? If so, can I resolve it without buying more RAM?

Thank you very much! 

Best.
wsknow




Matt Post

Mar 6, 2013, 9:20:55 PM
to joshua_...@googlegroups.com
1. The --hadoop-mem flag specifies the amount of memory given to each Hadoop mapper (it is passed as an argument to mapred.child.java.opts). It defaults to 2 GB, which is almost always enough. You may need to go as high as 4 GB, but I have never had problems with 2.
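For example, a minimal sketch of what that looks like in practice (the other pipeline flags are elided; the 4 GB value and the -Xmx form of the option are illustrative assumptions, not something stated in this thread):

    # raise the per-mapper limit from the 2 GB default to 4 GB
    $JOSHUA/scripts/training/pipeline.pl ... --hadoop-mem 4g
    # the pipeline hands this value to Hadoop as the per-task JVM option,
    # i.e. presumably something like: mapred.child.java.opts = -Xmx4g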

2. The extractions we've done used a Hadoop cluster spread over 10 machines with terabytes of disk space. Your extraction job will easily consume a terabyte of temporary disk space; do you have that much? I have never pushed a job of that size through in standalone mode; I imagine it would easily take a week, and I would be surprised if it finished at all. I'd be happy to hear if you have success with it.
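A quick way to sanity-check the disk-space question, assuming the pipeline's run directory (where Hadoop also writes its temporary files) sits on the filesystem in question; the path below is only a placeholder:

    # free space on the filesystem holding the pipeline's working directory
    df -h /path/to/pipeline/rundir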

matt


On Mar 6, 2013, at 8:01 PM, "wsk...@yahoo.cn" <wsk...@yahoo.cn> wrote:

I set the --hadoop-mem parameter to 130g for pipeline.pl. It should be the standalone version, because I do not use any cluster setting. My command line is as follows:

$JOSHUA/scripts/training/pipeline.pl --corpus corpus/train --tune ./tuning --test input/test --source ch --target en --alignment corpus/train.align --hadoop-mem 130g --tuner mert --lmfile ./bigen.lm5 --no-corpus-lm --no-mbr --joshua-config joshua.conf --first-step THRAX --type samt --parsed-corpus corpus/train.parse --thrax-conf thrax-samt.conf
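Note that the traces above come from mapred.LocalJobRunner, i.e. Hadoop's standalone mode, where map tasks run inside the single JVM that submits the job; in that setup the per-child setting behind --hadoop-mem may never take effect. A rough sketch of raising the submitting JVM's heap instead, assuming the Thrax job is ultimately launched through bin/hadoop (the 8 GB figure is only an illustration):

    # client-side heap settings, in the hadoop-env.sh style; these affect the
    # JVM that LocalJobRunner runs in, rather than any forked child tasks
    export HADOOP_HEAPSIZE=8000                # bin/hadoop's -Xmx, in MB
    export HADOOP_CLIENT_OPTS="-Xmx8g"         # picked up by client commands such as 'hadoop jar'
    # then re-run the pipeline.pl command above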

I remember you mentioned that you have used a corpus of about 2 million sentence pairs with Joshua. How much memory did you use in that setting? Did you use the default settings during grammar extraction?

Thank you very much.

Best.



On Wednesday, March 6, 2013 at 10:38:48 PM UTC+8, Matt Post wrote: