Berkeley aligner and thrax


Mojtaba Sabbagh

Sep 2, 2015, 6:05:33 PM
to Joshua Technical Support
When I use the Berkeley aligner, Thrax can't build the grammar. It runs for several hours, but the output is:

[thrax-run] rebuilding...

  dep=/media/new300/smt/models/en-fa/2/data/train/thrax-input-file [CHANGED]

  dep=thrax-hiero.conf [CHANGED]

  dep=grammar.gz [NOT FOUND]

  cmd=/usr/local/hadoop/bin/hadoop jar /usr/joshua-v6.0.4/thrax/bin/thrax.jar -D mapred.child.java.opts='-Xmx2g' thrax-hiero.conf /pipeline-en-fa-hiero-_media_new300_smt_models_en-fa_2 > thrax.log 2>&1; rm -f grammar grammar.gz; /usr/local/hadoop/bin/hadoop fs -getmerge /pipeline-en-fa-hiero-_media_new300_smt_models_en-fa_2/final/ grammar.gz

  JOB FAILED (return code 1)

15/09/03 02:24:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

getmerge: `/pipeline-en-fa-hiero-_media_new300_smt_models_en-fa_2/final/': No such file or directory


The following are the last lines of thrax.log:

15/09/03 02:24:42 INFO mapred.LocalJobRunner: reduce task executor complete.

[SCHED] class edu.jhu.thrax.hadoop.features.mapred.TargetPhraseGivenSourceFeature in state SUCCESS

class edu.jhu.thrax.hadoop.jobs.OutputJob PREREQ_FAILED

class edu.jhu.thrax.hadoop.jobs.SourceWordGivenTargetWordProbabilityJob SUCCESS

class edu.jhu.thrax.hadoop.features.mapred.SourcePhraseGivenTargetFeature FAILED

class edu.jhu.thrax.hadoop.features.annotation.AnnotationFeatureJob SUCCESS

class edu.jhu.thrax.hadoop.jobs.ExtractionJob SUCCESS

class edu.jhu.thrax.hadoop.jobs.VocabularyJob SUCCESS

class edu.jhu.thrax.hadoop.jobs.TargetWordGivenSourceWordProbabilityJob SUCCESS

class edu.jhu.thrax.hadoop.features.mapred.TargetPhraseGivenSourceFeature SUCCESS


So it looks like the SourcePhraseGivenTargetFeature job failed, which made OutputJob fail as a prerequisite; that would explain why the final/ directory was never created for getmerge to find. I don't have this problem with GIZA++.

Matt Post

Sep 3, 2015, 9:24:45 AM
to joshua_...@googlegroups.com
Hmm; that doesn't make much sense. Looking at the code, I don't see how you could get 

"... -getmerge /pipeline..."

What are the file sizes of

data/train/thrax-input-file
data/train/corpus.*
alignments/training.align

when using the Berkeley aligner?

matt



Mojtaba Sabbagh

Sep 3, 2015, 2:52:11 PM
to Joshua Technical Support
I think you mean the "/" at the beginning of /pipeline...
Yes, the pipeline script doesn't generate the leading "/", but Hadoop wouldn't put the files into its file system without that starting "/", so I had to add the "/" at the beginning. With it, everything works properly.
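
Without the leading "/", Hadoop resolves the path relative to the HDFS home directory (usually /user/<username>; the <username> part is just a placeholder here), so the same name points at two different locations:

hadoop fs -ls pipeline-en-fa-hiero-_media_new300_smt_models_en-fa_2   (relative: under /user/<username>)

hadoop fs -ls /pipeline-en-fa-hiero-_media_new300_smt_models_en-fa_2  (absolute: under the HDFS root)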

-rw-r--r-- 1 *** ***   225052027 Sep  1 19:03 train/thrax-input-file (225MB)

lrwxrwxrwx 1 *** ***          18 Sep  3 12:20 corpus.en -> train.tok.50.lc.en 

lrwxrwxrwx 1 *** ***          18 Sep  3 12:20 corpus.fa -> train.tok.50.lc.fa

-rw-r--r-- 1 *** ***    60870747 Sep  1 14:41 train.tok.50.lc.en (60MB)

-rw-r--r-- 1 *** ***   104482582 Sep  1 14:41 train.tok.50.lc.fa (104MB)

-rw-r--r-- 1 *** ***    51882540 Sep  1 19:03 training.align     (51MB)


There is only 25 GB of space available on my hard disk partition. Could that be the cause?


Regards,

Mojtaba

Matt Post

Sep 3, 2015, 2:56:04 PM
to joshua_...@googlegroups.com
Yes, I meant the /.

And sorry, I meant to ask for the number of lines in those files, not the file sizes.
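
Something like this should give them (same paths as before, relative to your run directory):

wc -l data/train/thrax-input-file data/train/corpus.* alignments/training.align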

If you're running a rolled-out Hadoop cluster, then yes, 25 GB may be too small. If you do something like

hadoop fs -ls

do you see old run directories that you could wipe out first? Perhaps the problem is that you ran the Berkeley aligner second.
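
For example, to wipe a stale run directory (the path here is just the one from your log; on older Hadoop 1.x installs the command is hadoop fs -rmr):

hadoop fs -rm -r /pipeline-en-fa-hiero-_media_new300_smt_models_en-fa_2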

matt

Mojtaba Sabbagh

Sep 3, 2015, 4:40:49 PM
to Joshua Technical Support
I formatted the Hadoop file system and also made room on the disk partition, so now I am sure there is enough space, but I get the same error.

I executed this command: hadoop fs -ls /

drwxr-xr-x   - *** supergroup          0 2015-09-04 00:17 /pipeline-en-fa-hiero-_media_new300_smt_models_en-fa_2

It shows only the new run directory, so the first run was wiped out.


And the line counts:

977040 2/data/train/thrax-input-file

977042 2/data/train/train.tok.50.lc.en

977042 2/data/train/train.tok.50.lc.fa

977042 2/alignments/training.align

Matt Post

Sep 4, 2015, 2:41:27 PM
to joshua_...@googlegroups.com
Okay, that all looks correct. What were the exact commands you used to invoke the pipeline for the GIZA++ and Berkeley alignment runs?

matt

Mojtaba Sabbagh

Sep 6, 2015, 9:41:12 AM
to Joshua Technical Support
I changed the data and ran the pipeline again with these two commands:

run 1:
nohup $JOSHUA/bin/pipeline.pl --rundir 1 --readme "Baseline" --joshua-mem 32g --hadoop-mem 32g --source en --target fa --type hiero --corpus input/train --tune input/tune --test input/test &

run 2:
nohup $JOSHUA/bin/pipeline.pl --rundir 2 --readme "Baseline aligner berkeley" --joshua-mem 32g --hadoop-mem 32g --aligner berkeley --source en --target fa --type hiero --corpus input/train --tune input/tune --test input/test &

The first run went completely right and I got a BLEU of 0.1139. The second run got past grammar generation (which took around 1h...), but tuning took only 5s and the BLEU is awful (0.0007). I removed 2/tune/joshua.config.final and launched the pipeline again, but it didn't help. For now I've skipped over that and am trying the jacana aligner. The commands I used before were the same.

Mojtaba

Matt Post

Sep 7, 2015, 2:20:29 AM
to joshua_...@googlegroups.com
One thing I've seen with lower-resource languages is that the Berkeley aligner generates much more sensible alignments (for example, it's less prone to the "garbage collection" effect, where rare words attract spurious alignments, because of its joint learning objective), and therefore permits much larger grammars to be extracted. I suspect this is part of your problem. What is the source of the training data you are using?
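
A quick way to check is to compare the sizes of the two extracted grammars (assuming the standard rundir layout, so they end up at 1/grammar.gz and 2/grammar.gz):

zcat 1/grammar.gz | wc -l

zcat 2/grammar.gz | wc -l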

How big are these directories? If they're not too big, you could tar them up and share them with me (not over email, please!) and I will take a look.

matt