Pipeline problem


Mohamed EL MAROUANI

Jan 13, 2016, 7:55:04 PM
to joshua_...@googlegroups.com
Hi,
To continue my experiments, I ran the Joshua pipeline on a set of corpora (TED datasets) that I prepared in a directory, as shown below:

elma@VBUbuntu:~/models/en-fr/input$ ls
dev2010.en-fr.en  dev2010.en-fr.fr  train.en-fr.en  train.en-fr.fr  tst2010.en-fr.en  tst2010.en-fr.fr

When I run the following pipeline command:

$JOSHUA/scripts/training/pipeline.pl  --rundir 1  --type hiero  --lm-gen berkeleylm --lm berkeleylm --corpus input/train.en-fr  --tune input/dev2010.en-fr  --test input/tst2010.en-fr  --source en  --target fr
 

the script fails at step [giza-0] with these messages:

[giza-0] rebuilding...
  dep=/home/elma/models/en-fr/1/data/train/splits/corpus.en.0 [CHANGED]
  dep=/home/elma/models/en-fr/1/data/train/splits/corpus.fr.0 [CHANGED]
  dep=alignments/0/model/aligned.grow-diag-final [NOT FOUND]
  cmd=rm -f alignments/0/corpus.0-0.*; /home/elma/git/joshua/scripts/training/run-giza.pl --root-dir alignments/0 -e fr.0 -f en.0 -corpus /home/elma/models/en-fr/1/data/train/splits/corpus -merge grow-diag-final -parallel > alignments/0/giza.log 2>&1
  JOB FAILED (return code 127)
[aligner-combine] rebuilding...
  dep=alignments/0/model/aligned.grow-diag-final [NOT FOUND]
  dep=alignments/training.align [NOT FOUND]
  cmd=cat alignments/0/model/aligned.grow-diag-final > alignments/training.align
  JOB FAILED (return code 1)
cat: alignments/0/model/aligned.grow-diag-final: No such file or directory


I checked for these files, and they really were not generated.
Is this another problem related to compiling GIZA++, as with KenLM? Please help.

Thanks!

--
Mohamed 

Matt Post

Jan 14, 2016, 2:22:35 PM
to joshua_...@googlegroups.com
I'm guessing that GIZA didn't compile. What version of Joshua are you using? Are you using a release version or the github version? Do you see the binaries

GIZA++
snt2cooc.out
mkcls

in $JOSHUA/bin?



Mohamed EL MAROUANI

Jan 14, 2016, 3:50:06 PM
to joshua_...@googlegroups.com
I am using version 6.0.5, taken from the GitHub repository.

The binaries GIZA++ and snt2cooc.out are in $JOSHUA/bin/giza-pp/GIZA++-v2,
and mkcls is in $JOSHUA/bin/giza-pp/mkcls-v2.

Matt Post

Jan 14, 2016, 3:53:34 PM
to joshua_...@googlegroups.com
The binaries should be directly in $JOSHUA/bin:

$JOSHUA/bin/mkcls
$JOSHUA/bin/GIZA++
$JOSHUA/bin/snt2cooc.out

Is this the case?
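
One way to check this is sketched below. It is only an illustration: check_giza_binaries is a hypothetical helper, not something that ships with Joshua, and it assumes JOSHUA points at the installation root.

```shell
#!/usr/bin/env bash
# check_giza_binaries is a hypothetical helper, not part of Joshua.
# It reports whether each aligner binary is present and executable
# directly under the given bin directory.
check_giza_binaries() {
    local bin_dir="$1" missing=0 name
    for name in GIZA++ snt2cooc.out mkcls; do
        if [ -x "$bin_dir/$name" ]; then
            echo "found:   $bin_dir/$name"
        else
            echo "missing: $bin_dir/$name"
            missing=1
        fi
    done
    # Non-zero status if at least one binary is missing.
    return "$missing"
}

# Usage (assumes JOSHUA is set to the installation root):
#   check_giza_binaries "$JOSHUA/bin"
```

A binary that sits in a subdirectory such as bin/giza-pp/GIZA++-v2 is reported as missing, which matches the requirement that the files be directly in the bin directory.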

If you're having trouble with GIZA, you could use the Berkeley aligner (pass "--aligner berkeley" to the pipeline).

matt

Mohamed EL MAROUANI

Jan 15, 2016, 8:03:47 PM
to joshua_...@googlegroups.com

I rebuilt Joshua and found that the three binaries were in the wrong location because the automatic build option in Eclipse overwrites the bin directory.

The bin directory now looks like this:

elma@VBUbuntu:~/git/joshua/bin$ ls
bleu  decoder  extract-1best  GIZA++  joshua-decoder  meteor  mkcls  pipeline.pl  snt2cooc.out

But the pipeline command still fails at the same step, with a different return code:

[giza-0] rebuilding...
  dep=/home/elma/models/en-fr/1/data/train/splits/corpus.en.0 [CHANGED]
  dep=/home/elma/models/en-fr/1/data/train/splits/corpus.fr.0 [CHANGED]
  dep=alignments/0/model/aligned.grow-diag-final [NOT FOUND]
  cmd=rm -f alignments/0/corpus.0-0.*; /home/elma/git/joshua/scripts/training/run-giza.pl --root-dir alignments/0 -e fr.0 -f en.0 -corpus /home/elma/models/en-fr/1/data/train/splits/corpus -merge grow-diag-final -parallel > alignments/0/giza.log 2>&1
  JOB FAILED (return code 2)
[aligner-combine] rebuilding...
  dep=alignments/0/model/aligned.grow-diag-final [NOT FOUND]
  dep=alignments/training.align [NOT FOUND]
  cmd=cat alignments/0/model/aligned.grow-diag-final > alignments/training.align
  JOB FAILED (return code 1) 


On the other hand, I ran the pipeline with the Berkeley aligner, as recommended. It failed at the thrax-run step, as follows:


[thrax-run] rebuilding...
  dep=/home/elma/models/en-fr/1/data/train/thrax-input-file [CHANGED]
  dep=thrax-hiero.conf [CHANGED]
  dep=grammar.gz [NOT FOUND]
  cmd=hadoop/bin/hadoop jar /home/elma/git/joshua/thrax/bin/thrax.jar -D mapred.child.java.opts='-Xmx2g' -D hadoop.tmp.dir=/tmp thrax-hiero.conf pipeline-en-fr-hiero-_home_elma_models_en-fr_1 > thrax.log 2>&1; rm -f grammar grammar.gz; hadoop/bin/hadoop fs -getmerge pipeline-en-fr-hiero-_home_elma_models_en-fr_1/final/ grammar.gz
  JOB FAILED (return code 1)
16/01/15 23:40:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
getmerge: `pipeline-en-fr-hiero-_home_elma_models_en-fr_1/final/': No such file or directory


--
Mohamed

Matt Post

Jan 20, 2016, 10:35:24 AM
to joshua_...@googlegroups.com
So GIZA is still failing. You can look at .cachepipe/giza-0/err or alignments/0/giza.log to see what errors it gives you.
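
A minimal sketch for looking at both files from the run directory follows. show_giza_logs is a hypothetical helper name, and the choice of 20 lines is arbitrary.

```shell
#!/usr/bin/env bash
# show_giza_logs is a hypothetical helper, not part of Joshua.
# It prints the tail of the two GIZA log files named above, relative to
# the pipeline run directory (default: the current directory).
show_giza_logs() {
    local run_dir="${1:-.}" log
    for log in "$run_dir/.cachepipe/giza-0/err" "$run_dir/alignments/0/giza.log"; do
        if [ -s "$log" ]; then
            echo "== last 20 lines of $log"
            tail -n 20 "$log"
        else
            echo "== $log is missing or empty"
        fi
    done
}

# Usage, from inside the run directory (for example the directory given to --rundir):
#   show_giza_logs .
```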

It looks like the Berkeley aligner worked. Do you have a file alignments/training.align that is the same length as the corpora in data/train/corpus.*?
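
The length comparison can be done with wc, roughly as in the sketch below. The helper names are hypothetical, and the usage line assumes the corpus files are data/train/corpus.en and data/train/corpus.fr, which follows from the --source en and --target fr options but is not confirmed in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of Joshua.

# Print only the number of lines in a file.
count_lines() { wc -l < "$1" | tr -d '[:space:]'; }

# check_alignment_length ALIGN_FILE CORPUS_FILE...
# Succeeds only if every corpus file has as many lines as the alignment file.
check_alignment_length() {
    local align="$1" expected actual f status=0
    shift
    expected=$(count_lines "$align")
    echo "$align: $expected lines"
    for f in "$@"; do
        actual=$(count_lines "$f")
        if [ "$actual" -ne "$expected" ]; then
            echo "MISMATCH $f: $actual lines (expected $expected)"
            status=1
        else
            echo "ok $f: $actual lines"
        fi
    done
    return "$status"
}

# Usage, from inside the run directory (file names are an assumption):
#   check_alignment_length alignments/training.align data/train/corpus.en data/train/corpus.fr
```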

Now model extraction is failing. Do you have a Hadoop cluster? If not, the pipeline tries to build one for you, and perhaps that failed.

Mohamed EL MAROUANI

Jan 20, 2016, 2:38:36 PM
to joshua_...@googlegroups.com
Hi Matt,

To rule out a possible version problem, I used the stable release downloaded from http://joshua-decoder.org/releases/current/ instead of the GitHub master branch.

First, I found that the problem with GIZA++ still persists. After analyzing the log file, I found that it is a problem with the symal package, as shown below:

[.../alignments/0/giza.log]
....
Executing: bash -c rm -f alignments/0/giza.fr.0-en.0/fr.0-en.0.A3.final.gz
Executing: bash -c gzip alignments/0/giza.fr.0-en.0/fr.0-en.0.A3.final
Waiting for second GIZA process...
(3) generate word alignment @ Sat Jan 16 00:36:15 GMT 2016
Combining forward and inverted alignment from files:
  alignments/0/giza.en.0-fr.0/en.0-fr.0.A3.final.{bz2,gz}
  alignments/0/giza.fr.0-en.0/fr.0-en.0.A3.final.{bz2,gz}
Executing: bash -c mkdir -p alignments/0/model
Executing: bash -c /home/elma/git/joshua/scripts/training/symal/giza2bal.pl -d <(gzip -cd alignments/0/giza.fr.0-en.0/fr.0-en.0.A3.final.gz) -i <(gzip -cd alignments/0/giza.en.0-fr.0/en.0-fr.0.A3.final.gz) |/home/elma/git/joshua/scripts/training/symal/symal -alignment="grow" -diagonal="yes" -final="yes" -both="no" -o=alignments/0/model/aligned.grow-diag-final
bash: /home/elma/git/joshua/scripts/training/symal/symal: No such file or directory
bash: /home/elma/git/joshua/scripts/training/symal/giza2bal.pl: No such file or directory
Exit code: 127
ERROR: Can't generate symmetrized alignment file
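
Exit code 127 in this log is the status bash reports when the command it was asked to run cannot be found, which agrees with the two "No such file or directory" lines. The short sketch below reproduces that behaviour with a deliberately nonexistent path; exit_status_of is a hypothetical helper used only for the demonstration.

```shell
#!/usr/bin/env bash
# exit_status_of is a hypothetical helper for demonstration only.
# It runs a command line in a child bash and prints the exit status.
# bash uses 127 for "command not found" and 126 for "found but not executable".
exit_status_of() {
    local status=0
    bash -c "$1" >/dev/null 2>&1 || status=$?
    echo "$status"
}

exit_status_of '/no/such/dir/symal'   # nonexistent path: prints 127
exit_status_of 'true'                 # command exists and succeeds: prints 0
```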


The Berkeley aligner works fine, but the pipeline fails in later steps:
- The problem with the run in my previous mail was an OutOfMemory error (my virtual machine ran out of RAM).
- I am continuing my experiments, but I still run into problems: either an OutOfMemory exception or this NullPointerException:

[.../tune/joshua.log]
Input 0: <s> what i 'm going to show you first , as quickly as i can , is some foundational work , some new technology that we brought to microsoft as part of an acquisition almost exactly a year ago . this is seadragon , </s>
Input 0: Collecting options took 0.000 seconds
Input 0: FATAL UNCAUGHT EXCEPTION: null
java.lang.NullPointerException
        at joshua.decoder.phrase.Candidate.score(Candidate.java:214)
        at joshua.decoder.phrase.Candidate.compareTo(Candidate.java:136)
        at joshua.decoder.phrase.Candidate.compareTo(Candidate.java:19)
        at java.util.HashMap.compareComparables(HashMap.java:371)
        at java.util.HashMap$TreeNode.treeify(HashMap.java:1920)
        at java.util.HashMap.treeifyBin(HashMap.java:771)
        at java.util.HashMap.putVal(HashMap.java:643)
        at java.util.HashMap.put(HashMap.java:611)
        at java.util.HashSet.add(HashSet.java:219)
        at joshua.decoder.phrase.Stack.addCandidate(Stack.java:125)
        at joshua.decoder.phrase.Stacks.search(Stacks.java:166)
        at joshua.decoder.DecoderThread.translate(DecoderThread.java:113)
        at joshua.decoder.Decoder$DecoderThreadRunner.run(Decoder.java:218)
 

I used --type {phrase,moses} in my latest experiments to avoid the OutOfMemory problem, but that triggers the NullPointerException above.

Finally, I decided to migrate to CentOS to avoid these problems with building Joshua's C++ components. I installed the OS and started preparing the environment today.

Thank you!

Mohamed EL MAROUANI

Jan 22, 2016, 6:26:42 AM
to joshua_...@googlegroups.com

As a follow-up to our discussion, I noticed that $JOSHUA/scripts/training/run-giza.pl defines these two variables:

my $SYMAL = "$JOSHUA/scripts/training/symal/symal";
my $GIZA2BAL = "$JOSHUA/scripts/training/symal/giza2bal.pl";

These two files do not exist under $JOSHUA/scripts/training/.

Can you please explain this issue? Thanks!

--
Mohamed

Matt Post

Jan 22, 2016, 9:00:45 AM
to joshua_...@googlegroups.com
Ah; these were recently moved to $JOSHUA/src. Updating the paths, or creating a symlink, should fix this.

my $SYMAL = "$JOSHUA/scripts/src/symal/symal";
my $GIZA2BAL = "$JOSHUA/scripts/src/symal/giza2bal.pl";

This will be fixed in the next release of Joshua.
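
For the symlink route, a cautious sketch is given below. link_symal is a hypothetical helper; the new location is passed in as an argument, and the usage line simply reuses the path from the snippet above, so confirm where the symal directory actually lives in your checkout before running it.

```shell
#!/usr/bin/env bash
# link_symal is a hypothetical helper, not part of Joshua.
# link_symal OLD_DIR NEW_DIR
# Creates a symlink at OLD_DIR (where run-giza.pl looks) pointing to NEW_DIR
# (where the symal files now live). It refuses to overwrite anything.
link_symal() {
    local old_dir="$1" new_dir="$2"
    if [ ! -d "$new_dir" ]; then
        echo "not found: $new_dir" >&2
        return 1
    fi
    if [ -e "$old_dir" ] || [ -L "$old_dir" ]; then
        echo "already exists: $old_dir" >&2
        return 1
    fi
    ln -s "$new_dir" "$old_dir"
}

# Usage (the second path is an assumption; adjust it to your checkout):
#   link_symal "$JOSHUA/scripts/training/symal" "$JOSHUA/scripts/src/symal"
```

The symlink leaves run-giza.pl untouched, so nothing needs to be reverted when a release with corrected paths is installed; the link can simply be removed.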