Error in Training a model

James Johnson

Mar 30, 2015, 6:39:05 AM
to joshua_...@googlegroups.com
Command : JOSHUA/bin/pipeline.pl --rundir 1 --corpus input/train --tune input/tune --test input/test --aligner berkeley  --lm-gen srilm --lm-order 3 --source en --target hi

version : joshua v6.0.1

Size of the data set:

   Lines     Input file
   268000    train.en
   268000    train.hi
   5000      test.en
   5000      test.hi
   1000      tune.en
   1000      tune.hi

Output :

source en --target hi
[train-copy-hi] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/input/train.hi [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/train.hi.gz [NOT FOUND]
  cmd=cat /home/smt/HindiMachineTranslationSystem/input/train.hi | gzip -9n > /home/smt/HindiMachineTranslationSystem/1/data/train/train.hi.gz
  took 16 seconds (16s)
[train-copy-en] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/input/train.en [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/train.en.gz [NOT FOUND]
  cmd=cat /home/smt/HindiMachineTranslationSystem/input/train.en | gzip -9n > /home/smt/HindiMachineTranslationSystem/1/data/train/train.en.gz
  took 3 seconds (3s)
[train-tokenize-hi] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/train.hi.gz [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/train.tok.hi.gz [NOT FOUND]
  cmd=/home/smt/joshua-v6.0.1/scripts/training/scat /home/smt/HindiMachineTranslationSystem/1/data/train/train.hi.gz | /home/smt/joshua-v6.0.1/scripts/training/normalize-punctuation.pl hi | /home/smt/joshua-v6.0.1/scripts/training/penn-treebank-tokenizer.perl -l hi 2> /dev/null | gzip -9n > /home/smt/HindiMachineTranslationSystem/1/data/train/train.tok.hi.gz
  took 20 seconds (20s)
[train-tokenize-en] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/train.en.gz [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/train.tok.en.gz [NOT FOUND]
  cmd=/home/smt/joshua-v6.0.1/scripts/training/scat /home/smt/HindiMachineTranslationSystem/1/data/train/train.en.gz | /home/smt/joshua-v6.0.1/scripts/training/normalize-punctuation.pl en | /home/smt/joshua-v6.0.1/scripts/training/penn-treebank-tokenizer.perl -l en 2> /dev/null | gzip -9n > /home/smt/HindiMachineTranslationSystem/1/data/train/train.tok.en.gz
  took 13 seconds (13s)
[train-trim] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/train.tok.hi.gz [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/train.tok.en.gz [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/train.tok.50.hi.gz [NOT FOUND]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/train.tok.50.en.gz [NOT FOUND]
  cmd=paste <(gzip -cd /home/smt/HindiMachineTranslationSystem/1/data/train/train.tok.hi.gz) <(gzip -cd /home/smt/HindiMachineTranslationSystem/1/data/train/train.tok.en.gz) | /home/smt/joshua-v6.0.1/scripts/training/trim_parallel_corpus.pl 50 | /home/smt/joshua-v6.0.1/scripts/training/split2files.pl /home/smt/HindiMachineTranslationSystem/1/data/train/train.tok.50.hi.gz /home/smt/HindiMachineTranslationSystem/1/data/train/train.tok.50.en.gz
  took 10 seconds (10s)
[train-lowercase-hi] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/train.tok.50.hi.gz [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/train.tok.50.lc.hi [NOT FOUND]
  cmd=gzip -cd /home/smt/HindiMachineTranslationSystem/1/data/train/train.tok.50.hi.gz | /home/smt/joshua-v6.0.1/scripts/lowercase.perl > /home/smt/HindiMachineTranslationSystem/1/data/train/train.tok.50.lc.hi
  took 2 seconds (2s)
[train-lowercase-en] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/train.tok.50.en.gz [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/train.tok.50.lc.en [NOT FOUND]
  cmd=gzip -cd /home/smt/HindiMachineTranslationSystem/1/data/train/train.tok.50.en.gz | /home/smt/joshua-v6.0.1/scripts/lowercase.perl > /home/smt/HindiMachineTranslationSystem/1/data/train/train.tok.50.lc.en
  took 0 seconds (0s)
[train-vocab-hi] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/corpus.hi [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/vocab.hi [NOT FOUND]
  cmd=cat /home/smt/HindiMachineTranslationSystem/1/data/train/corpus.hi | /home/smt/joshua-v6.0.1/scripts/training/build-vocab.pl > /home/smt/HindiMachineTranslationSystem/1/data/train/vocab.hi
  took 2 seconds (2s)
[train-vocab-en] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/corpus.en [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/vocab.en [NOT FOUND]
  cmd=cat /home/smt/HindiMachineTranslationSystem/1/data/train/corpus.en | /home/smt/joshua-v6.0.1/scripts/training/build-vocab.pl > /home/smt/HindiMachineTranslationSystem/1/data/train/vocab.en
  took 1 seconds (1s)
[tune-copy-hi] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/input/tune.hi [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/tune/tune.hi.gz [NOT FOUND]
  cmd=cat /home/smt/HindiMachineTranslationSystem/input/tune.hi | gzip -9n > /home/smt/HindiMachineTranslationSystem/1/data/tune/tune.hi.gz
  took 0 seconds (0s)
[tune-copy-en] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/input/tune.en [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/tune/tune.en.gz [NOT FOUND]
  cmd=cat /home/smt/HindiMachineTranslationSystem/input/tune.en | gzip -9n > /home/smt/HindiMachineTranslationSystem/1/data/tune/tune.en.gz
  took 0 seconds (0s)
[tune-tokenize-hi] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/tune/tune.hi.gz [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/tune/tune.tok.hi.gz [NOT FOUND]
  cmd=/home/smt/joshua-v6.0.1/scripts/training/scat /home/smt/HindiMachineTranslationSystem/1/data/tune/tune.hi.gz | /home/smt/joshua-v6.0.1/scripts/training/normalize-punctuation.pl hi | /home/smt/joshua-v6.0.1/scripts/training/penn-treebank-tokenizer.perl -l hi 2> /dev/null | gzip -9n > /home/smt/HindiMachineTranslationSystem/1/data/tune/tune.tok.hi.gz
  took 0 seconds (0s)
[tune-tokenize-en] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/tune/tune.en.gz [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/tune/tune.tok.en.gz [NOT FOUND]
  cmd=/home/smt/joshua-v6.0.1/scripts/training/scat /home/smt/HindiMachineTranslationSystem/1/data/tune/tune.en.gz | /home/smt/joshua-v6.0.1/scripts/training/normalize-punctuation.pl en | /home/smt/joshua-v6.0.1/scripts/training/penn-treebank-tokenizer.perl -l en 2> /dev/null | gzip -9n > /home/smt/HindiMachineTranslationSystem/1/data/tune/tune.tok.en.gz
  took 0 seconds (0s)
[tune-lowercase-hi] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/tune/tune.tok.hi.gz [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/tune/tune.tok.lc.hi [NOT FOUND]
  cmd=gzip -cd /home/smt/HindiMachineTranslationSystem/1/data/tune/tune.tok.hi.gz | /home/smt/joshua-v6.0.1/scripts/lowercase.perl > /home/smt/HindiMachineTranslationSystem/1/data/tune/tune.tok.lc.hi
  took 0 seconds (0s)
[tune-lowercase-en] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/tune/tune.tok.en.gz [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/tune/tune.tok.lc.en [NOT FOUND]
  cmd=gzip -cd /home/smt/HindiMachineTranslationSystem/1/data/tune/tune.tok.en.gz | /home/smt/joshua-v6.0.1/scripts/lowercase.perl > /home/smt/HindiMachineTranslationSystem/1/data/tune/tune.tok.lc.en
  took 0 seconds (0s)
[tune-vocab-hi] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/tune/corpus.hi [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/tune/vocab.hi [NOT FOUND]
  cmd=cat /home/smt/HindiMachineTranslationSystem/1/data/tune/corpus.hi | /home/smt/joshua-v6.0.1/scripts/training/build-vocab.pl > /home/smt/HindiMachineTranslationSystem/1/data/tune/vocab.hi
  took 0 seconds (0s)
[tune-vocab-en] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/tune/corpus.en [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/tune/vocab.en [NOT FOUND]
  cmd=cat /home/smt/HindiMachineTranslationSystem/1/data/tune/corpus.en | /home/smt/joshua-v6.0.1/scripts/training/build-vocab.pl > /home/smt/HindiMachineTranslationSystem/1/data/tune/vocab.en
  took 0 seconds (0s)
[test-copy-hi] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/input/test.hi [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/test/test.hi.gz [NOT FOUND]
  cmd=cat /home/smt/HindiMachineTranslationSystem/input/test.hi | gzip -9n > /home/smt/HindiMachineTranslationSystem/1/data/test/test.hi.gz
  took 1 seconds (1s)
[test-copy-en] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/input/test.en [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/test/test.en.gz [NOT FOUND]
  cmd=cat /home/smt/HindiMachineTranslationSystem/input/test.en | gzip -9n > /home/smt/HindiMachineTranslationSystem/1/data/test/test.en.gz
  took 0 seconds (0s)
[test-tokenize-hi] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/test/test.hi.gz [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/test/test.tok.hi.gz [NOT FOUND]
  cmd=/home/smt/joshua-v6.0.1/scripts/training/scat /home/smt/HindiMachineTranslationSystem/1/data/test/test.hi.gz | /home/smt/joshua-v6.0.1/scripts/training/normalize-punctuation.pl hi | /home/smt/joshua-v6.0.1/scripts/training/penn-treebank-tokenizer.perl -l hi 2> /dev/null | gzip -9n > /home/smt/HindiMachineTranslationSystem/1/data/test/test.tok.hi.gz
  took 0 seconds (0s)
[test-tokenize-en] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/test/test.en.gz [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/test/test.tok.en.gz [NOT FOUND]
  cmd=/home/smt/joshua-v6.0.1/scripts/training/scat /home/smt/HindiMachineTranslationSystem/1/data/test/test.en.gz | /home/smt/joshua-v6.0.1/scripts/training/normalize-punctuation.pl en | /home/smt/joshua-v6.0.1/scripts/training/penn-treebank-tokenizer.perl -l en 2> /dev/null | gzip -9n > /home/smt/HindiMachineTranslationSystem/1/data/test/test.tok.en.gz
  took 0 seconds (0s)
[test-lowercase-hi] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/test/test.tok.hi.gz [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/test/test.tok.lc.hi [NOT FOUND]
  cmd=gzip -cd /home/smt/HindiMachineTranslationSystem/1/data/test/test.tok.hi.gz | /home/smt/joshua-v6.0.1/scripts/lowercase.perl > /home/smt/HindiMachineTranslationSystem/1/data/test/test.tok.lc.hi
  took 1 seconds (1s)
[test-lowercase-en] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/test/test.tok.en.gz [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/test/test.tok.lc.en [NOT FOUND]
  cmd=gzip -cd /home/smt/HindiMachineTranslationSystem/1/data/test/test.tok.en.gz | /home/smt/joshua-v6.0.1/scripts/lowercase.perl > /home/smt/HindiMachineTranslationSystem/1/data/test/test.tok.lc.en
  took 0 seconds (0s)
[test-vocab-hi] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/test/corpus.hi [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/test/vocab.hi [NOT FOUND]
  cmd=cat /home/smt/HindiMachineTranslationSystem/1/data/test/corpus.hi | /home/smt/joshua-v6.0.1/scripts/training/build-vocab.pl > /home/smt/HindiMachineTranslationSystem/1/data/test/vocab.hi
  took 0 seconds (0s)
[test-vocab-en] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/test/corpus.en [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/test/vocab.en [NOT FOUND]
  cmd=cat /home/smt/HindiMachineTranslationSystem/1/data/test/corpus.en | /home/smt/joshua-v6.0.1/scripts/training/build-vocab.pl > /home/smt/HindiMachineTranslationSystem/1/data/test/vocab.en
  took 0 seconds (0s)
[source-numlines] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/corpus.en [CHANGED]
  cmd=cat /home/smt/HindiMachineTranslationSystem/1/data/train/corpus.en | wc -l
  took 0 seconds (0s)
[source-numlines] retrieved cached result => 222915
[berkeley-aligner-chunk-0] rebuilding...
  dep=alignments/0/word-align.conf [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/splits/corpus.en.0 [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/splits/corpus.hi.0 [CHANGED]
  dep=alignments/0/training.align [NOT FOUND]
  cmd=java -d64 -Xmx10g -jar /home/smt/joshua-v6.0.1/lib/berkeleyaligner.jar ++alignments/0/word-align.conf
  took 1809 seconds (30m9s)
[aligner-combine] rebuilding...
  dep=alignments/0/training.align [CHANGED]
  dep=alignments/training.align [NOT FOUND]
  cmd=cat alignments/0/training.align > alignments/training.align
  took 1 seconds (1s)
[thrax-input-file] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/corpus.en [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/corpus.hi [CHANGED]
  dep=alignments/training.align [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/thrax-input-file [NOT FOUND]
  cmd=paste /home/smt/HindiMachineTranslationSystem/1/data/train/corpus.en /home/smt/HindiMachineTranslationSystem/1/data/train/corpus.hi alignments/training.align | perl -pe 's/\t/ ||| /g' | grep -v '()' | grep -v '||| \+$' > /home/smt/HindiMachineTranslationSystem/1/data/train/thrax-input-file
  took 1 seconds (1s)
[thrax-run] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/thrax-input-file [CHANGED]
  dep=thrax-hiero.conf [CHANGED]
  dep=grammar.gz [NOT FOUND]
  cmd=hadoop/bin/hadoop jar /home/smt/joshua-v6.0.1/thrax/bin/thrax.jar -D mapred.child.java.opts='-Xmx2g' thrax-hiero.conf thrax > thrax.log 2>&1; rm -f grammar grammar.gz; hadoop/bin/hadoop fs -getmerge thrax/final/ grammar.gz; hadoop/bin/hadoop fs -rmr thrax
  took 393 seconds (6m33s)
[lm-sort-uniq] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/corpus.hi [CHANGED]
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/corpus.hi.uniq [NOT FOUND]
  cmd=/home/smt/joshua-v6.0.1/scripts/training/scat /home/smt/HindiMachineTranslationSystem/1/data/train/corpus.hi | sort -u -T /tmp -S 2G | gzip -9n > /home/smt/HindiMachineTranslationSystem/1/data/train/corpus.hi.uniq
  took 29 seconds (29s)
[srilm] rebuilding...
  dep=/home/smt/HindiMachineTranslationSystem/1/data/train/corpus.hi.uniq [CHANGED]
  dep=lm.gz [NOT FOUND]
  cmd=/home/smt/pj/srilm-1.7.1/bin/i686-m64/ngram-count -order 3 -interpolate -kndiscount -unk -gt3min 1 -gt4min 1 -gt5min 1 -text /home/smt/HindiMachineTranslationSystem/1/data/train/corpus.hi.uniq  -lm lm.gz
  JOB FAILED (return code 1)
one of modified KneserNey discounts is negative
error in discount estimator for order 1

Any hint as to why this error happens? Please help me.

Matt Post

Mar 30, 2015, 9:43:03 AM
to joshua_...@googlegroups.com
It looks like your training data is too small for Kneser-Ney smoothing. You can try adding the --witten-bell flag.
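
For context on the error message: modified Kneser-Ney estimates its discounts from count-of-counts (n1 = number of n-grams seen exactly once, n2 = seen exactly twice, and so on). On small or unusually distributed data, one of those estimates can come out negative. The sketch below uses the closed-form estimates from Chen and Goodman; it is only an illustration, SRILM's own estimator may differ in detail, and the function names and example numbers are made up.

```python
from collections import Counter


def count_of_counts(counts, max_r=4):
    """Return [n1, ..., n_max_r]: how many distinct items occur exactly r times."""
    freq = Counter(counts.values())
    return [freq.get(r, 0) for r in range(1, max_r + 1)]


def modified_kn_discounts(n1, n2, n3, n4):
    """Closed-form modified Kneser-Ney discounts (D1, D2, D3+).

    Y   = n1 / (n1 + 2*n2)
    D1  = 1 - 2*Y*n2/n1
    D2  = 2 - 3*Y*n3/n2
    D3+ = 3 - 4*Y*n4/n3
    """
    if min(n1, n2, n3) == 0:
        raise ValueError("n1, n2 and n3 must be nonzero to estimate discounts")
    y = n1 / (n1 + 2 * n2)
    return (1 - 2 * y * n2 / n1,
            2 - 3 * y * n3 / n2,
            3 - 4 * y * n4 / n3)


# Made-up count-of-counts that fall off smoothly: all discounts are positive.
typical = modified_kn_discounts(1000, 400, 200, 120)

# Made-up count-of-counts where n3 is much larger than n2: D2 goes negative,
# which is the kind of situation the error message complains about.
skewed = modified_kn_discounts(100, 50, 200, 120)

print("typical:", typical)
print("skewed: ", skewed)
```

If you want to check your own data, count_of_counts() applied to a dictionary of n-gram counts gives the n1..n4 values to plug in. Note that for the lower orders Kneser-Ney works on modified (continuation) counts rather than raw counts, so raw unigram counts only give a rough picture.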

You could also build the LM separately, then pass the following flags to the pipeline:

--lmfile /path/to/the/lm/you/built --no-build-lm

where --no-build-lm tells the pipeline not to build another LM from the target side of your training data.
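
As a rough sketch of the second option, the snippet below assembles the two commands as argument lists and prints them so they can be reviewed before running anything. It makes several assumptions: -wbdiscount is SRILM's Witten-Bell option for ngram-count, the output name lm.wb.gz is hypothetical, the executable locations are placeholders you would replace with your own, and the remaining pipeline options are simply copied from the original command.

```python
import shlex


def lm_build_command(corpus, lm_out, order=3, ngram_count="ngram-count"):
    """SRILM command that builds an interpolated Witten-Bell LM.

    Same shape as the failing command in the log, with -kndiscount
    (and the -gt*min options) replaced by -wbdiscount.
    """
    return [ngram_count, "-order", str(order), "-interpolate", "-wbdiscount",
            "-unk", "-text", corpus, "-lm", lm_out]


def pipeline_command(lm_file, pipeline="pipeline.pl"):
    """Pipeline invocation that reuses a prebuilt LM instead of building one."""
    return [pipeline,
            "--rundir", "1",
            "--corpus", "input/train",
            "--tune", "input/tune",
            "--test", "input/test",
            "--aligner", "berkeley",
            "--source", "en", "--target", "hi",
            "--lmfile", lm_file, "--no-build-lm"]


def as_shell(argv):
    """Quote an argument list so it can be pasted into a shell."""
    return " ".join(shlex.quote(arg) for arg in argv)


# Hypothetical paths: the corpus is the tokenized, lowercased target side
# that the pipeline wrote under the run directory; lm.wb.gz is a made-up name.
corpus = "1/data/train/corpus.hi"
lm_file = "lm.wb.gz"

print(as_shell(lm_build_command(corpus, lm_file)))
print(as_shell(pipeline_command(lm_file)))
```

To actually run the commands, pass the same lists to subprocess.run(..., check=True) once the paths point at your SRILM and Joshua installations.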

matt

