Joshua v6.0.4 with "--type phrase" problem


Reza Lesmana

Jun 19, 2015, 5:08:10 AM
to joshua_d...@googlegroups.com
Hi devs,

Thanks a lot for your help with my previous experiments. I'm here to report that the test returns a final BLEU of 0.2391.
Even though this is lower than other research I've found on English-Indonesian translation systems (which report a final BLEU around 30%),
I'm happy, since mine runs with default settings, while that other research used a more advanced part-of-speech tagging feature.
Thank you very much.

However, I would like to try to improve the BLEU score, and one option I could think of is the phrase-based method.
From what I know, to run phrase-based I only need two things:
1. Install Moses and set the $MOSES environment variable to point to my Moses installation folder. (done)
2. Add the "--type phrase" parameter to the pipeline invocation. (done)

This is the pipeline invocation I use:

-------------------------------------------------------------------------------
 $JOSHUA/bin/pipeline.pl --type phrase --corpus input/train --tune input/tune --test input/test --source en --target id \
--aligner berkeley --joshua-mem 10g --threads 4 --hadoop-mem 10g

-------------------------------------------------------------------------------

This is almost the same as my previous successful pipeline invocation, but with the added "--type phrase" parameter.
Please tell me if I did something wrong or am missing something needed to invoke the phrase-based method.

Everything goes well through the [filter-tune] step, but on the [tune-bundle] step I stumble on a Java exception.

Here is the tail of the console output:

---------------------------------------------------------------------------------
[filter-tune] rebuilding...
  dep=model/phrase-table.gz [CHANGED]
  dep=/media/thesis/working_moses/data/tune/tune.tok.lc.en
  dep=/media/thesis/working_moses/data/tune/grammar.filtered.gz
  cmd=/media/thesis/joshua-v6.0.4/scripts/support/filter_grammar.sh -g model/phrase-table.gz -f -v /media/thesis/working_moses/data/tune/tune.tok.lc.en | /media/thesis/joshua-v6.0.4/scripts/training/filter-rules.pl -bus3 | gzip -9n > /media/thesis/working_moses/data/tune/grammar.filtered.gz
  took 26 seconds (26s)
[tune-bundle] rebuilding...
  dep=/media/thesis/joshua-v6.0.4/scripts/training/templates/tune/joshua.config
  dep=/media/thesis/working_moses/data/tune/grammar.filtered.gz [CHANGED]
  cmd=/media/thesis/joshua-v6.0.4/scripts/support/run_bundler.py --force --symlink --absolute --verbose /media/thesis/joshua-v6.0.4/scripts/training/templates/tune/joshua.config /media/thesis/working_moses/tune/model --copy-config-options '-top-n 300 -output-format "%i ||| %s ||| %f ||| %c" -mark-oovs false -search stack -weights "lm_0 1 Distortion 1.0 PhrasePenalty 1.0 tm_pt_2 1 tm_pt_1 1 tm_pt_4 1 tm_pt_0 1 tm_pt_3 1 " -feature-function "StateMinimizingLanguageModel -lm_order 5 -lm_file /media/thesis/working_moses/lm.kenlm" -feature-function "Distortion" -feature-function "PhrasePenalty"  -tm0/type moses -tm0/owner pt -tm0/maxspan 0 -tm1 DELETE' --pack-tm /media/thesis/working_moses/data/tune/grammar.filtered.gz
  JOB FAILED (return code 2)
* Running the copy-config.pl script with the command: /media/thesis/joshua-v6.0.4/scripts/copy-config.pl -top-n 300 -output-format "%i ||| %s ||| %f ||| %c" -mark-oovs false -search stack -weights "lm_0 1 Distortion 1.0 PhrasePenalty 1.0 tm_pt_2 1 tm_pt_1 1 tm_pt_4 1 tm_pt_0 1 tm_pt_3 1 " -feature-function "StateMinimizingLanguageModel -lm_order 5 -lm_file /media/thesis/working_moses/lm.kenlm" -feature-function "Distortion" -feature-function "PhrasePenalty"  -tm0/type moses -tm0/owner pt -tm0/maxspan 0 -tm1 DELETE
* Looking for a path in the line:
    feature-function = StateMinimizingLanguageModel -lm_order 5 -lm_file /media/thesis/working_moses/lm.kenlm
* * Found path "/media/thesis/working_moses/lm.kenlm"
* Creating destination directory "/media/thesis/working_moses/tune/model"
* Packing grammar at "/media/thesis/working_moses/data/tune/grammar.filtered.gz" to "/media/thesis/working_moses/tune/model/grammar.filtered.gz.packed"
* Running the grammar-packer.pl script with the command: /media/thesis/joshua-v6.0.4/scripts/support/grammar-packer.pl /media/thesis/working_moses/data/tune/grammar.filtered.gz /media/thesis/working_moses/tune/model/grammar.filtered.gz.packed
Jun 19, 2015 8:41:24 AM joshua.tools.GrammarPacker main
INFO: Will be writing to /media/thesis/working_moses/tune/model/grammar.filtered.gz.packed
Jun 19, 2015 8:41:24 AM joshua.tools.GrammarPacker <init>
INFO: No alignments file or grammar specified, skipping.
Jun 19, 2015 8:41:24 AM joshua.tools.GrammarPacker <init>
INFO: No config specified. Attempting auto-detection of feature types.
Jun 19, 2015 8:41:24 AM joshua.tools.GrammarPacker pack
INFO: Beginning exploration pass.
Jun 19, 2015 8:41:24 AM joshua.tools.GrammarPacker pack
INFO: Exploring: /tmp/grammar.filtered.gzNXM_
........10........20........30........40........50........60........70........80........90.....100%
Jun 19, 2015 8:41:38 AM joshua.tools.GrammarPacker pack
INFO: Exploration pass complete. Freezing vocabulary and finalizing encoders.
Jun 19, 2015 8:41:38 AM joshua.util.encoding.FeatureTypeAnalyzer inferTypes
INFO: Type inferred: 0 is float
Jun 19, 2015 8:41:38 AM joshua.util.encoding.FeatureTypeAnalyzer inferTypes
INFO: Type inferred: 1 is float
Jun 19, 2015 8:41:38 AM joshua.util.encoding.FeatureTypeAnalyzer inferTypes
INFO: Type inferred: 2 is float
Jun 19, 2015 8:41:38 AM joshua.util.encoding.FeatureTypeAnalyzer inferTypes
INFO: Type inferred: 3 is float
Jun 19, 2015 8:41:38 AM joshua.tools.GrammarPacker pack
INFO: Type inference complete.
Jun 19, 2015 8:41:38 AM joshua.tools.GrammarPacker pack
INFO: Finalizing encoding.
Jun 19, 2015 8:41:38 AM joshua.tools.GrammarPacker pack
INFO: Writing encoding.
Jun 19, 2015 8:41:38 AM joshua.tools.GrammarPacker pack
INFO: Freezing vocab.
Jun 19, 2015 8:41:39 AM joshua.tools.GrammarPacker pack
INFO: Writing vocab.
Writing vocabulary: 47675 tokens
Jun 19, 2015 8:41:39 AM joshua.tools.GrammarPacker pack
INFO: Writing config to '/media/thesis/working_moses/tune/model/grammar.filtered.gz.packed/config'
Jun 19, 2015 8:41:39 AM joshua.tools.GrammarPacker pack
INFO: Reading encoding.
Jun 19, 2015 8:41:39 AM joshua.tools.GrammarPacker pack
INFO: Beginning packing pass.
Jun 19, 2015 8:41:39 AM joshua.tools.GrammarPacker$FeatureBuffer <init>
INFO: Encoding feature ids in: byte
Exception in thread "main" java.lang.NumberFormatException: For input string: "p"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:580)
        at java.lang.Integer.parseInt(Integer.java:615)
        at joshua.util.FormatUtils.getNonterminalIndex(FormatUtils.java:61)
        at joshua.tools.GrammarPacker.binarize(GrammarPacker.java:400)
        at joshua.tools.GrammarPacker.pack(GrammarPacker.java:179)
        at joshua.tools.GrammarPacker.main(GrammarPacker.java:616)
* FATAL: Couldn't pack the grammar.
* __init__() takes at least 3 arguments (2 given)

-----------------------------------------------------------------------------------------------------

Could anyone please help me solve this problem?

Thanks a lot.

Regards,
Reza Lesmana

Reza Lesmana

Jun 20, 2015, 3:14:58 AM
to joshua_d...@googlegroups.com
Hi devs,

Here is my effort to track down the problem.

I've been following the Java program joshua.tools.GrammarPacker, and I can see that it reads grammar.filtered.gz (according to the console output).

Sample lines from the grammar.filtered file are below:

----------------------------------------------------------------------------------
" ||| Buku-buku ||| 0.00232983 0.1 0.00232983 0.0344828 ||| 0-0 ||| 8 8 1 ||| |||
" ||| Masa ||| 0.000582457 0.02 0.00232983 0.0344828 ||| 0-0 ||| 32 8 1 ||| |||
" ||| dapat ||| 3.11006e-06 7.87e-05 0.00232983 0.0344828 ||| 0-0 ||| 5993 8 1 ||| |||
" ||| di ||| 8.36976e-07 1.95e-05 0.00232983 0.0344828 ||| 0-0 ||| 22269 8 1 ||| |||
" ||| dituduh..." ||| 0.00931931 1 0.00232983 0.0344828 ||| 0-0 ||| 2 8 1 ||| |||
" ||| favorit ||| 0.000219278 0.0056497 0.00232983 0.0344828 ||| 0-0 ||| 85 8 1 ||| |||
" ||| kambing ||| 0.000745545 0.0333333 0.00232983 0.0344828 ||| 0-0 ||| 25 8 1 ||| |||
" ||| saya ||| 1.2825e-06 3.77e-05 0.00232983 0.0344828 ||| 0-0 ||| 14533 8 1 ||| |||
$ ||| USD ||| 0.000642711 0.0357143 0.00931931 0.2 ||| 0-0 ||| 29 2 1 ||| |||
$ ||| rumah ||| 1.28898e-05 0.0003061 0.00931931 0.2 ||| 0-0 ||| 1446 2 1 ||| |||
'[God] ||| domba-domba, dipimpin-Nya mereka seperti kawanan ||| 0.00931931 1 0.00372772 7.42756e-17 ||| 0-0 ||| 2 5 1 ||| |||
'[God] ||| domba-domba, dipimpin-Nya mereka seperti ||| 0.00931931 1 0.00372772 5.94205e-12 ||| 0-0 ||| 2 5 1 ||| |||
'[God] ||| domba-domba, dipimpin-Nya mereka ||| 0.00931931 1 0.00372772 1.85382e-09 ||| 0-0 ||| 2 5 1 ||| |||
'[God] ||| domba-domba, dipimpin-Nya ||| 0.00931931 1 0.00372772 3e-07 ||| 0-0 ||| 2 5 1 ||| |||
'[God] ||| domba-domba, ||| 0.00931931 1 0.00372772 0.333333 ||| 0-0 ||| 2 5 1 ||| |||
'[If] you ||| kasih: ||| 0.00207096 0.00205317 0.00931931 1 ||| 0-0 ||| 9 2 1 ||| |||
'[If] you ||| pilih kasih: ||| 0.00207096 0.00205317 0.00931931 2.77e-05 ||| 0-1 ||| 9 2 1 ||| |||
'[If] ||| kasih: ||| 0.00207096 0.333333 0.00931931 1 ||| 0-0 ||| 9 2 1 ||| |||
'[If] ||| pilih kasih: ||| 0.00207096 0.333333 0.00931931 2.77e-05 ||| 0-1 ||| 9 2 1 ||| |||
-------------------------------------------------------------------------------------------------------------

This is my outline of what the GrammarPacker program does, to track down the problem. Please tell me if I'm wrong.

1. It reads the grammar.filtered contents line by line.
2. Every line is split into an array using the delimiter "|||".
3. The target_words array is assigned the whitespace-split values of arrays[1]; for example, "domba-domba, dipimpin-Nya mereka seperti kawanan"
becomes the array {domba-domba, dipimpin-Nya, mereka, seperti, kawanan}.
4. Then, starting from line 397 of GrammarPacker, for every item in the target_words array:
a. If the item is a nonterminal (i.e., its length is >= 3 and it starts with "[" and ends with "]"),
it extracts the nonterminal index via substring(length-2, length-1) (if I'm not mistaken, this takes only the single character before the final "]")
and then parses it as an Integer. And this is where the error happens, because it is parsing a character that is not a number.

From my grammar.filtered, these are the target_words entries that contain a nonterminal item (if I'm reading the program correctly):
----------------------------------------------------------------
 ["Apakah arti kesetaraan bagimu?"] ["Pernikahan"] 
 [Batuk] 
 [Kuak] 
 [Ketika] 
 [Penebusku] 
 [yang] 
 [SPP] 
 [bliip] 
 [sekolah] 
 [namun] 
 [kesehatan] 
 masyarakat [yang] mencakup 
 [binatang] yang 
 [Shanghai] lebih 
 telah memanggil [kita] keluar dari 
 memanggil [kita] keluar dari kegelapan 
 [Kamu] 
 [SPP] 
 sini [pertumbuhan] populasi masuk karena 
 pada digunjingkan, [lebih baik] tidak 
 [Ricard] 
 [kanan] adalah cara lisogenik 
 lahir, saya ditolong dukun [beranak] 
 ["Kebebasan"] 
----------------------------------------------------------------

So, based on this, the program encounters a NumberFormatException, because the last character of each nonterminal item is not a number.
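If my reading is right, the failing step can be reproduced in isolation. This is only a minimal sketch of the behavior outlined in my steps above, not Joshua's actual source; the helper name and the test strings are my own.

```java
// Hypothetical sketch of the parsing step described above: take the single
// character before the closing "]" and parse it as an integer nonterminal index.
public class NonterminalIndexSketch {

    static int nonterminalIndex(String nt) {
        // "[X,1]" -> the character before "]" is "1"; "[SPP]" -> it is "P"
        return Integer.parseInt(nt.substring(nt.length() - 2, nt.length() - 1));
    }

    public static void main(String[] args) {
        // A genuine hierarchical nonterminal parses fine:
        System.out.println(nonterminalIndex("[X,1]")); // prints 1

        // A bracketed plain word, like the items from my grammar, blows up:
        try {
            nonterminalIndex("[SPP]");
        } catch (NumberFormatException e) {
            // Same failure mode as in the packer log above
            System.out.println("NumberFormatException: " + e.getMessage());
        }
    }
}
```

That would match the stack trace, where FormatUtils.getNonterminalIndex fails inside Integer.parseInt.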

Did I read the program the wrong way? Can I just delete the lines from grammar.filtered that contain these?

Regards,
Reza Lesmana

Matt Post

Jun 20, 2015, 6:52:36 AM
to joshua_d...@googlegroups.com
Hi,

23.91 sounds like a solid baseline score. Are you using the same data as reported in other research?

Regarding grammar packing, how big is the grammar you are packing? If it's not too big, perhaps you could post a download link to it, and I can take a look.

matt


--
You received this message because you are subscribed to the Google Groups "Joshua Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to joshua_develop...@googlegroups.com.
To post to this group, send email to joshua_d...@googlegroups.com.
Visit this group at http://groups.google.com/group/joshua_developers.
For more options, visit https://groups.google.com/d/optout.

Matt Post

Jun 20, 2015, 6:57:32 AM
to joshua_d...@googlegroups.com
You somehow have an illegal character in your phrase table; brackets [] are not allowed, and should have been removed by the tokenizer during preprocessing, e.g.,

$ echo "[this] is a test" | /Users/post/code/joshua/scripts/training/penn-treebank-tokenizer.perl -l en 2> /dev/null
-LSB- this -RSB- is a test

This is done because [] are used to denote nonterminals in Joshua's hierarchical file format. When packing grammars, Joshua checks whether the line starts with a [, in which case it infers it is dealing with a hierarchical grammar. Hierarchical lines look like:

[X] ||| el chico [X,1] ||| the [X,1] boy ||| ...

while phrase-based lines have no nonterminal:

el chico pequeño ||| the little boy ||| ...

Your corpus was not properly tokenized, and the packer is trying to interpret your lines as nonterminals.
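To make the distinction concrete, here is a rough illustration of the kind of check described above; the helper name and the split regex are my own, not Joshua's actual implementation. A rule whose left-hand side begins with a bracket is taken to be hierarchical, so a stray bracketed token from untokenized text gets misread as a nonterminal.

```java
// Hypothetical illustration (not Joshua's source) of hierarchical-vs-phrase
// detection on a " ||| "-delimited grammar rule line.
public class GrammarLineKind {

    static boolean looksHierarchical(String ruleLine) {
        // The left-hand side is everything before the first "|||" separator.
        String lhs = ruleLine.split("\\s*\\|\\|\\|\\s*", 2)[0].trim();
        return lhs.startsWith("[") && lhs.endsWith("]");
    }

    public static void main(String[] args) {
        // A genuine hierarchical rule:
        System.out.println(looksHierarchical("[X] ||| el chico [X,1] ||| the [X,1] boy ||| 0.5")); // true
        // A normal phrase-based rule:
        System.out.println(looksHierarchical("el chico ||| the boy ||| 0.5")); // false
        // A leftover bracketed word from untokenized text is misclassified:
        System.out.println(looksHierarchical("[Batuk] ||| batuk ||| 0.5")); // true
    }
}
```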

What is the pipeline command you are using? It should be calling the tokenizer unless you specify --no-prepare.

matt


Reza Lesmana

Jun 20, 2015, 8:12:59 AM
to joshua_d...@googlegroups.com
Hi, Matt

Thanks a lot for replying (again) :D

To reply to your previous email: no, I don't have exactly the same data as reported in the other research, since some parts of their data are licensed and cannot be shared publicly. So mine is about ~15k sentences smaller than theirs. But I added my own data processed from WIT3 ( https://wit3.fbk.eu/ ), the subtitles of TED talks provided by the TED Open Translation Project. My tune data and test data also come from WIT3, but are mutually exclusive from the training data.

The reason I'm pursuing the phrase type is that I'm interested in whether going phrase-based improves the BLEU score, and it's going to be in my thesis too. One of the topics in my thesis is using Joshua to create a baseline, so I need to showcase some of Joshua's other options.

And yes, that was exactly my thought when I viewed the grammar.filtered contents: "Why in the hell does it still have these brackets?"

I use this pipeline invocation (as stated in my first email on this thread):

---------------------------------------------------------------------------------------

 $JOSHUA/bin/pipeline.pl --type phrase --corpus input/train --tune input/tune --test input/test --source en --target id \
--aligner berkeley --joshua-mem 10g --threads 4 --hadoop-mem 10g
---------------------------------------------------------------------------------------

This is almost exactly the same command as my previous successful pipeline invocation; the only difference is the "--type phrase" parameter.

And since I was curious whether the corpus had been cleaned, I checked the log once more, and I think the corpus has been tokenized.

These are the lines where the tokenizer runs:

--------------------------------------
.........................
[train-tokenize-en] rebuilding...
  dep=/media/thesis/working_moses2/data/train/train.en.gz [CHANGED]
  dep=/media/thesis/working_moses2/data/train/train.tok.en.gz [NOT FOUND]
  cmd=/media/thesis/joshua-v6.0.4/scripts/training/scat /media/thesis/working_moses2/data/train/train.en.gz | /media/thesis/joshua-v6.0.4/scripts/training/normalize-punctuation.pl en | /media/thesis/joshua-v6.0.4/scripts/training/penn-treebank-tokenizer.perl -l en 2> /dev/null | gzip -9n > /media/thesis/working_moses2/data/train/train.tok.en.gz
  took 27 seconds (27s)
[train-tokenize-id] rebuilding...
  dep=/media/thesis/working_moses2/data/train/train.id.gz [CHANGED]
  dep=/media/thesis/working_moses2/data/train/train.tok.id.gz [NOT FOUND]
  cmd=/media/thesis/joshua-v6.0.4/scripts/training/scat /media/thesis/working_moses2/data/train/train.id.gz | /media/thesis/joshua-v6.0.4/scripts/training/normalize-punctuation.pl id | /media/thesis/joshua-v6.0.4/scripts/training/penn-treebank-tokenizer.perl -l id 2> /dev/null | gzip -9n > /media/thesis/working_moses2/data/train/train.tok.id.gz
  took 25 seconds (25s)
......................................
.... some lines omitted .....
......................................
[tune-tokenize-en] rebuilding...
  dep=/media/thesis/working_moses2/data/tune/tune.en.gz [CHANGED]
  dep=/media/thesis/working_moses2/data/tune/tune.tok.en.gz [NOT FOUND]
  cmd=/media/thesis/joshua-v6.0.4/scripts/training/scat /media/thesis/working_moses2/data/tune/tune.en.gz | /media/thesis/joshua-v6.0.4/scripts/training/normalize-punctuation.pl en | /media/thesis/joshua-v6.0.4/scripts/training/penn-treebank-tokenizer.perl -l en 2> /dev/null | gzip -9n > /media/thesis/working_moses2/data/tune/tune.tok.en.gz
  took 0 seconds (0s)
[tune-tokenize-id] rebuilding...
  dep=/media/thesis/working_moses2/data/tune/tune.id.gz [CHANGED]
  dep=/media/thesis/working_moses2/data/tune/tune.tok.id.gz [NOT FOUND]
  cmd=/media/thesis/joshua-v6.0.4/scripts/training/scat /media/thesis/working_moses2/data/tune/tune.id.gz | /media/thesis/joshua-v6.0.4/scripts/training/normalize-punctuation.pl id | /media/thesis/joshua-v6.0.4/scripts/training/penn-treebank-tokenizer.perl -l id 2> /dev/null | gzip -9n > /media/thesis/working_moses2/data/tune/tune.tok.id.gz
  took 0 seconds (0s)
......................................
.... some lines omitted .....
......................................
[test-tokenize-en] rebuilding...
  dep=/media/thesis/working_moses2/data/test/test.en.gz [CHANGED]
  dep=/media/thesis/working_moses2/data/test/test.tok.en.gz [NOT FOUND]
  cmd=/media/thesis/joshua-v6.0.4/scripts/training/scat /media/thesis/working_moses2/data/test/test.en.gz | /media/thesis/joshua-v6.0.4/scripts/training/normalize-punctuation.pl en | /media/thesis/joshua-v6.0.4/scripts/training/penn-treebank-tokenizer.perl -l en 2> /dev/null | gzip -9n > /media/thesis/working_moses2/data/test/test.tok.en.gz
  took 0 seconds (0s)
[test-tokenize-id] rebuilding...
  dep=/media/thesis/working_moses2/data/test/test.id.gz [CHANGED]
  dep=/media/thesis/working_moses2/data/test/test.tok.id.gz [NOT FOUND]
  cmd=/media/thesis/joshua-v6.0.4/scripts/training/scat /media/thesis/working_moses2/data/test/test.id.gz | /media/thesis/joshua-v6.0.4/scripts/training/normalize-punctuation.pl id | /media/thesis/joshua-v6.0.4/scripts/training/penn-treebank-tokenizer.perl -l id 2> /dev/null | gzip -9n > /media/thesis/working_moses2/data/test/test.tok.id.gz
  took 0 seconds (0s)
.................
.................
--------------------------------------

But yes, I'm still curious whether the corpus has been properly tokenized. I'm afraid I somehow did not run it from clean folders,
so I will run the pipeline command once more from clean folders and report back as soon as possible.

Regards,
Reza Lesmana

Reza Lesmana

Jun 20, 2015, 10:39:24 AM
to joshua_d...@googlegroups.com
Hi, Matt

I've rerun the pipeline invocation from clean folders. It passes [tune-bundle] :D, so yes, I think I wasn't running from clean folders before.
But I've encountered another problem, and I've traced it to the [mert] step.

This is the tail of the console output:

---------------------------------------------------------------------------------------------------
[tune-bundle] rebuilding...
  dep=/media/thesis/joshua-v6.0.4/scripts/training/templates/tune/joshua.config [CHANGED]
  dep=/media/thesis/workdir_moses/data/tune/grammar.filtered.gz [CHANGED]
  cmd=/media/thesis/joshua-v6.0.4/scripts/support/run_bundler.py --force --symlink --absolute --verbose /media/thesis/joshua-v6.0.4/scripts/training/templates/tune/joshua.config /media/thesis/workdir_moses/tune/model --copy-config-options '-top-n 300 -output-format "%i ||| %s ||| %f ||| %c" -mark-oovs false -search stack -weights "lm_0 1 Distortion 1.0 PhrasePenalty 1.0 tm_pt_3 1 tm_pt_2 1 tm_pt_0 1 tm_pt_1 1 " -feature-function "StateMinimizingLanguageModel -lm_order 5 -lm_file /media/thesis/workdir_moses/lm.kenlm" -feature-function "Distortion" -feature-function "PhrasePenalty"  -tm0/type moses -tm0/owner pt -tm0/maxspan 0 -tm1 DELETE' --pack-tm /media/thesis/workdir_moses/data/tune/grammar.filtered.gz
  took 53 seconds (53s)
[mert] rebuilding...
  dep=/media/thesis/workdir_moses/data/tune/corpus.en [CHANGED]
  dep=/media/thesis/workdir_moses/tune/model/joshua.config [CHANGED]
  dep=tune/model/grammar.filtered.gz.packed/slice_00000.source [CHANGED]
  dep=/media/thesis/workdir_moses/tune/joshua.config.final [NOT FOUND]
  cmd=/media/thesis/joshua-v6.0.4/scripts/training/run_tuner.py /media/thesis/workdir_moses/data/tune/corpus.en /media/thesis/workdir_moses/data/tune/corpus.id --tunedir /media/thesis/workdir_moses/tune --tuner mert --decoder-config /media/thesis/workdir_moses/tune/model/joshua.config --iterations 15
  took 5 seconds (5s)
[filter-test] rebuilding...
  dep=model/phrase-table.gz [CHANGED]
  dep=/media/thesis/workdir_moses/data/test/corpus.en [CHANGED]
  dep=/media/thesis/workdir_moses/data/test/grammar.filtered.gz [NOT FOUND]
  cmd=/media/thesis/joshua-v6.0.4/scripts/support/filter_grammar.sh -g model/phrase-table.gz -f -v /media/thesis/workdir_moses/data/test/corpus.en | /media/thesis/joshua-v6.0.4/scripts/training/filter-rules.pl -bus3 | gzip -9n > /media/thesis/workdir_moses/data/test/grammar.filtered.gz
  took 33 seconds (33s)
[test-bundle] rebuilding...
  dep=/media/thesis/workdir_moses/tune/joshua.config.final [NOT FOUND]
  dep=/media/thesis/workdir_moses/data/test/grammar.filtered.gz [CHANGED]
  dep=/media/thesis/workdir_moses/test/model/joshua.config [NOT FOUND]
  cmd=/media/thesis/joshua-v6.0.4/scripts/support/run_bundler.py --force --symlink --verbose /media/thesis/workdir_moses/tune/joshua.config.final test/model --copy-config-options '-top-n 300 -output-format "%i ||| %s ||| %f ||| %c" -mark-oovs false' --pack-tm /media/thesis/workdir_moses/data/test/grammar.filtered.gz
  JOB FAILED (return code 2)
ERROR:root:ERROR: argument config: can't open '/media/thesis/workdir_moses/tune/joshua.config.final': [Errno 2] No such file or directory: '/media/thesis/workdir_moses/tune/joshua.config.final'
----------------------------------------------------------------------------------------------------------------

In the console output, [test-bundle] failed because it cannot find the $work_dir/tune/joshua.config.final file. This should be produced by the
tuning step, right? So I opened the log of the [mert] step.

This is the tail of $work_dir/tune/mert.log:

-----------------------------------
--- Starting Z-MERT iteration #1 @ Sat Jun 20 13:50:48 UTC 2015 ---
Decoding using initial weight vector {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, -2.844814, 1.0}
Running external decoder...
Call to decoder returned 1; was expecting 0.
Z-MERT exiting prematurely (MertCore returned 30)...
-----------------------------------

I also opened $work_dir/tune/joshua.log:

---------------------------------------------------
Input 0: 300-best extraction took 0.462 seconds
Input 4: <s> and along the way i 've started noticing - i 'm on my third generation of kids - that they 're getting bigger . </s>
Input 3: 300-best extraction took 0.331 seconds
Input 5: <s> they 're getting sicker . </s>
Input 1: 300-best extraction took 0.748 seconds
Input 6: <s> in addition to these complexities , i just learned that 70 percent of the kids that i see who are labeled learning disabled would not have been had they had proper prenatal nutrition . </s>
Input 2: Collecting options took 0.000 seconds
Input 5: Collecting options took 0.000 seconds
Input 5: Search took 0.020 seconds
Input 5: Translation took 0.128 seconds
Memory used after sentence 5 is 216.5 MB
Translation 5: -34.493 mereka makin asas .
Input 2: FATAL UNCAUGHT EXCEPTION: null
java.lang.NullPointerException
        at joshua.decoder.phrase.Candidate.score(Candidate.java:214)
        at joshua.decoder.phrase.Candidate.compareTo(Candidate.java:136)
        at joshua.decoder.phrase.Candidate.compareTo(Candidate.java:19)
        at java.util.HashMap.compareComparables(HashMap.java:371)
        at java.util.HashMap$TreeNode.treeify(HashMap.java:1920)
        at java.util.HashMap.treeifyBin(HashMap.java:771)
        at java.util.HashMap.putVal(HashMap.java:643)
        at java.util.HashMap.put(HashMap.java:611)
        at java.util.HashSet.add(HashSet.java:219)
        at joshua.decoder.phrase.Stack.addCandidate(Stack.java:125)
        at joshua.decoder.phrase.Stacks.search(Stacks.java:166)
        at joshua.decoder.DecoderThread.translate(DecoderThread.java:113)
        at joshua.decoder.Decoder$DecoderThreadRunner.run(Decoder.java:226)
---------------------------------------------------

Well, that's a new problem for now. 

Is there anything I can do to solve this problem? I've managed to clone the Joshua decoder from GitHub into Eclipse,
but I don't really know how to build a nightly release of Joshua. Is there a pointer on how to build Joshua and how to debug it?
I don't have much experience working with Eclipse.

Thanks a lot for your help, Matt. It really means so much to me. :)

Btw, if you would like to run it yourself, you can download my data here (temporary): http://ecomweb.azurewebsites.net/input.zip

This is my pipeline invocation command this time:


-------------------------
$JOSHUA/bin/pipeline.pl --type phrase --corpus input/train --tune input/tune --test input/test --source en --target id --aligner berkeley --joshua-mem 10g --threads 4
-------------------------

Regards,
Reza Lesmana

Matt Post

Jun 20, 2015, 7:31:42 PM
to joshua_d...@googlegroups.com
I will look at this a bit later. You've likely hit another unexpected input piece, but Joshua should handle this much more gracefully, so I will fix it.

Can you send me /media/thesis/workdir_moses/data/test/corpus.en?

matt


Reza Lesmana

Jun 21, 2015, 12:08:17 AM
to joshua_d...@googlegroups.com
Hi, Matt

Thanks a lot. Do you mean /media/thesis/workdir_moses/data/tune/corpus.en or /media/thesis/workdir_moses/data/test/corpus.en?

Anyway, I've attached both of them. 

/media/thesis/workdir_moses/data/tune/corpus.en links to tune.tok.lc.en

/media/thesis/workdir_moses/data/test/corpus.en links to test.tok.lc.en

Both are in the corpus_en.zip

If you need other files, please let me know.

Regards,
Reza Lesmana
corpus_en.zip

Matt Post

Jun 22, 2015, 11:03:07 PM
to joshua_d...@googlegroups.com
I don't see anything wrong with the input.

If you want to package up the whole working dir and share that with me, I can look. 

It may be a weird threading issue — what if you do just one thread?

matt



Reza Lesmana

Jun 23, 2015, 12:03:33 AM
to joshua_d...@googlegroups.com

Hi, Matt

If it's a weird threading issue, why is the error message consistent when I rerun it?

I'll try to give you a link to the working_dir as soon as possible.

Regards,
Reza Lesmana

Matt Post

Jun 23, 2015, 8:06:07 AM
to joshua_d...@googlegroups.com
Yes, a threading issue is unlikely. How big is your working directory? Or how big is the tarball when you type this?

tar czvf for-matt.tgz data/tune/corpus* model lm.kenlm tune

matt


Reza Lesmana

Jun 28, 2015, 10:08:54 PM
to joshua_d...@googlegroups.com
Hi, Matt

Sorry for the long wait. Busy week at work. 

So, here is what you asked for:

https://drive.google.com/file/d/0BxxPYYoLfeotRi1SMndZWjFRWUU/view?usp=sharing

It's about 366MB

And this is the whole working directory:

https://drive.google.com/file/d/0BxxPYYoLfeotYTRYRW5EbTg0YTg/view?usp=sharing

It's about 563MB

FYI, this is another run on my local server in VirtualBox. I'm running it with this pipeline invocation:

-------------------------
$JOSHUA/bin/pipeline.pl --type phrase --corpus input/train --tune input/tune --test input/test --source en --target id --aligner berkeley --joshua-mem 5g --threads 2
-------------------------

It still has the same error during [mert]. 

Regards,
Reza Lesmana

Reza Lesmana

Jul 23, 2015, 3:21:57 AM
to Joshua Developers, lesman...@gmail.com
Hi, Matt

I want to follow up on this problem. Is there anything else I can do to help solve it?

I'm still waiting on a fix so that I can include the results in my thesis.

Hopefully the phrase-based translation feature of Joshua can be used in my thesis before my deadline.

Regards,
Reza Lesmana

Matt Post

Jul 24, 2015, 10:42:43 AM
to joshua_d...@googlegroups.com, lesman...@gmail.com
Hi Reza,

Thanks for the prompt. I'll take a look today.

matt

Matt Post

Jul 28, 2015, 11:24:41 PM
to joshua_d...@googlegroups.com, lesman...@gmail.com
Where by "today" I meant today.

I downloaded this and ran it without any problems.

Are you sure you are using Joshua 6.0.4? (what is $JOSHUA set to?) These are strange problems. What platform are you on?

Did you try --threads 1, just to be sure?

matt

Reza Lesmana

Aug 2, 2015, 8:18:25 AM
to Joshua Developers, lesman...@gmail.com
Hi, Matt

Sorry, I missed your previous email, and I've been having quite a busy time at work.

Yes, I'm sure I'm running Joshua 6.0.4.

I'm on Ubuntu Server 15.04, 64-bit, using 64-bit Oracle Java 8.

I don't think I've run it with "--threads 1" from a clean folder. I'll give that a try first, then report here.

This is quite a strange error. Btw, did you finish the test phase too? I'm curious about the BLEU result for the phrase type of Joshua.
Could you tell me what the BLEU result is?

And what is your system setup (OS, Java version, and other related details)? I'll try to match it in a VM and see if it goes well.

Thanks a lot, Matt.

Regards,
Reza Lesmana

Matt Post

Aug 10, 2015, 10:43:32 AM
to joshua_d...@googlegroups.com, lesman...@gmail.com
Hi Reza,

Any updates? I ran this on an OS X machine (10.10.4). Do you have another Linux machine you can test on?

I am currently testing a complete rebuild.

matt

Matt Post

Aug 13, 2015, 1:01:01 PM
to joshua_d...@googlegroups.com, lesman...@gmail.com
I ran this start-to-finish over your data, and got a BLEU score of 23.47 on the test set. The tuning set scores were 28.*. 

So everything is working here on my OS X machine. I'd suggest a complete rerun.

matt