Definition of train, tune, and test within the Joshua pipeline context

Lewis John Mcgibbney

Jan 11, 2016, 2:17:31 PM
to Joshua Developers
Hi Folks,
I've acquired two datasets (one Russian, and its equivalent in English) which comply with the following statement from [0]: "These files should be parallel at the sentence level (with one sentence per line), should be in UTF-8, and should be untokenized (tokenization occurs in the pipeline). SOURCE and TARGET denote variables that should be replaced with the actual target and source language abbreviations (e.g., “ur” and “en”)."
I am in the process of experimenting with generating and packaging language packs and have a few questions which I'll be posting here.
My initial question relates to the documentation at [0], specifically what the following files are actually meant to contain:

      train.SOURCE
      train.TARGET
      tune.SOURCE
      tune.TARGET
      test.SOURCE
      test.TARGET

For example, I've got two files which I've renamed train.en and train.ru; however, what am I meant to provide for tune and test?
If someone can answer the above, I'll keep moving through the language pack generation.
Thanks in advance, folks.
Lewis

[0] http://joshua-decoder.org/6.0/pipeline.html
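
Before running the pipeline, it may be worth sanity-checking that the pair of files really meets the quoted requirements (same number of lines, valid UTF-8). A minimal sketch in Python, using the train.ru/train.en names above (the script name is hypothetical):

      # check_parallel.py (hypothetical name): verify a corpus pair is
      # line-parallel and valid UTF-8 before handing it to the pipeline.
      import sys

      def count_lines_utf8(path):
          """Count lines, raising UnicodeDecodeError if the file is not UTF-8."""
          with open(path, encoding="utf-8") as f:  # strict decoding by default
              return sum(1 for _ in f)

      if __name__ == "__main__":
          src, tgt = sys.argv[1], sys.argv[2]  # e.g. train.ru train.en
          n_src, n_tgt = count_lines_utf8(src), count_lines_utf8(tgt)
          if n_src != n_tgt:
              sys.exit("Not parallel: %s has %d lines, %s has %d"
                       % (src, n_src, tgt, n_tgt))
          print("OK: %d sentence pairs" % n_src)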

Matt Post

Jan 11, 2016, 2:29:02 PM
to joshua_d...@googlegroups.com
Hi Lewis,

1. MT models are built from parallel training data, as you know. This is the --train argument.

2. In addition, MT models need to be tuned. There are a lot of parameters in the model that tell it which of its components to trust, and these parameters have to be set to help the MT system produce translations that score well on a good metric (for MT, that's BLEU). Defaults can't really be used, because the right parameters differ for every language pair. This is the --tune argument. Typically, tuning data is just a few thousand sentences, whereas training data can run to many millions. If you just have one big parallel dataset, you can take a few thousand sentences (I'd suggest 3–5k) at random and reserve them for tuning (make sure to remove them from train; see the sketch after this list).

3. Often in research, we wish to test how well a tuned model will generalize. For that scenario, we typically have a separate test set. If you're just building a language pack, you probably don't need this (and you can tell the pipeline to stop at tuning, "--last-step tune").
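
A minimal sketch of the split described in point 2, assuming the train.ru/train.en pair from the original question (the script name is hypothetical, and 3,000 is the low end of the suggested 3–5k):

      # make_tune_split.py (hypothetical name): hold out a random tuning set
      # from a parallel corpus, keeping the two sides aligned.
      import random

      TUNE_SIZE = 3000  # low end of the suggested 3-5k sentences

      with open("train.ru", encoding="utf-8") as f:
          src = f.readlines()
      with open("train.en", encoding="utf-8") as f:
          tgt = f.readlines()
      assert len(src) == len(tgt), "corpus files are not parallel"

      random.seed(42)  # fixed seed so the split is reproducible
      tune_idx = set(random.sample(range(len(src)), TUNE_SIZE))

      # The same index set is used for both sides, so the held-out
      # sentence pairs stay parallel line-by-line.
      for lang, lines in (("ru", src), ("en", tgt)):
          with open("tune." + lang, "w", encoding="utf-8") as tune_f, \
               open("train." + lang + ".new", "w", encoding="utf-8") as train_f:
              for i, line in enumerate(lines):
                  (tune_f if i in tune_idx else train_f).write(line)

After renaming the train.*.new files back to train.*, the train and tune prefixes can be handed to the pipeline via the --train and --tune arguments described above (with "--last-step tune" if no test set is needed).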

matt 




Lewis John Mcgibbney

Jan 11, 2016, 2:34:44 PM
to joshua_d...@googlegroups.com
Dynamite, thank you.




--
Lewis