Kenneth Heafield
Jun 19, 2013, 2:15:00 PM
to jane-...@googlegroups.com
Dear Jane,
Section 8.3.1 of your manual says "In the standard formulation of the hierarchical phrase-based translation model two additional rules are added:"
0 0 0 0 0 0 0 1 0 # S # X~0 # X~0 # 1 1 1 1 1
0 0 0 0 0 0 0 1 1 # S # S~0 X~1 # S~0 X~1 # 1 1 1 1 1
I ran Moses and cdec on the same Hiero model and, after accounting for
slight differences in feature definition, was puzzled as to why Moses
found better hypotheses with the same cube pruning pop limit. The
difference came down to glue rule formulation.
The Jane manual and cdec effectively come with these glue rules:
S -> X
S -> S X
GOAL -> <s> S </s>
along with the hard-coded constraint that S -> X only applies for
target-side position 0.
Moses and Joshua (after I converted Joshua to the Moses way) come with these glue rules:
S -> <s>
S -> S X
GOAL -> S </s>
where <s> and </s> are first-class words like any other. This also
means that Moses charges two extra word penalties (<s> and </s>) and one
extra glue rule application; I have subtracted these out of the model
score for purposes of comparison.
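
As a rough illustration of that bookkeeping, the sketch below lists the glue steps each formulation uses to chain n X constituents and then removes the extra charges from a Moses score. The function names and the per-application contributions (already weighted feature values) are hypothetical and are not taken from any of the decoders.

```python
def jane_cdec_glue_steps(n):
    """Glue steps for n X constituents under S -> X, S -> S X, GOAL -> <s> S </s>."""
    return ["S -> X"] + ["S -> S X"] * (n - 1) + ["GOAL -> <s> S </s>"]


def moses_glue_steps(n):
    """Glue steps for n X constituents under S -> <s>, S -> S X, GOAL -> S </s>."""
    return ["S -> <s>"] + ["S -> S X"] * n + ["GOAL -> S </s>"]


def extra_glue_applications(n):
    """How many more glue steps the Moses formulation uses for the same sentence."""
    return len(moses_glue_steps(n)) - len(jane_cdec_glue_steps(n))


def comparable_moses_score(moses_score, word_penalty_contrib, glue_contrib):
    """Remove two word penalties (<s>, </s>) and one glue rule application.

    Assumes a linear model in which each word and each glue application adds
    a fixed, already-weighted contribution to the score (hypothetical values).
    """
    return moses_score - 2 * word_penalty_contrib - 1 * glue_contrib


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, extra_glue_applications(n))
    print(comparable_moses_score(-10.0, -0.5, -1.0))
```

The difference in glue steps is the same for every n, which is why a constant correction is enough.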
In the Moses formulation, glued hypotheses know that they are bound to
the beginning of sentence <s>. Therefore, their left language model
state is empty and more hypotheses can recombine. This is better than
Jane/cdec, where hypotheses are informed about <s> and </s> only after
forming a complete sentence, so the left states are spuriously
ambiguous. Moreover, hypothesis score estimates are more accurate when
they know about <s> earlier.
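
A toy sketch of the recombination argument, under simplifying assumptions: a trigram language model, hypotheses over the same source span, and a left state that is simply the first n-1 words unless the hypothesis already starts with <s>. Real decoders minimize state further; the names here are hypothetical.

```python
ORDER = 3  # assumed n-gram order (hypothetical trigram LM)


def lm_state(words, order=ORDER):
    """Return a simplified (left_state, right_state) for a partial hypothesis."""
    ctx = order - 1
    if words and words[0] == "<s>":
        # Anchored to the sentence start: no further left context can arrive,
        # so nothing on the left remains to be rescored.
        left = ()
    else:
        # Leading words may still be rescored once left context is known.
        left = tuple(words[:ctx])
    right = tuple(words[-ctx:]) if ctx else ()
    return left, right


def recombination_classes(hyps, order=ORDER):
    """Count distinct states; hypotheses sharing a state can recombine."""
    return len({lm_state(h, order) for h in hyps})


# Two glued hypotheses that differ only in their first two words.
late_boundary = [["a", "b", "c", "d"], ["x", "y", "c", "d"]]   # Jane/cdec style
early_boundary = [["<s>"] + h for h in late_boundary]          # Moses style

if __name__ == "__main__":
    print(recombination_classes(late_boundary))   # 2: left states differ
    print(recombination_classes(early_boundary))  # 1: both left states are empty
```

With <s> attached early, the two hypotheses collapse into one state, so the same pop limit covers more distinct translations.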
To confirm this intuition, I convinced cdec to run with the Moses glue
grammar. It improved from the blue curve to the green curve that you
see in the attached plot. The pop limits range from 25 to 1000.
Kenneth