Glue suggestion

Kenneth Heafield

Jun 19, 2013, 2:15:00 PM

to jane-...@googlegroups.com

Dear Jane,

Section 8.3.1 of your manual says "In the standard formulation of the hierarchical phrase-based translation model two additional rules are added:"

0 0 0 0 0 0 0 1 0 # S # X~0 # X~0 # 1 1 1 1 1
0 0 0 0 0 0 0 1 1 # S # S~0 X~1 # S~0 X~1 # 1 1 1 1 1

I ran Moses and cdec on the same Hiero model and, after accounting for slight differences in feature definition, was puzzled as to why Moses found better hypotheses with the same cube pruning pop limit. The difference came down to glue rule formulation.

The Jane manual and cdec effectively come with these glue rules:

S -> X
S -> S X
GOAL -> <s> S </s>

along with the hard-coded constraint that S -> X only applies for target-side position 0.

Moses and Joshua (after I converted Joshua to the Moses way) come with these glue rules:

S -> <s>
S -> S X
GOAL -> S </s>

where <s> and </s> are first-class words like any other. This also means that Moses charges two extra word penalties (<s> and </s>) and one extra glue rule application; I have subtracted these out of the model score for purposes of comparison.
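
The bookkeeping behind that correction can be sketched as follows. This is a minimal illustration, not code from Moses, cdec, or Jane: the function names and the per-word and per-glue-rule costs (`word_cost`, `glue_cost`, meaning the weighted feature contribution of one target word and of one glue rule application) are hypothetical placeholders.

```python
def rule_applications(num_x, formulation):
    """Count the top-level rule applications needed to glue num_x X constituents.

    Rule names follow the two grammars listed above; nothing here is read
    from an actual decoder.
    """
    if num_x < 1:
        raise ValueError("need at least one X constituent")
    if formulation == "jane_cdec":
        # S -> X once, S -> S X for each remaining X, GOAL -> <s> S </s> once.
        return {"S -> X": 1, "S -> S X": num_x - 1, "GOAL": 1}
    if formulation == "moses":
        # S -> <s> once, S -> S X for every X, GOAL -> S </s> once.
        return {"S -> <s>": 1, "S -> S X": num_x, "GOAL": 1}
    raise ValueError("unknown formulation: " + formulation)


def comparable_score(moses_score, word_cost, glue_cost):
    """Remove what Moses charges on top of Jane/cdec: two words and one glue rule.

    word_cost and glue_cost are assumed to be the already-weighted score
    contributions of one target word and one glue rule application.
    """
    return moses_score - (2 * word_cost + glue_cost)


if __name__ == "__main__":
    for n in (1, 2, 5):
        extra = (sum(rule_applications(n, "moses").values())
                 - sum(rule_applications(n, "jane_cdec").values()))
        print(n, "X constituents: Moses applies", extra, "extra rule")
```

Whatever the number of X constituents, the difference stays at one rule application and two words, which is why a constant correction per sentence is enough.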

In the Moses formulation, glued hypotheses know that they are bound to the beginning of sentence <s>. Therefore, their left language model state is empty and more hypotheses can recombine. This is better than Jane/cdec, where hypotheses are informed about <s> and </s> only after forming a complete sentence, so the left states are spuriously ambiguous. Moreover, hypothesis score estimates are more accurate when they know about <s> earlier.
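
The recombination effect can be illustrated with a toy sketch. It assumes a trigram language model and a simplified notion of state (the first and last N-1 words of a hypothesis); real decoders keep more compact states, and the helper names and example hypotheses here are made up for illustration only.

```python
ORDER = 3  # assumed language model order (trigram)


def lm_state(words, anchored_to_bos, order=ORDER):
    """Return a simplified (left, right) language model state.

    If the hypothesis is known to start at <s>, no unseen left context can
    still change the score of its first words, so the left state is empty.
    Otherwise the first order-1 words must be kept for later rescoring.
    """
    ctx = order - 1
    right = tuple(words[-ctx:]) if ctx else ()
    left = () if anchored_to_bos or not ctx else tuple(words[:ctx])
    return (left, right)


def survivors(hypotheses, anchored_to_bos):
    """Number of hypotheses left after recombining those with equal state."""
    return len({lm_state(h, anchored_to_bos) for h in hypotheses})


if __name__ == "__main__":
    # Four made-up S hypotheses covering the same source span.
    hyps = [
        "the cat sat on the mat".split(),
        "a cat sat on the mat".split(),
        "the dog sat on the mat".split(),
        "the cat lay on a mat".split(),
    ]
    print("Jane/cdec-style S (left state kept):", survivors(hyps, False))
    print("Moses-style S (bound to <s>):", survivors(hyps, True))
```

In this toy case the unanchored hypotheses all stay distinct because their first two words or last two words differ, while the anchored ones collapse to the two distinct right states, leaving more of the pop limit for hypotheses that actually differ going forward.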

To confirm this intuition, I convinced cdec to run with the Moses glue grammar. It improved from the blue curve to the green curve that you see in the attached plot. The pop limits range from 25 to 1000.

Kenneth