What kind of alignment should be used for SCFG extraction?

chen

Aug 6, 2012, 9:52:57 PM

to cdec-...@googlegroups.com

Hi,

The Moses phrase-based translation pipeline generates three kinds of alignment file: f-e, e-f, and the intersection of f-e and e-f. I wonder which one should be used for SCFG extraction.

I used the intersection of f-e and e-f, but the result is not satisfactory, so I think I may have used the wrong alignment file.

Chris Dyer

Aug 7, 2012, 12:38:03 AM

to cdec-...@googlegroups.com

For SCFG extraction, the usual approach is to use one of the alignment
symmetrization heuristics proposed by Franz Och. Typically I have used
the one called "grow-diag-final-and", which the Moses tools can
produce.

Alternatively, if you have f-e and e-f alignment files, you can do:

utils/atools -i file.e-f -c invert | utils/atools -i - -j file.f-e -c grow-diag-final-and > file.gdfa
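
For readers unfamiliar with what this pipeline does, here is a minimal Python sketch of the invert step and of the two extremes that symmetrization heuristics work between. It assumes the alignment files hold one line per sentence pair in the common `i-j` link format; the helper names and the example links are made up for illustration and are not part of atools. grow-diag-final-and itself is not implemented here: it starts from the intersection and adds selected links from the union, so its output lies between the two sets shown.

```python
def parse_alignment(line):
    """Parse a line of 'i-j' links into a set of (i, j) integer pairs."""
    return {tuple(int(x) for x in link.split("-")) for link in line.split()}


def invert(links):
    """Swap the two sides of every link (what an 'invert' step does)."""
    return {(j, i) for (i, j) in links}


def format_alignment(links):
    """Write links back out in sorted 'i-j' form."""
    return " ".join(f"{i}-{j}" for i, j in sorted(links))


# Hypothetical one-sentence example; real files have one such line per pair.
f_e = parse_alignment("0-0 1-1 2-1")
e_f = parse_alignment("0-0 1-1 3-2")  # stored with the sides swapped

e_f_inverted = invert(e_f)            # now indexed the same way as f_e
intersection = f_e & e_f_inverted     # links both directions agree on
union = f_e | e_f_inverted            # links proposed by either direction

print(format_alignment(intersection))
print(format_alignment(union))
```

The intersection keeps only high-precision links, which tends to leave many words unaligned; heuristics such as grow-diag-final-and recover some of the links that only one direction proposed.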

chen

Aug 7, 2012, 1:49:59 AM

to cdec-...@googlegroups.com, cd...@cs.cmu.edu

Thanks for your reply. I have another question, about the SCFG rule format. The rules I got from Thrax look like the following:

[X] ||| [X,1] 科学家 [X,2] ||| [X,1] , scientists [X,2] ||| Lex(e|f)=1.41634 Lex(f|e)=13.76438 PhrasePenalty=2.71800 p(e|f)=5.10958 p(f|e)=2.96527
[X] ||| [X,1] 科学家 [X,2] ||| [X,1] dissident scientist [X,2] ||| Lex(e|f)=2.06032 Lex(f|e)=2.40940 PhrasePenalty=2.71800 p(e|f)=6.02587 p(f|e)=-0.00000
[X] ||| [X,1] 科学家 [X,2] ||| [X,1] [X,2] of scientists ||| Lex(e|f)=0.73386 Lex(f|e)=1.32741 PhrasePenalty=2.71800 p(e|f)=5.62040 p(f|e)=1.29928
[X] ||| [X,1] 科学家 [X,2] ||| [X,1] science [X,2] ||| Lex(e|f)=4.57600 Lex(f|e)=6.91200 PhrasePenalty=2.71800 p(e|f)=5.33272 p(f|e)=7.26525
[X] ||| [X,1] 科学家 [X,2] ||| [X,1] scientist [X,2] ||| Lex(e|f)=2.06032 Lex(f|e)=0.72300 PhrasePenalty=2.71800 p(e|f)=2.03688 p(f|e)=0.40547
[X] ||| [X,1] 科学家 [X,2] ||| [X,1] scientists to [X,2] ||| Lex(e|f)=0.73386 Lex(f|e)=2.20514 PhrasePenalty=2.71800 p(e|f)=3.13549 p(f|e)=0.7472

I noticed that in the cdec sample, the rule format is:

[X] ||| [X,1] 科学家 [X,2] ||| [X,1] of researchers , [X,2] ||| -2.47712135315 -1.83505606651 -4.26377058029 2.71828182845905
[X] ||| [X,1] 科学家 [X,2] ||| [X,1] [X,2] by a scientist ||| -2.47712135315 -0.178537979722 -4.84308099747 2.71828182845905
[X] ||| [X,1] 科学家 [X,2] ||| [X,1] [X,2] of scientists ||| -2.47712135315 -0.294755220413 -1.10121524334 2.71828182845905
I simply deleted the feature labels ('Lex(e|f)=', 'Lex(f|e)=', and so on), moved the 'PhrasePenalty' column to the end, and then fed the rule file to dpmert with the following configuration:

--------------------cdec.ini-------------------------------

cubepruning_pop_limit=30
density_prune=100
scfg_max_span_limit=15
feature_function=KLanguageModel xin_eng_fbis.tok.lower.order5.srilm.kenlm
feature_function=WordPenalty
add_pass_through_rules=true
grammar=final_rules

formalism=scfg

---------------the initial weights file----------------

WordPenalty 0.2
PassThrough 0.2
LanguageModel 0.2
LanguageModel_OOV 0.2
PhraseModel_0 0.1
PhraseModel_1 0.1
PhraseModel_2 0.1

MERT tuning ran for 9 iterations. However, the translation results are very bad. Do you have any idea what the problem could be?
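
The relabelling step described in this message (drop the feature names, move PhrasePenalty to the end) can be sketched in Python as follows. This is only an illustration of that manual edit, assuming fields are separated by ` ||| ` and features are written as `name=value`; the function name is hypothetical, and nothing here is claimed to be the conversion cdec or Thrax expects.

```python
def strip_feature_names(rule, last="PhrasePenalty"):
    """Turn 'name=value' features into bare values, moving `last` to the end."""
    lhs, source, target, features = rule.split(" ||| ")
    # rsplit on the final '=' so names such as 'Lex(e|f)' stay intact.
    pairs = [token.rsplit("=", 1) for token in features.split()]
    values = [value for name, value in pairs if name != last]
    values += [value for name, value in pairs if name == last]
    return " ||| ".join([lhs, source, target, " ".join(values)])


thrax_rule = (
    "[X] ||| [X,1] 科学家 [X,2] ||| [X,1] scientist [X,2] ||| "
    "Lex(e|f)=2.06032 Lex(f|e)=0.72300 PhrasePenalty=2.71800 "
    "p(e|f)=2.03688 p(f|e)=0.40547"
)
print(strip_feature_names(thrax_rule))
```

Note that a rule converted this way carries five unnamed values, in the order Lex(e|f), Lex(f|e), p(e|f), p(f|e), PhrasePenalty.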

On Tuesday, August 7, 2012 at 12:38:03 PM UTC+8, Chris Dyer wrote:

Chris Dyer

Aug 7, 2012, 1:17:27 PM

to cdec-...@googlegroups.com

The suffix array rule extractor does not produce "named" rules, so
when cdec reads the file it assigns them default names
PhraseModel_0
PhraseModel_1
PhraseModel_2 ...
So you'll need to change the initial weights you give to cdec to
these. We'll be adding support to the suffix array extractor to make
it behave more like Thrax does fairly soon, but unfortunately the
behavior is a little divergent right now.

-Chris
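
As a rough illustration of the naming scheme described above, the Python sketch below lists one weight line per unnamed feature value in a rule. It assumes the default names are assigned by position starting from zero, as the list in the reply suggests; the function name is hypothetical and the 0.1 starting value is simply the one used earlier in this thread.

```python
def default_weight_lines(rule, initial=0.1):
    """Build one 'PhraseModel_i value' line per unnamed feature in the rule."""
    features = rule.split(" ||| ")[3].split()
    return [f"PhraseModel_{i} {initial}" for i in range(len(features))]


sample_rule = (
    "[X] ||| [X,1] 科学家 [X,2] ||| [X,1] [X,2] of scientists ||| "
    "-2.47712135315 -0.294755220413 -1.10121524334 2.71828182845905"
)
for line in default_weight_lines(sample_rule):
    print(line)
```

Under this assumption, a grammar whose rules carry N unnamed values needs N such entries in the initial weights file, so it is worth checking that the count matches.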

chen

Aug 8, 2012, 3:51:29 AM

to cdec-...@googlegroups.com, cd...@cs.cmu.edu

Thanks very much.

Another question: does cdec support multi-threaded decoding?

On Wednesday, August 8, 2012 at 1:17:27 AM UTC+8, Chris Dyer wrote:

Chris Dyer

Aug 8, 2012, 11:38:08 AM

to cdec-...@googlegroups.com

No, cdec is not multi-threaded. But, if you use per-sentence grammars
you can run multiple cdecs concurrently with very little extra
overhead. This is possible because KenLM uses mmap to access its
language models (so all language models are represented once in a
machine's physical memory).
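
One way to run several decoder processes side by side is sketched below in Python. It splits the input into contiguous shards, pipes each shard through its own process, and joins the outputs in the original order. Everything here is an assumption made for illustration: the function names are hypothetical, the decoder is assumed to read one sentence per line on stdin and write one translation per line on stdout, and a small stand-in command is used instead of a real cdec invocation.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def split_contiguous(items, parts):
    """Split items into at most `parts` contiguous, near-equal chunks."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for k in range(parts):
        end = start + size + (1 if k < extra else 0)
        if end > start:
            chunks.append(items[start:end])
        start = end
    return chunks


def run_shard(command, lines):
    """Feed one shard to a separate process and collect its output lines."""
    result = subprocess.run(
        command,
        input="\n".join(lines) + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout.splitlines()


def decode_parallel(command, sentences, jobs):
    """Run `jobs` processes concurrently; output order matches input order."""
    shards = split_contiguous(sentences, jobs)
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        outputs = pool.map(lambda shard: run_shard(command, shard), shards)
        return [line for shard_output in outputs for line in shard_output]


# Stand-in for the decoder: upper-cases each input line. Replace it with
# whatever command line you normally use to run cdec on stdin.
STAND_IN = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())",
]
print(decode_parallel(STAND_IN, ["a b", "c d", "e f"], jobs=2))
```

Threads are used only to wait on the child processes, so the actual decoding work still happens in separate processes, which is the setup the reply describes.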

Victor Chahuneau

Aug 8, 2012, 7:27:14 PM

to cdec-...@googlegroups.com

Hi Chen,

If you pull the latest version of cdec (some very recent changes have been made) and install pycdec, you can use the following script to translate text with multiple processes:

https://gist.github.com/3299670

Note that you should choose the number of jobs (-j) carefully, depending on the machine you are using. By default it is set to the number of available cores.

chen

Aug 9, 2012, 11:16:16 PM

to cdec-...@googlegroups.com

Thanks very much.

On Thursday, August 9, 2012 at 7:27:14 AM UTC+8, Victor Chahuneau wrote:
