Speed of cdec SCFG decoding

chen

Aug 4, 2012, 8:35:30 PM

to cdec-...@googlegroups.com

Hello,

I used Thrax to extract an SCFG from a corpus of about 1 million lines. It generated a rule file of about 10 million lines.

Then I fed this rule file to cdec to do translation, but decoding is very slow. I wonder whether I failed to set some parameters.

I noticed that cdec generates an initial forest with a very large number of paths, for example "Init. forest (paths): 2.21633e+234".

The following is my configuration file for cdec.

--------------------------

cubepruning_pop_limit=30
scfg_max_span_limit=15
feature_function=KLanguageModel xin_eng_fbis.tok.lower.order5.srilm.kenlm
feature_function=WordPenalty
add_pass_through_rules=true
grammar=final_rules
formalism=scfg

Chris Dyer

Aug 4, 2012, 9:01:40 PM

to cdec-...@googlegroups.com

Neither cdec nor Thrax does any filtering of the grammar, so this
combination by itself can lead to blow-ups in the size of the forest.
In particular, due to alignment errors, common "words" like the comma
symbol or words like "the" may have a very long tail of many thousands
of bad translations.

I typically filter the rules by the phrasal frequency p(e|f), keeping
only the top 30 or so rules for each source side. This will solve the problem.
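A minimal sketch of that kind of filter, in case it helps. It assumes each rule line looks like `[X] ||| source ||| target ||| name=value ...` and that the p(e|f) feature is stored under the name `p_e_given_f`; that name is a placeholder, not something Thrax or cdec guarantees. Check what your rule file actually calls the feature, and whether it holds a probability (higher is better) or a negative log probability (lower is better).

```python
import heapq
from collections import defaultdict


def parse_rule(line, feature_name):
    """Return (source_side, score) for one rule line.

    Assumes the layout 'LHS ||| source ||| target ||| name=value ...'.
    """
    fields = [f.strip() for f in line.split("|||")]
    source = fields[1]
    feats = dict(tok.split("=", 1) for tok in fields[3].split())
    return source, float(feats[feature_name])


def filter_top_k(lines, feature_name="p_e_given_f", k=30, higher_is_better=True):
    """Keep at most k rules per source side, ranked by the given feature.

    feature_name is a hypothetical placeholder; pass the name used in your grammar.
    """
    by_source = defaultdict(list)
    for idx, line in enumerate(lines):
        line = line.rstrip("\n")
        if not line:
            continue
        source, score = parse_rule(line, feature_name)
        key = score if higher_is_better else -score
        by_source[source].append((key, idx, line))

    kept = []
    for rules in by_source.values():
        kept.extend(heapq.nlargest(k, rules))

    # Restore the original file order of the surviving rules.
    kept.sort(key=lambda r: r[1])
    return [line for _, _, line in kept]
```

This version holds the whole grammar in memory, which may be uncomfortable for a 10-million-line file. If the rule file is sorted by source side, the same idea can be applied one source group at a time.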

Alternatively, you can use the suffix array grammar extractor
(https://github.com/redpony/cdec/blob/master/python/README.md) which
samples the rules proportional to p(e|f) with 300 samples in total.
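To illustrate why sampling keeps the long tail out of the grammar, here is a toy sketch. It is not the extractor's code: it just assumes that up to 300 occurrences of a source phrase are drawn uniformly from the corpus, so each translation shows up roughly in proportion to p(e|f). The function name and its inputs are made up for the example.

```python
import random
from collections import Counter


def sample_translations(occurrences, max_samples=300, seed=0):
    """Estimate p(e|f) from at most max_samples sampled occurrences.

    occurrences: list with one aligned target phrase per corpus occurrence
    of a single source phrase (hypothetical input for illustration).
    """
    rng = random.Random(seed)
    if len(occurrences) <= max_samples:
        sampled = list(occurrences)
    else:
        sampled = rng.sample(occurrences, max_samples)
    counts = Counter(sampled)
    total = len(sampled)
    return {target: n / total for target, n in counts.items()}
```

With a source phrase that has thousands of occurrences and many rare, misaligned translations, most of the rare ones never appear among the 300 samples, so they never become rules.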

-Chris

chen

Aug 5, 2012, 12:25:49 AM

to cdec-...@googlegroups.com, cd...@cs.cmu.edu

Thanks very much.

On Sunday, August 5, 2012 at 9:01:40 AM UTC+8, Chris Dyer wrote:
