speed of the cdec SCFG decoding


chen

Aug 4, 2012, 8:35:30 PM
to cdec-...@googlegroups.com
Hello,

I used Thrax to extract an SCFG from a corpus of about 1 million lines. It generated a rule file of about 10 million lines.
I then fed this rule file to cdec for translation, but decoding is very slow. I wonder whether I failed to set some parameters.
I noticed that cdec generates a very large initial forest; for example, the log reports "Init. forest       (paths): 2.21633e+234".
 
The following is my configuration file for cdec.
--------------------------
cubepruning_pop_limit=30
scfg_max_span_limit=15
feature_function=KLanguageModel xin_eng_fbis.tok.lower.order5.srilm.kenlm
feature_function=WordPenalty
add_pass_through_rules=true
grammar=final_rules
formalism=scfg

Chris Dyer

Aug 4, 2012, 9:01:40 PM
to cdec-...@googlegroups.com
Neither cdec nor Thrax does any filtering of the grammar, so this
combination by itself can lead to blow-ups in the size of the forest.
In particular, due to alignment errors, common "words" like the comma
symbol or words like "the" may have a very long tail of many thousands
of bad translations.

I typically filter the rules by the phrasal frequency p(e|f), keeping
the top 30 or so rules. This will solve the problem.
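
A minimal sketch of this kind of filtering, for illustration only (it is not a script shipped with cdec or Thrax). It assumes the rule file uses the `[X] ||| source ||| target ||| name=value ...` layout, that rules are grouped by left-hand side and source side, and that the p(e|f) score is stored under a feature name you pass in. The name `EgivenF` used in the usage note is hypothetical, and whether the stored value is a cost (lower is better) or a probability (higher is better) depends on how the grammar was extracted, so it is left as a parameter.

```python
import heapq
from collections import defaultdict


def parse_rule(line):
    """Split one rule line into (lhs, source, feature dict).

    Assumes the layout '[X] ||| source ||| target ||| name=value ...'.
    """
    fields = [f.strip() for f in line.rstrip("\n").split("|||")]
    lhs, src, feats = fields[0], fields[1], fields[3]
    scores = {}
    for item in feats.split():
        # rpartition keeps any '=' inside the feature name intact.
        name, _, value = item.rpartition("=")
        scores[name] = float(value)
    return lhs, src, scores


def filter_top_k(lines, feature, k=30, lower_is_better=True):
    """Keep at most k rules per (lhs, source), ranked by `feature`."""
    by_source = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        lhs, src, scores = parse_rule(line)
        by_source[(lhs, src)].append((scores[feature], line))

    # Assumption: a cost-like score is ranked ascending, a probability descending.
    pick = heapq.nsmallest if lower_is_better else heapq.nlargest
    kept = []
    for group in by_source.values():
        kept.extend(line for _, line in pick(k, group, key=lambda p: p[0]))
    return kept
```

Calling `filter_top_k(open("final_rules"), "EgivenF", k=30)` would return the surviving rule lines. This version holds the whole grammar in memory, which may be too much for a 10-million-line file; if the file is sorted by source side, the same ranking can be applied one group at a time while streaming.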

Alternatively, you can use the suffix array grammar extractor
(https://github.com/redpony/cdec/blob/master/python/README.md) which
samples the rules proportional to p(e|f) with 300 samples in total.

-Chris

chen

Aug 5, 2012, 12:25:49 AM
to cdec-...@googlegroups.com, cd...@cs.cmu.edu
Thanks very much.

On Sunday, August 5, 2012 at 9:01:40 AM UTC+8, Chris Dyer wrote: