cdec on moses generated grammar


jjm

Nov 28, 2010, 8:27:21 PM
to cdec users
Hi,
I can run cdec with grammar rules generated by joshua.
I get errors when I try to run cdec with grammars generated by
moses.
Can cdec be configured to use grammars generated by moses?


If not, should I convert the moses rule format to the joshua format?
Thanks,
John

Hieu Hoang

Nov 30, 2010, 7:57:05 PM
to cdec-...@googlegroups.com
i guess you could write a conversion routine yourself.

the main differences are:
1. moses uses probabilities. cdec/joshua/hiero use log probs.
2. the co-indexes for non-terminals are represented as alignments
in moses. They're embedded in the non-terms in the others.
3. moses has 2 non-term symbols for every non-term on the RHS 'cos
it implements SCFG as in Chiang's tutorial, p. 13. The others have 1
non-term symbol.
eg. the following are identical

moses: [X][A] a b [X] ||| c [X][A] d [B] ||| 0.5 ||| 0-1 ||| 43.3
cdec/jos/hiero: [B] ||| [A,1] a b ||| c [A,1] d ||| -0.693

if you do write something, please share it.
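
A minimal sketch of such a conversion routine, in Python, based only on the three differences and the single example rule above. The name `moses_to_cdec` is hypothetical, and several points are assumptions rather than documented behaviour: that the last token on each side of the moses rule is the left-hand side, that the target-side label is the one carried over (as in the example), that scores are converted with the natural log, and that trailing fields such as counts can be dropped. It has not been run against a real moses rule table.

```python
import math
import re

# A moses RHS non-terminal such as "[X][A]" (source label, target label).
NONTERM = re.compile(r"^\[([^\]]+)\]\[([^\]]+)\]$")
# A single bracketed label such as "[B]", used here for the left-hand side.
LHS = re.compile(r"^\[([^\]]+)\]$")


def render(tokens, index_of):
    """Rewrite "[X][A]" as "[A,i]" using the co-index chosen for that position."""
    out = []
    for pos, tok in enumerate(tokens):
        m = NONTERM.match(tok)
        out.append("[%s,%d]" % (m.group(2), index_of[pos]) if m else tok)
    return " ".join(out)


def moses_to_cdec(line):
    """Convert one moses hierarchical rule to the cdec/joshua/hiero layout."""
    fields = [f.strip() for f in line.split("|||")]
    src, tgt = fields[0].split(), fields[1].split()
    scores, links = fields[2].split(), fields[3].split()

    # Assumption: the last token of each side is the LHS; keep the target one.
    src_rhs, tgt_rhs = src[:-1], tgt[:-1]
    lhs = LHS.match(tgt[-1]).group(1)

    # Turn "s-t" alignment points on non-terminals into shared co-indexes,
    # numbered in source order. Alignments between terminals are ignored.
    src_index, tgt_index = {}, {}
    pairs = sorted(tuple(int(i) for i in a.split("-")) for a in links)
    for s, t in pairs:
        if NONTERM.match(src_rhs[s]):
            src_index[s] = tgt_index[t] = len(src_index) + 1

    # Probabilities become natural-log probabilities.
    log_scores = " ".join("%.3f" % math.log(float(p)) for p in scores)

    return " ||| ".join([
        "[%s]" % lhs,
        render(src_rhs, src_index),
        render(tgt_rhs, tgt_index),
        log_scores,
    ])


print(moses_to_cdec("[X][A] a b [X] ||| c [X][A] d [B] ||| 0.5 ||| 0-1 ||| 43.3"))
# prints "[B] ||| [A,1] a b ||| c [A,1] d ||| -0.693"
```

A non-terminal with no alignment point raises a `KeyError` here, and a probability of 0 raises a `ValueError` from `math.log`; a real converter would need to decide how to handle both.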

Chris Dyer

Dec 1, 2010, 1:49:54 AM
to cdec-...@googlegroups.com
Hi Hieu & John,
Hieu's right: the format is a bit different, so cdec can't read it. A
script to do the conversion wouldn't be terribly complicated. I would
be happy to add it to the codebase if you write something.
Best,
-Chris

John Morgan

Dec 2, 2010, 7:30:23 PM
to cdec-...@googlegroups.com
Hi Chris and Hieu,
Here's a script to convert a moses rule table to hiero format.
I used it to convert a 380MB rule table that had been filtered for a
131-segment test file. I ran cdec with it, and after translating 3
segments it died with the STDERR given in the other file e.
I've only tried it with a single non-terminal. I don't really have a
rule table with non-terminals decorated with linguistically motivated
syntactic categories.
Should the scores be given in negative log probs?
Could you test this and give me some feedback?
Thanks,
John


--
Regards,
John J Morgan

moses2hiero.pl

Chris Dyer

Dec 2, 2010, 7:42:33 PM
to cdec-...@googlegroups.com
Hi John,
The bad_alloc error usually means you've run out of memory. This could
just be a matter of your machine having limited memory, but it's
entirely possible that cdec might be doing something crazy given your
grammar. If you send me the grammar, I can have a look to see if
something strange is going on.
-C

Chris Dyer

Dec 2, 2010, 7:56:32 PM
to cdec-...@googlegroups.com
And yes, the scores should be in log space (whereas I believe moses
does a log transform when reading the grammars). Note that cdec
supports named parameters, e.g.:

[X] ||| word ||| other_word ||| EgivenF=-12.4 SomeOtherFeature=3

If you don't provide names, they are named PhraseModel_0 PhraseModel_1 ...
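
To illustrate the naming convention just described, here is a small Python sketch that reads a feature field and fills in default names. It is not cdec's parser: `parse_features` is a hypothetical helper, and numbering unnamed values in the order they appear (including when mixed with named ones) is an assumption.

```python
def parse_features(field):
    """Map a rule's feature field to {name: value}.

    Values written as Name=value keep their name; bare values are named
    PhraseModel_0, PhraseModel_1, ... in order of appearance (assumed).
    """
    features = {}
    unnamed = 0
    for token in field.split():
        name, sep, value = token.rpartition("=")
        if not sep:  # no "=" in the token, so it is a bare score
            name = "PhraseModel_%d" % unnamed
            unnamed += 1
        features[name] = float(value)
    return features


print(parse_features("EgivenF=-12.4 SomeOtherFeature=3"))
print(parse_features("-0.693 -1.2"))
```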

-Chris

John Morgan

Dec 2, 2010, 10:55:54 PM
to cdec-...@googlegroups.com
Chris,
Here's a chunk of the grammar.
Thanks,
John
scfg.hiero.gz

Chris Dyer

Dec 3, 2010, 8:26:06 AM
to cdec-...@googlegroups.com
Hi John,
Can you also send me the test sentence that it's failing on? It's
truncated in the output.
-C

John Morgan

Dec 3, 2010, 8:38:07 AM
to cdec-...@googlegroups.com
Explain the importance of early identification of patients at risk for
life-threatening illness or injury and the importance of early
intervention.
اهمیت شناسایی مقدم مریضان معروض به خطر امراض یا صدمات تهدید کننده
حیات و اهمیت مداخله مقدم در این نزد این مریضان را توضیح دهید. اوليه
توضيح در مورد اهميت شناسايى مبتلايان به زندگي در معرض تهديد و زخمي شدن
بيماري و اهميت اوايل گسيل دارد.

I only sent you a small chunk of the grammar.

Chris Dyer

Dec 3, 2010, 8:45:46 AM
to cdec-...@googlegroups.com
The grammar you sent is fine with this sentence, so cdec must be using
a lot of memory, which accounts for the failure you're seeing. One
thing to pay attention to is that the moses grammar may contain a very
long tail of possible translations for common "words", like "the" or
period (.). These arise because of noisy alignments, and sometimes
there can be many thousands of them. However, if you filter these by
p(e|f) (say, keep the top 30), you can massively improve performance
without sacrificing quality.
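
The filtering step could be sketched in Python roughly as follows. This is an illustration, not a tool from cdec or moses: `filter_top_k` and `feature_value` are hypothetical names, and it assumes the grammar is already in cdec format with log p(e|f) stored in a named feature (called `EgivenF` here, borrowing the name from the earlier example), so that larger values are better. It also holds the whole grammar in memory, which may not suit a very large rule table.

```python
import heapq
from collections import defaultdict


def feature_value(field, name):
    """Return the value of the feature called `name` in a Name=value field."""
    for token in field.split():
        key, sep, value = token.partition("=")
        if sep and key == name:
            return float(value)
    raise ValueError("feature %r not found in %r" % (name, field))


def filter_top_k(rules, k=30, name="EgivenF"):
    """Keep the k best rules per (LHS, source side), ranked by `name`."""
    by_source = defaultdict(list)
    for rule in rules:
        lhs, src, _tgt, feats = [f.strip() for f in rule.split("|||")][:4]
        by_source[(lhs, src)].append((feature_value(feats, name), rule))

    kept = []
    for group in by_source.values():
        # nlargest returns the highest-scoring entries, best first.
        kept.extend(rule for _, rule in heapq.nlargest(k, group))
    return kept


rules = [
    "[X] ||| the ||| le ||| EgivenF=-0.5",
    "[X] ||| the ||| la ||| EgivenF=-1.0",
    "[X] ||| the ||| , ||| EgivenF=-9.0",
    "[X] ||| cat ||| chat ||| EgivenF=-0.1",
]
for rule in filter_top_k(rules, k=2):
    print(rule)
```

With `k=2`, the low-scoring `the ||| ,` rule is dropped while both source sides keep their best translations.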

-Chris

John Morgan

Dec 3, 2010, 3:44:59 PM
to cdec-...@googlegroups.com
Chris,
Does the "named parameter" feature you described allow you to add the
information from the lexical translation tables lex.e2f and lex.f2e
files to the grammar? Would you want to do something like this?
jjm