Re: [joshua-support] unsupervised sentential paraphrasing?


Matt Post

Jul 18, 2014, 2:15:22 PM
to joshua_...@googlegroups.com
Hi He,

Joshua can use the PPDB files as decoding grammars; the grammars are actually provided in Joshua's format.

Currently, here's what you'd need to do:

1. Download and install Joshua
2. Build your own English language model separately
3. Tune a Joshua system against a dev set, using whichever paraphrase model you downloaded and your language model. You could do this with a Joshua pipeline invocation like the one below, assuming the dev set files are named "inputs/dev.in" and "inputs/dev.out":

$JOSHUA/bin/pipeline.pl \
  --source in \
  --target out \
  --grammar /path/to/your/grammar \
  --no-filter-tm \
  --lmfile /path/to/your/english/lm \
  --first-step tune \
  --last-step tune \
  --tune inputs/dev \
  --joshua-mem 20g \
  --threads 4

When that's finished, you'll have a model. I can help you bundle the model and show you how to use it when you've reached that point.

matt



On Jul 16, 2014, at 11:11 PM, hehe.r...@gmail.com wrote:

Hi,

I'm trying to use Joshua to generate sentential paraphrases for machine translation references. I don't need compression, only paraphrasing, especially of the verbs. Do I have to build a dev/tune set first? Can Joshua directly take the PPDB and output the paraphrases? Thanks a lot!

He

--
You received this message because you are subscribed to the Google Groups "Joshua Technical Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to joshua_suppor...@googlegroups.com.
To post to this group, send email to joshua_...@googlegroups.com.
Visit this group at http://groups.google.com/group/joshua_support.
For more options, visit https://groups.google.com/d/optout.

deepak.s...@gmail.com

Dec 8, 2015, 12:01:04 PM
to Joshua Technical Support
Hi Matt,

I'm trying to use PPDB with Joshua for a paraphrasing requirement in my project. I have installed Joshua and got it to work with the Spanish to English language pack. Now, I want to load the PPDB translation model and test out the text to text generation capabilities.

You outlined a procedure earlier in this thread, and another thread provides an approach with helper files. Both threads are more than a year old. Could you tell me whether these approaches still hold, or should I be trying something else? Any pointers would be a great help. Thanks a lot!!

Cheers,
Deepak.

Matt Post

Dec 9, 2015, 9:51:51 AM
to joshua_...@googlegroups.com
Hi Deepak,

I'm not sure what has changed exactly, but here's what you would want to do:

- Find a language model you want to use
- You need to produce a Joshua config file. You can use the attached one as a template, subbing out your PPDB grammar and the LM file. But really, you need to tune the model parameters to whatever task you're trying to paraphrase for
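For reference, a stripped-down joshua.config might look roughly like the sketch below. Treat it as a sketch only: the key names are my recollection of the Joshua 5-era syntax and they vary across versions, so defer to the attached file for the authoritative layout.

```
# Hypothetical joshua.config sketch -- substitute your own paths.
# (Key syntax differs between Joshua versions; the attached file is authoritative.)
lm = kenlm 5 false false 100 /path/to/your/english/lm
tm = thrax pt 20 /path/to/your/ppdb/grammar
tm = thrax glue -1 /path/to/glue.grammar
mark-oovs = false
top-n = 1
```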

I would really like to release PPDB "language packs" (along the lines of Joshua's language packs) so that people could just run them as black boxes, but I haven't had time to put them together.

matt

joshua.config

deepak.s...@gmail.com

Dec 9, 2015, 11:45:27 AM
to Joshua Technical Support
Hi Matt, thanks for responding.

I will try the steps you've mentioned. 
A couple of questions:

1. In the joshua.config file, I see a grammar.glue file being used. Where do I get that?

2. I'm attempting sentence compression (English), much like what Juri has outlined in his paper. I'm new to this field and am finding it difficult to get data to tune the model parameters. The paper says:

"Beginning with 9570 tuples of parallel English–English sentences obtained from multiple reference translations for machine translation evaluation, we construct a parallel compression corpus by selecting the longest reference in each tuple as the source sentence and the shortest reference as the target sentence. We further retain only those sentence pairs where the compression rate cr falls in the range 0.5 < cr ≤ 0.8. From these, we randomly select 936 sentences for the development set, as well as 560 sentences for a test set that we use to gauge the performance of our system."
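If I understand it correctly, the selection step would amount to something like this (my own sketch, not code from the paper; it covers the longest/shortest selection and the cr filter, but not the random dev/test split):

```python
# Sketch of the corpus construction described in the quoted passage: from each
# tuple of reference translations, take the longest as the source sentence and
# the shortest as the target sentence, then keep only pairs whose compression
# rate cr = len(target) / len(source) (in words) satisfies 0.5 < cr <= 0.8.

def build_compression_corpus(tuples, lo=0.5, hi=0.8):
    pairs = []
    for refs in tuples:
        source = max(refs, key=lambda s: len(s.split()))
        target = min(refs, key=lambda s: len(s.split()))
        cr = len(target.split()) / len(source.split())
        if lo < cr <= hi:
            pairs.append((source, target))
    return pairs
```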

Should I do something similar to get this tuning data set? Please let me know if this data set is available somewhere for me to use. If you think I should try some other approach, please let me know. 

Thanks a lot for your time!
Cheers,
Deepak.

Matt Post

Dec 9, 2015, 12:01:18 PM
to joshua_...@googlegroups.com
I just created a script to generate the glue grammar:

$JOSHUA/scripts/support/create_glue_grammar.sh /path/to/your/grammar

But you don't need to do this if you use the pipeline script.

For the sentence compression data, you will want to email Juri (and perhaps his coauthors --- or perhaps he'll respond here) for a pointer to the data. Then you can use Joshua like this:

$JOSHUA/bin/pipeline.pl --rundir 1 \
--source en1 --target en --type samt \
--first-step tune --last-step tune \
--grammar /path/to/PPDB --lmfile /path/to/your/LM \
--tune /path/to/tuning/set/prefix \
--no-filter-tm --joshua-mem 16GB

I think that's it, but there might be more; try that and see how it goes. (Here the source language is set to en1, since the language extensions are appended to your tuning set prefix, so you can't have en for both sides.)
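To make the prefix convention concrete, here is a small illustration (my own helper, not part of Joshua) of how the tuning files would be laid out for a source extension of en1 and a target extension of en:

```python
# Illustration of the tuning-set layout Joshua's pipeline expects: the
# source/target extensions are appended to the --tune prefix, so a prefix
# "dev" with source "en1" and target "en" names the pair dev.en1 / dev.en.
# (write_tuning_pair is a hypothetical helper, not a Joshua function.)
import os
import tempfile

def write_tuning_pair(prefix, sources, targets, src_ext="en1", tgt_ext="en"):
    src_path = f"{prefix}.{src_ext}"  # paraphrase inputs, one sentence per line
    tgt_path = f"{prefix}.{tgt_ext}"  # references, line-aligned with the inputs
    with open(src_path, "w") as f:
        f.write("\n".join(sources) + "\n")
    with open(tgt_path, "w") as f:
        f.write("\n".join(targets) + "\n")
    return src_path, tgt_path

workdir = tempfile.mkdtemp()
src, tgt = write_tuning_pair(os.path.join(workdir, "dev"),
                             ["the long source sentence ."],
                             ["the shorter reference ."])
```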

When tuning is done, you'll have a model file which you can pack. I can help you when you get there. That lets you then easily use Joshua as a black box.

matt


deepak.s...@gmail.com

Dec 10, 2015, 3:34:21 AM
to Joshua Technical Support
Thanks, Matt, for the script to generate the glue grammar. I have emailed Juri (cc'ing the co-authors) about the sentence compression data set; I hope to hear from him. Meanwhile, I'm reading up more on the theory behind Joshua and PPDB. I will keep you posted.

Cheers,
Deepak.