Pialign questions

Baskaran Sankaran

Jun 27, 2013, 10:57:10 PM
to pialig...@googlegroups.com
Hi Graham/group, 

I ran inference to collect multiple samples, and I intend to combine them using the averaging trick mentioned in Graham's paper. I gather that I can use the mergephrases.pl script to combine the .pt files from different samples. However, this doesn't seem to include the lexical probabilities in the consolidated phrase table.

In the same context, I notice that the lexical probabilities of the same rule differ across samples, presumably because they come from different sample states. I was wondering how to handle this.

Secondly, I am planning to use the '-noqueue' option, which appears to force exhaustive search instead of beam search, so I expect inference to be slower than the beam version. Does anyone have an estimate of how much slower this becomes?

Finally, I see that pialign takes the following options: '-domh' (for doing a Metropolis-Hastings rejection step), '-noshuffle' and '-noremnull'. These options sound interesting, and I would like to know a bit more about them to see whether I could use any of them.

Thanks
- Baskaran

Graham Neubig

Jun 27, 2013, 11:38:36 PM
to pialig...@googlegroups.com
Hi Baskaran,

Thanks for the mail and your interest in pialign! To answer your questions:

1) I actually don't 100% remember, but I think for the lexicalized tables with multiple samples I simply used "cat" to concatenate the .samp files, and then ran itgstats.pl to calculate the reordering probabilities from the combined file.

2) The -noqueue option doesn't avoid beam search; it just uses a different data structure during search. The results should be the same with or without this option (although they will differ slightly due to the order of sampling). If you want to run an exhaustive search, you can set -probwidth to zero. In that case the time required grows as O(n^6) in the length of the sentence (see the back-of-envelope numbers below). This may be feasible for sentences of up to 10 words or so, but anything longer will take forever.

3) As pialign's search is approximate, it is not sampling directly from the true probability distribution. One way to fix this is to perform a Metropolis-Hastings rejection step after parsing (see "Bayesian Inference for PCFGs via Markov Chain Monte Carlo"; a sketch of the accept/reject step is below). However, I have found that this actually hurts accuracy somewhat, so it is not enabled by default.

4) pialign usually performs training by sampling sentences in random order. With -noshuffle you can sample sentences in corpus order instead. I haven't found a huge difference in accuracy either way, particularly over many iterations, but shuffling usually helps accuracy a little when the number of iterations is small.

5) -noremnull was inspired by Section 4.3 of the paper "Sampling Alignment Structure under a Bayesian Translation Model"; it causes the model not to remember null alignments, which prevents common words from being aligned to "null" all the time. I didn't see a big difference in accuracy either way when using this option, though.
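
To put the O(n^6) growth in 2) in perspective, here's a quick back-of-envelope comparison (Python; relative cost only, since the constant factor depends on the implementation):

    # Relative cost of exhaustive ITG parsing, which grows as O(n^6)
    # in sentence length n.
    for n in (10, 20, 40):
        print(f"n={n:2d}: n^6 = {n**6:>13,} ({(n / 10) ** 6:,.0f}x the n=10 cost)")

A 40-word sentence is roughly 4,000 times as expensive as a 10-word one, which is why exhaustive search is only practical for short sentences.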
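
For reference, the rejection step in 3) treats the approximate search as a proposal distribution q and accepts a new sample with probability min(1, p(new)q(old) / p(old)q(new)), where p is the true model probability. A minimal sketch in Python (the function and argument names are hypothetical, not pialign's internals):

    import math
    import random

    def mh_accept(logp_new, logq_new, logp_old, logq_old):
        # Standard Metropolis-Hastings acceptance test, in log space:
        # accept with probability min(1, p(new)/p(old) * q(old)/q(new)).
        log_ratio = (logp_new - logp_old) + (logq_old - logq_new)
        return log_ratio >= 0 or random.random() < math.exp(log_ratio)

If the new sample is rejected, the sampler simply keeps the previous tree for that sentence.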

Graham

Baskaran Sankaran

Jun 28, 2013, 1:41:30 AM
to pialig...@googlegroups.com
Thanks for the detailed reply, Graham. I have one more question about combining multiple samples.

I was actually asking about the lexical weighting probabilities that are part of the phrase table (columns 5 and 6 in the pt). When I use the mergephrases.pl script, it only averages column 3 (the joint probability of e and f) and column 4 (the average posterior probability of the span), and also recomputes columns 1 and 2 (the conditional probabilities p(e|f) and p(f|e)). However, the script completely ignores the lexical weighting probabilities in columns 5 and 6.
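
For concreteness, here is a minimal Python sketch of that recomputation as I understand it, covering columns 1-3 only (the " ||| " field separator, and treating a pair missing from a sample as having probability zero, are my assumptions, not necessarily what mergephrases.pl does):

    from collections import defaultdict

    def merge_phrase_tables(paths):
        # Average the joint probability p(e,f) (column 3) of each phrase
        # pair over all sample tables; pairs absent from a sample
        # contribute zero.
        joint = defaultdict(float)
        for path in paths:
            with open(path) as table:
                for line in table:
                    src, trg, scores = line.split(" ||| ")[:3]
                    joint[(src, trg)] += float(scores.split()[2])
        joint = {pair: p / len(paths) for pair, p in joint.items()}

        # Recompute the conditionals from the averaged joint:
        # p(e|f) = p(e,f) / sum_e' p(e',f), p(f|e) = p(e,f) / sum_f' p(e,f')
        f_marg = defaultdict(float)
        e_marg = defaultdict(float)
        for (src, trg), p in joint.items():
            f_marg[src] += p
            e_marg[trg] += p
        return {(src, trg): (p / f_marg[src], p / e_marg[trg], p)
                for (src, trg), p in joint.items()}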

When you say I can run itgstats.pl on the concatenated pt, is that for the lexicalized reordering model or for the lexical weighting (which is part of the TM)?

Cheers
- Baskaran

Graham Neubig

Jun 28, 2013, 4:07:23 AM
to pialig...@googlegroups.com
Hi Baskaran,

Sorry, I misinterpreted your question. When I mentioned itgstats.pl I was talking about the lexicalized reordering model.

The mergephrases.pl script was indeed ignoring the lexical weighting probabilities (a holdover from the days when pialign didn't directly output lexical weighting probabilities and I added them with a postprocessing script). I've uploaded a new version to GitHub that should do what you want, so please take a look.

Graham



Baskaran Sankaran

Jun 28, 2013, 4:04:39 PM
to pialig...@googlegroups.com
Thanks for the code.

Hmm, so for the lexical weighting you seem to simply copy the values from the existing samples. Thus, for a rule sampled more than once, you just take its estimate from the latest iteration (assuming the user specifies the files in the same order as they were sampled).

Do you think it is possible to recompute the lexical weights for all the rules? That way the estimates would be more consistent.

- Baskaran
Graham Neubig

Jun 28, 2013, 9:22:33 PM
to pialig...@googlegroups.com
Hi Baskaran,

The lexical weights are calculated directly from Model 1 probabilities, so they should be identical for every iteration. They seemed to be identical for every iteration in my version of pialign, but if you can find an example where they are not, that is a bug and should be fixed.
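
Concretely, the score depends only on a translation table that is fixed before sampling starts, so the sampler state cannot affect it. A minimal Python sketch of a Model 1 phrase score of this kind (the NULL handling and names here are illustrative assumptions, not pialign's exact code):

    def model1_lex_score(e_words, f_words, t):
        # Model 1 probability of the target phrase e given the source
        # phrase f:
        #   p(e|f) = prod_i [ t(e_i|NULL) + sum_j t(e_i|f_j) ] / (|f| + 1)
        # `t` maps (e_word, f_word) -> a fixed translation probability,
        # so the result depends only on the phrase pair itself.
        score = 1.0
        for e in e_words:
            total = t.get((e, None), 0.0)  # None stands in for NULL
            for f in f_words:
                total += t.get((e, f), 0.0)
            score *= total / (len(f_words) + 1)
        return score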

Graham

Baskaran Sankaran

Jun 28, 2013, 11:55:10 PM
to pialig...@googlegroups.com
Oh, OK. I checked again and they are indeed the same. I must have compared numbers in different fields of the pt file. Thanks for the clarification.

- Baskaran