script for merging language models


Luke Orland

Mar 19, 2013, 7:07:46 PM
to joshua_d...@googlegroups.com
We are in the process of adding a new script to the Joshua toolkit that uses some SRILM tools to combine multiple ARPA-formatted LMs into a single LM.  Each input language model is assigned an interpolation weight based on perplexity measurements on development text.

Matt Post gained 0.7 BLEU points on an experiment by merging several LMs.

The script is currently at https://github.com/lukeorland/joshua/blob/feature.merge_lms/scripts/support/merge_lms.py (click on the "Raw" button to get the download URL). Soon we will add options to pipeline.pl for using this script on all the language models included with the --lmfile option.

Call it with the path to each input language model, the path to the development text, and the path where the merged LM should be written. If your SRILM tools are not compiled to $SRILM/bin/i686-m64 in your environment, you must specify the alternative path with the --srilm-bin option.
The merging process needs enough RAM to hold the resulting merged LM in memory.
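
As a rough illustration of the argument order described above, here is a small Python sketch that assembles a command line for the script. The helper name build_merge_command and all file paths are made up for this example; only the positional arguments and the --temp-dir and --srilm-bin options come from the script's usage message.

```python
import shlex


def build_merge_command(input_lms, dev_text, merged_lm_path,
                        srilm_bin=None, temp_dir=None,
                        script="./merge_lms.py"):
    """Return the argument list for one merge_lms.py invocation.

    Hypothetical helper: it only arranges arguments in the order shown
    in the usage message and does not run anything.
    """
    if not input_lms:
        raise ValueError("at least one input LM path is required")
    cmd = [script]
    # Optional arguments go first so they cannot be mistaken for LM paths.
    if temp_dir is not None:
        cmd += ["--temp-dir", temp_dir]
    if srilm_bin is not None:
        cmd += ["--srilm-bin", srilm_bin]
    # Positional arguments: all input LMs, then dev text, then output path.
    cmd += list(input_lms)
    cmd += [dev_text, merged_lm_path]
    return cmd


if __name__ == "__main__":
    # All paths below are placeholders, not files shipped with Joshua.
    cmd = build_merge_command(
        ["news.arpa", "europarl.arpa"],
        "dev.en",
        "merged.arpa",
        srilm_bin="/opt/srilm/bin/i686-m64",
    )
    # Print a shell-safe version of the command for copy and paste.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can be passed to subprocess.run if you want to launch the script from Python rather than from the shell.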

usage: ./merge_lms.py [-h] [--temp-dir TEMP_DIR]
                                      [--srilm-bin SRILM_BIN]
                                      input_lms [input_lms ...] dev_text
                                      merged_lm_path

Merge multiple language models into a single one.

positional arguments:
  input_lms             paths to language models to be merged
  dev_text              path to a file that will be used for calculating
                        interpolation weights for the language models to be
                        merged
  merged_lm_path        path to where the output merged LM file should be
                        written

optional arguments:
  -h, --help            show this help message and exit
  --temp-dir TEMP_DIR   path to the directory where perplexity calculations
                        will be stored. ".//Users/orluke/workspace/mt/joshua
                        /merge-lms-tmp/" is the default location. The temp dir
                        is not automatically deleted.
  --srilm-bin SRILM_BIN
                        path to where the srilm tool's binaries have been
                        compiled. By default this is "$SRILM/bin/i686-m64".
                       
Interpolation of the input models is based on a perplexity measurement of each
model against the development text.
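
To give a feel for how perplexity on development text can drive interpolation weights, here is a minimal Python sketch of one standard approach: expectation-maximization over the per-word probabilities that each model assigns to the dev text. This is an assumption made for illustration; the post does not say exactly how the script or the SRILM tools compute the weights. The function names and the toy probabilities are invented, and every probability is assumed to be strictly positive (as with smoothed LMs).

```python
import math


def estimate_mixture_weights(word_probs, iterations=50):
    """Estimate linear interpolation weights with EM.

    word_probs: one tuple per dev-text word, holding the probability each
    model assigns to that word in its context (all assumed > 0).
    """
    num_models = len(word_probs[0])
    weights = [1.0 / num_models] * num_models  # start from uniform weights
    for _ in range(iterations):
        totals = [0.0] * num_models
        for probs in word_probs:
            # Probability of this word under the current mixture.
            mix = sum(w * p for w, p in zip(weights, probs))
            # E-step: share of the mixture probability owed to each model.
            for k in range(num_models):
                totals[k] += weights[k] * probs[k] / mix
        # M-step: new weight is the average share across all words.
        weights = [t / len(word_probs) for t in totals]
    return weights


def perplexity(word_probs, weights):
    """Perplexity of the interpolated model on the same words."""
    log_sum = sum(
        math.log(sum(w * p for w, p in zip(weights, probs)))
        for probs in word_probs
    )
    return math.exp(-log_sum / len(word_probs))


if __name__ == "__main__":
    # Toy per-word probabilities from two hypothetical models.
    toy = [(0.2, 0.01), (0.1, 0.02), (0.3, 0.05), (0.01, 0.2)]
    weights = estimate_mixture_weights(toy)
    print("weights:", weights)
    print("uniform ppl:", perplexity(toy, [0.5, 0.5]))
    print("tuned ppl:", perplexity(toy, weights))
```

The model that predicts the dev text better on most words ends up with the larger weight, and the tuned mixture's perplexity on that text is no worse than the uniform mixture's.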




