script for merging language models

Luke Orland

Mar 19, 2013, 7:07:46 PM

to joshua_d...@googlegroups.com

We are in the process of adding a new script to the Joshua toolkit that uses some SRILM tools to combine multiple ARPA-formatted LMs into a single LM. Each input language model is assigned an interpolation weight based on perplexity measurements on development text.
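To illustrate how perplexity-driven interpolation weights can be found, here is a minimal sketch using expectation-maximization over per-token probabilities. It is not taken from merge_lms.py, which delegates the work to the SRILM tools; the function names and the toy probabilities are made up for the example, and it assumes you already have the probability each model assigns to each dev-text token.

```python
import math


def estimate_mixture_weights(token_probs, iterations=100, tol=1e-8):
    """Fit linear interpolation weights with EM.

    token_probs holds one row per dev-text token; each row lists the
    probability that each model assigned to that token (hypothetical input).
    """
    num_models = len(token_probs[0])
    weights = [1.0 / num_models] * num_models  # start from uniform weights
    for _ in range(iterations):
        totals = [0.0] * num_models
        for row in token_probs:
            mix = sum(w * p for w, p in zip(weights, row))
            if mix == 0.0:
                continue  # token unseen by every model; skip it
            for i, (w, p) in enumerate(zip(weights, row)):
                totals[i] += w * p / mix  # posterior share of model i
        norm = sum(totals)
        if norm == 0.0:
            return weights
        new_weights = [t / norm for t in totals]
        change = max(abs(a - b) for a, b in zip(weights, new_weights))
        weights = new_weights
        if change < tol:
            break
    return weights


def perplexity(token_probs, weights):
    """Perplexity of the interpolated model on the same tokens."""
    log_sum = sum(
        math.log(sum(w * p for w, p in zip(weights, row)))
        for row in token_probs
    )
    return math.exp(-log_sum / len(token_probs))


# Toy data: two models scored on four tokens.
toy_probs = [[0.20, 0.05], [0.10, 0.30], [0.25, 0.02], [0.15, 0.10]]
toy_weights = estimate_mixture_weights(toy_probs)
print(toy_weights, perplexity(toy_probs, toy_weights))
```

Each EM step cannot increase the dev-set perplexity of the mixture, so the fitted weights are never worse than the uniform starting point on that text.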

Matt Post gained 0.7 BLEU points on an experiment by merging several LMs.

The script is currently at https://github.com/lukeorland/joshua/blob/feature.merge_lms/scripts/support/merge_lms.py (click on the "Raw" button to get the download URL). Soon we will add options to pipeline.pl for using this script on all the language models included with the --lmfile option.

Call it with the path to each input language model, the path to the development text, and the path where the merged LM should be written. If your SRILM tools are not compiled to $SRILM/bin/i686-m64 in your environment, you must specify the alternative path with the --srilm-bin option.
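For anyone driving the script from Python, a call might be assembled roughly like this. All file names and the SRILM path below are placeholders, and build_merge_command is a hypothetical helper, not part of Joshua; only the argument order and the --srilm-bin option come from the script's usage message.

```python
import subprocess


def build_merge_command(input_lms, dev_text, merged_lm_path,
                        srilm_bin=None, script="./merge_lms.py"):
    """Assemble the argument list for merge_lms.py (hypothetical helper)."""
    cmd = [script]
    if srilm_bin is not None:
        # Only needed when SRILM is not under $SRILM/bin/i686-m64
        cmd += ["--srilm-bin", srilm_bin]
    # Positional order: all input LMs, then dev text, then output path
    cmd += list(input_lms) + [dev_text, merged_lm_path]
    return cmd


# Placeholder paths, for illustration only
cmd = build_merge_command(
    ["lm/first.arpa", "lm/second.arpa"],
    "dev/dev.en",
    "lm/merged.arpa",
    srilm_bin="/opt/srilm/bin/i686-m64",
)
print(" ".join(cmd))
# subprocess.check_call(cmd)  # uncomment to actually run the merge
```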

The merging process needs enough RAM to hold the resulting merged LM.

usage: ./merge_lms.py [-h] [--temp-dir TEMP_DIR]
                      [--srilm-bin SRILM_BIN]
                      input_lms [input_lms ...] dev_text
                      merged_lm_path

Merge multiple language models into a single one.

positional arguments:
  input_lms             paths to language models to be merged
  dev_text              path to a file that will be used for calculating
                        interpolation weights for the language models to be
                        merged
  merged_lm_path        path to where the output merged LM file should be
                        written

optional arguments:
  -h, --help            show this help message and exit
  --temp-dir TEMP_DIR   path to the directory where perplexity calculations
                        will be stored. ".//Users/orluke/workspace/mt/joshua
                        /merge-lms-tmp/" is the default location. The temp dir
                        is not automatically deleted.
  --srilm-bin SRILM_BIN
                        path to where the srilm tool's binaries have been
                        compiled. By default this is "$SRILM/bin/i686-m64".

Interpolation of the input models is based on a perplexity measurement of each
model against the development text.
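An interface like the one in the help text above can be declared with argparse roughly as follows. This is a reconstruction from the usage message, not the script's actual source; in particular the --temp-dir default shown here is a placeholder, and the fallback to a literal "$SRILM" when the variable is unset is an assumption.

```python
import argparse
import os


def make_parser():
    """Rebuild a parser matching the usage message (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="./merge_lms.py",
        description="Merge multiple language models into a single one.",
    )
    parser.add_argument("input_lms", nargs="+",
                        help="paths to language models to be merged")
    parser.add_argument("dev_text",
                        help="file used for calculating interpolation weights")
    parser.add_argument("merged_lm_path",
                        help="where the output merged LM file is written")
    # Placeholder default; the real default is shown in the help text
    parser.add_argument("--temp-dir", default="./merge-lms-tmp/",
                        help="directory for perplexity calculations")
    # Assumed: fall back to the literal string when $SRILM is not set
    srilm_root = os.environ.get("SRILM", "$SRILM")
    parser.add_argument("--srilm-bin",
                        default=os.path.join(srilm_root, "bin", "i686-m64"),
                        help="directory holding the compiled SRILM binaries")
    return parser


# Placeholder arguments, for illustration only
args = make_parser().parse_args(
    ["--temp-dir", "tmp", "a.arpa", "b.arpa", "dev.txt", "merged.arpa"]
)
print(args.input_lms, args.dev_text, args.merged_lm_path)
```

Because input_lms takes one or more values and is followed by two single-value positionals, argparse assigns the last two paths to dev_text and merged_lm_path and everything before them to input_lms.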
