We are in the process of adding a new script to the Joshua toolkit that uses some SRILM tools to combine multiple ARPA-formatted LMs into a single LM. Each input language model is assigned an interpolation weight based on perplexity measurements on development text.
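As a rough illustration of what an interpolation weight does, the sketch below mixes the per-word probabilities of two models and compares perplexities. The probabilities and weights are made-up numbers for illustration, and the helper names `interpolate` and `perplexity` are hypothetical; this is not code from the script.

```python
import math

def interpolate(probs, weights):
    """Mixture probability of one word: sum over models of weight * probability."""
    return sum(w * p for w, p in zip(weights, probs))

def perplexity(word_probs):
    """Perplexity of a text, given the probability assigned to each of its words."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))

# Hypothetical per-word probabilities from two models on a 4-word dev text.
model_a = [0.20, 0.05, 0.10, 0.40]
model_b = [0.10, 0.30, 0.02, 0.25]
weights = [0.6, 0.4]  # illustrative weights; they must sum to 1

# Mix the two models word by word, then compare the three perplexities.
mixed = [interpolate(pair, weights) for pair in zip(model_a, model_b)]
print(perplexity(model_a), perplexity(model_b), perplexity(mixed))
```

With these particular numbers the mixture has a lower perplexity than either model alone, which is the effect the merge is meant to exploit.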
Matt Post gained 0.7 BLEU points on an experiment by merging several LMs.
Call it with a path to each of the input language models, a path to the development text, and the path where the merged LM should be written. If your SRILM tools are not compiled to $SRILM/bin/i686-m64 in your environment, you must specify the alternative path with the --srilm-bin option.
The merging process requires enough memory to hold the resulting merged LM.
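A minimal sketch of assembling such a call from Python follows. The file names and the alternative SRILM directory are hypothetical placeholders, and `build_merge_command` is a helper invented for this example; only the argument order and the --srilm-bin option come from the script's usage text.

```python
import os

def build_merge_command(input_lms, dev_text, merged_lm_path, srilm_bin=None,
                        script="./merge_lms.py"):
    """Assemble the argv list for merge_lms.py from the documented arguments."""
    command = [script]
    if srilm_bin is not None:
        # Only needed when the SRILM binaries are not in $SRILM/bin/i686-m64.
        command += ["--srilm-bin", srilm_bin]
    # Positional arguments: all input LMs, then the dev text, then the output path.
    command += list(input_lms) + [dev_text, merged_lm_path]
    return command

# Hypothetical file names and SRILM directory, for illustration only.
command = build_merge_command(
    ["news.lm", "europarl.lm"], "dev.en", "merged.lm",
    srilm_bin=os.path.expandvars("$SRILM/bin/i686"),
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to launch the merge.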
usage: ./merge_lms.py [-h] [--temp-dir TEMP_DIR]
                      [--srilm-bin SRILM_BIN]
                      input_lms [input_lms ...] dev_text
                      merged_lm_path

Merge multiple language models into a single one.

positional arguments:
  input_lms             paths to language models to be merged
  dev_text              path to a file that will be used for calculating
                        interpolation weights for the language models to be
                        merged
  merged_lm_path        path to where the output merged LM file should be
                        written

optional arguments:
  -h, --help            show this help message and exit
  --temp-dir TEMP_DIR   path to the directory where perplexity calculations
                        will be stored. ".//Users/orluke/workspace/mt/joshua
                        /merge-lms-tmp/" is the default location. The temp dir
                        is not automatically deleted.
  --srilm-bin SRILM_BIN
                        path to where the srilm tool's binaries have been
                        compiled. By default this is "$SRILM/bin/i686-m64".

Interpolation of the input models is based on a perplexity measurement of each
model against the development text.
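
The help text does not say how the perplexity measurements become weights. One common approach, and as far as we know the one behind SRILM's compute-best-mix helper, is expectation-maximization over the per-word probabilities each model assigns to the development text. The sketch below illustrates that idea under the assumption that every probability is positive; `estimate_weights` is a hypothetical name and this is not the script's own code.

```python
def estimate_weights(model_probs, iterations=50):
    """Estimate interpolation weights with EM.

    model_probs[i][t] is the probability model i assigns to dev-text word t.
    Returns one weight per model; the weights sum to 1.
    """
    n_models = len(model_probs)
    n_words = len(model_probs[0])
    weights = [1.0 / n_models] * n_models  # start from a uniform mixture
    for _ in range(iterations):
        # E-step: accumulate each model's share of the mixture probability.
        counts = [0.0] * n_models
        for t in range(n_words):
            mix = sum(weights[i] * model_probs[i][t] for i in range(n_models))
            for i in range(n_models):
                counts[i] += weights[i] * model_probs[i][t] / mix
        # M-step: the new weight is the average share over the dev text.
        weights = [c / n_words for c in counts]
    return weights

# Hypothetical probabilities: the first model fits this dev text better throughout.
print(estimate_weights([[0.5, 0.5, 0.5], [0.1, 0.1, 0.1]]))
```

A model that assigns consistently higher probability to the development text ends up with nearly all of the weight, while models that are each better on different words share it.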