I create the file like this:
echo
echo "===== MAKING G.fst ====="
echo
lang=data/lang
ls ${lang}
cat $local/tmp/lm.arpa | arpa2fst - | fstprint | utils/eps2disambig.pl |
utils/s2eps.pl | fstcompile --isymbols=$lang/words.txt --osymbols=$lang/words.txt --keep_isymbols=false --keep_osymbols=false |
fstrmepsilon | fstarcsort --sort_type=ilabel > $lang/G.fst
I run into a problem when I run this command.
I am getting this error message:
===== MAKING G.fst =====
L.fst L_disambig.fst oov.int oov.txt phones phones.txt topo words.txt
arpa2fst -
LOG (arpa2fst:Read():arpa-file-parser.cc:90) Reading \data\ section.
LOG (arpa2fst:Read():arpa-file-parser.cc:145) Reading \1-grams: section.
LOG (arpa2fst:Read():arpa-file-parser.cc:145) Reading \2-grams: section.
LOG (arpa2fst:Read():arpa-file-parser.cc:145) Reading \3-grams: section.
FATAL: FstCompiler: Symbol "APOSTROPHE" is not mapped to any integer arc ilabel, symbol table = data/lang/words.txt, source = standard input, line = 5
ERROR: FstHeader::Read: Bad FST header: standard input
ERROR: FstHeader::Read: Bad FST header: standard input
The error says it cannot find "APOSTROPHE" in words.txt, which makes sense since it is not there. But the word is not listed in the training set
or in the lexicon either, as the lexicon has been filtered to contain only words that occur in the training set.
The word is in the test set, but the words in the test set are not used when filtering the lexicon, as it is pretty unlikely that a word
from the test set is missing from the train set. So how does it know about the existence of the word? As far as I can see, the script only looks at lm.arpa and words.txt.
The words.txt in data/lang is created by:
utils/prepare_lang.sh data/local/lang '<oov>' data/local/lang data/lang
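One way to narrow this down might be to list the words that appear in the LM but not in words.txt. This is only a minimal sketch: it assumes the ARPA file has the standard \1-grams: section and that words.txt holds one "word id" pair per line. The function name lm_oov_words is made up for illustration and is not part of Kaldi.

```shell
# Hypothetical helper (not a Kaldi utility): print every unigram in an
# ARPA LM that has no entry in a Kaldi-style words.txt.
# Usage: lm_oov_words lm.arpa words.txt
lm_oov_words() {
  arpa=$1
  words=$2
  awk '
    # First file (words.txt): remember every known symbol.
    NR == FNR { known[$1] = 1; next }
    # Second file (ARPA LM): only look inside the \1-grams: section.
    /^\\1-grams:/ { in_uni = 1; next }
    /^\\/         { in_uni = 0; next }
    # Unigram lines look like: logprob word [backoff].
    # <s> and </s> are skipped because utils/s2eps.pl maps them to <eps>.
    in_uni && NF >= 2 && $2 != "<s>" && $2 != "</s>" && !($2 in known) {
      print $2
    }
  ' "$words" "$arpa"
}
```

Running it as `lm_oov_words $local/tmp/lm.arpa data/lang/words.txt` should print APOSTROPHE if the word really is in the LM, which would suggest the LM was built from text that includes more than the filtered training vocabulary.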