Problems with generating the G.fst file...

K.R

Dec 17, 2016, 11:10:41 AM

to kaldi-help

I create the file like this:



echo
echo "===== MAKING G.fst ====="
echo
lang=data/lang
ls ${lang}


cat $local/tmp/lm.arpa | arpa2fst - | fstprint | utils/eps2disambig.pl | utils/s2eps.pl |
  fstcompile --isymbols=$lang/words.txt --osymbols=$lang/words.txt --keep_isymbols=false --keep_osymbols=false |
  fstrmepsilon | fstarcsort --sort_type=ilabel > $lang/G.fst

I run into a problem when I run this command. I get this error message:



===== MAKING G.fst =====


L.fst  L_disambig.fst  oov.int oov.txt  phones  phones.txt  topo  words.txt
arpa2fst - 
LOG (arpa2fst:Read():arpa-file-parser.cc:90) Reading \data\ section.
LOG (arpa2fst:Read():arpa-file-parser.cc:145) Reading \1-grams: section.
LOG (arpa2fst:Read():arpa-file-parser.cc:145) Reading \2-grams: section.
LOG (arpa2fst:Read():arpa-file-parser.cc:145) Reading \3-grams: section.
FATAL: FstCompiler: Symbol "APOSTROPHE" is not mapped to any integer arc ilabel, symbol table = data/lang/words.txt, source = standard input, line = 5
ERROR: FstHeader::Read: Bad FST header: standard input
ERROR: FstHeader::Read: Bad FST header: standard input

The error is that it cannot find "APOSTROPHE" in words.txt, which makes sense since it is not there. But the word is not in the training set or in the lexicon, which has been filtered to contain only words that occur in the training set.

The word is in the test set, but the test-set words are not used to filter the lexicon, since it is pretty unlikely that a word from the test set is not in the train set. So how does it know about the existence of the word? As far as I can see from the script, it only looks at lm.arpa and words.txt.

The words.txt in data/lang is created by:

utils/prepare_lang.sh data/local/lang '<oov>' data/local/lang data/lang
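
One way to narrow down where a symbol like "APOSTROPHE" comes from is to list the words that appear in the 1-gram section of the ARPA file but have no entry in words.txt. The following is only a minimal sketch: the function name list_arpa_oovs is made up for illustration, and it assumes a standard ARPA layout ("logprob word [backoff]" per 1-gram line) and a Kaldi-style words.txt ("word id" per line). The sentence markers <s> and </s> are skipped on the assumption that they are handled separately.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Kaldi): print every word that occurs in the
# \1-grams: section of an ARPA LM but has no entry in words.txt.
list_arpa_oovs() {
  local arpa=$1 words=$2
  awk '
    # First file (words.txt): remember every known symbol.
    NR == FNR { known[$1] = 1; next }
    # Second file (ARPA): only inspect lines inside the \1-grams: section.
    /^\\1-grams:/ { in_unigrams = 1; next }
    /^\\/         { in_unigrams = 0; next }
    # A 1-gram line is "logprob word [backoff]", so the word is field 2.
    # Sentence markers are skipped (assumption: they are handled elsewhere).
    in_unigrams && NF >= 2 && $2 != "<s>" && $2 != "</s>" && !($2 in known) { print $2 }
  ' "$words" "$arpa"
}
```

With the paths from the script above, it could be called as list_arpa_oovs $local/tmp/lm.arpa data/lang/words.txt. Any word it prints is in the language model but not in the symbol table, which is the situation fstcompile complains about.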

Daniel Povey

Dec 17, 2016, 5:08:06 PM

to kaldi-help

Where did you get that command to convert the arpa into FST? That
doesn't look right, and it isn't handling OOVs correctly.
The correct command is something like what's in utils/format_lm.sh, like:

gunzip -c $lm \
| arpa2fst --disambig-symbol=#0 \
--read-symbol-table=$out_dir/words.txt - $out_dir/G.fst

Dan


K.R

Dec 17, 2016, 6:35:28 PM

to kaldi-help, dpo...@gmail.com

I got it from here:

http://www.dsp.agh.edu.pl/_media/pl:dydaktyka:kaldi_for_dummies_-_fixed.pdf

So I should replace the full command with that.
I resolved my issue by merging the test and train word lists into one word list and filtering my lexicon with that.
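
The merge-and-filter workaround described above could look roughly like the sketch below. The thread does not show the actual script, so the function names and file layouts are assumptions: transcripts are taken to be Kaldi-style text files ("utterance-id word1 word2 ..." per line) and the lexicon to have one "word phone1 phone2 ..." entry per line.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and file layouts are assumptions for illustration.

# Print the sorted, unique words from one or more Kaldi-style text files.
# Field 1 of each line is the utterance id, so it is skipped.
collect_words() {
  awk '{ for (i = 2; i <= NF; i++) print $i }' "$@" | LC_ALL=C sort -u
}

# Keep only the lexicon entries whose word (field 1) is in the word list.
filter_lexicon() {
  local wordlist=$1 lexicon=$2
  awk 'NR == FNR { keep[$1] = 1; next } $1 in keep' "$wordlist" "$lexicon"
}
```

For example, collect_words train/text test/text > words.list followed by filter_lexicon words.list lexicon.txt > lexicon_filtered.txt would produce a lexicon that covers both sets (the paths here are placeholders).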
