Problems with generating the G.fst file...

K.R

Dec 17, 2016, 11:10:41 AM
to kaldi-help
I create the file like this:


echo
echo "===== MAKING G.fst ====="
echo
lang=data/lang
ls ${lang}

cat $local/tmp/lm.arpa | arpa2fst - | fstprint | utils/eps2disambig.pl | utils/s2eps.pl | \
  fstcompile --isymbols=$lang/words.txt --osymbols=$lang/words.txt \
    --keep_isymbols=false --keep_osymbols=false | \
  fstrmepsilon | fstarcsort --sort_type=ilabel > $lang/G.fst




I seem to have a problem when I run this command. I am getting this error message:



===== MAKING G.fst =====

L.fst  L_disambig.fst  oov.int oov.txt  phones  phones.txt  topo  words.txt
arpa2fst -
LOG (arpa2fst:Read():arpa-file-parser.cc:90) Reading \data\ section.
LOG (arpa2fst:Read():arpa-file-parser.cc:145) Reading \1-grams: section.
LOG (arpa2fst:Read():arpa-file-parser.cc:145) Reading \2-grams: section.
LOG (arpa2fst:Read():arpa-file-parser.cc:145) Reading \3-grams: section.
FATAL: FstCompiler: Symbol "APOSTROPHE" is not mapped to any integer arc ilabel, symbol table = data/lang/words.txt, source = standard input, line = 5
ERROR: FstHeader::Read: Bad FST header: standard input
ERROR: FstHeader::Read: Bad FST header: standard input



The error is that it cannot find "APOSTROPHE" in words.txt, which makes sense since it is not there. But the word isn't listed in either the training set or the lexicon, since the lexicon has been filtered to contain only words that occur in the training set.

The word is in the test set, and the words in the test set aren't used for filtering the lexicon, since it is pretty unlikely that a word from the test set isn't in the train set. So how does it know about the existence of the word? As far as I can see, the script only looks at lm.arpa and words.txt.

The words.txt in data/lang is created by:
utils/prepare_lang.sh data/local/lang '<oov>' data/local/lang data/lang


Daniel Povey

Dec 17, 2016, 5:08:06 PM
to kaldi-help
Where did you get that command to convert the arpa into FST? That
doesn't look right, and it isn't handling OOVs correctly.
The correct command is something like what's in utils/format_lm.sh, like:

gunzip -c $lm \
| arpa2fst --disambig-symbol=#0 \
--read-symbol-table=$out_dir/words.txt - $out_dir/G.fst


Dan
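
The FATAL message above is raised for a symbol that reaches fstcompile from the ARPA file but has no entry in words.txt, so it can help to list such words before building G.fst. The following is a minimal sketch and not part of the original thread: `list_lm_oovs` is a hypothetical helper name, and it assumes an uncompressed ARPA file and a words.txt with one "word id" pair per line.

```shell
# Hypothetical helper (not from Kaldi): print every word in the \1-grams:
# section of an ARPA LM that has no entry in words.txt.
list_lm_oovs() {
  arpa=$1    # assumed: uncompressed ARPA file, e.g. $local/tmp/lm.arpa
  words=$2   # assumed: symbol table, e.g. $lang/words.txt
  awk '
    NR == FNR { known[$1] = 1; next }      # first file: remember each word
    /^\\1-grams:/ { in_uni = 1; next }     # unigram section starts here
    /^\\/ { in_uni = 0 }                   # any other section marker ends it
    in_uni && NF >= 2 && !($2 in known) { print $2 }   # field 2 is the word
  ' "$words" "$arpa"
}

# Example call, using the paths from the first message:
# list_lm_oovs "$local/tmp/lm.arpa" "$lang/words.txt"
```

Any word this prints is in the language model but not in the symbol table, which is the situation reported for "APOSTROPHE".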

K.R

Dec 17, 2016, 6:35:28 PM
to kaldi-help, dpo...@gmail.com
I got it from here.. 


So I should replace the full command with that.
I resolved my issue by merging the lists of words in test and train into one word list, and filtering my lexicon with that.
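
The merge-and-filter step described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the script actually used: `filter_lexicon` is a hypothetical name, both word lists are assumed to hold one word per line, and the lexicon is assumed to have the word as its first field.

```shell
# Hypothetical helper: keep only the lexicon entries whose word (field 1)
# occurs in the train word list or the test word list.
filter_lexicon() {
  train_words=$1   # assumed: one word per line
  test_words=$2    # assumed: one word per line
  lexicon=$3       # assumed: "word phone1 phone2 ..." per line
  # Merge and de-duplicate the two lists, then let awk read the merged
  # list from stdin ("-") before it reads the lexicon.
  sort -u "$train_words" "$test_words" |
    awk 'NR == FNR { keep[$1] = 1; next } ($1 in keep)' - "$lexicon"
}

# Example call with hypothetical file names:
# filter_lexicon train_words.txt test_words.txt lexicon_full.txt > lexicon.txt
```

The lexicon lines are printed in their original order, so the filtered file keeps whatever sorting the full lexicon had.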