Error when running prepare_lang.sh. sym2int.pl undefined symbol EY_B

joesm...@gmail.com

Mar 24, 2016, 8:46:01 AM

to kaldi-help

I'm trying to train a model using part of the xm2vts dataset. This consists of the digits 0-9.
My machine is running Ubuntu.

I've created the necessary files following the data preparation instructions and Wit Zieliński's tutorial.

When I run utils/prepare_lang.sh, it checks the files, outputs a success message, and then gives an error related to the FstCompiler.

The error I'm getting is:
sym2int.pl: undefined symbol EY_B (in position 3)
FATAL: FstCompiler: Symbol "EY_B" is not mapped to any integer arc ilabel, symbol table = data/local/phones.txt, source = standard input, line = 4
ERROR: FstHeader::Read: Bad FST header: standard input

If I include !Sil sil in the lexicon, the undefined symbol changes to SIL_S, while the FstCompiler error still shows EY_B as not being mapped.
Similarly, if I use <UNK> SPN, the undefined symbol changes to SPN_S, while the FstCompiler error still shows EY_B as not being mapped.

Similar questions I have seen posted suggest that this problem results from an issue with the user-generated files, so I have checked these over and can't find anything missing.
Wit Zieliński's tutorial also uses digits, so I tried running it again with the contents of the user-generated files changed to exactly match what is shown in the tutorial. This results in the same error messages.

You can get my files from here:
https://www.dropbox.com/s/dstpqtq9f3pzys1/KaldiTrain.zip?dl=0

The tutorial I am referring to can be found here:
https://groups.google.com/group/kaldi-help/attach/66955a58bf0c2/Kaldi%20for%20dummies%20-%20FIXED?part=0.1&authuser=0&view=1

Daniel Povey

Mar 24, 2016, 1:33:21 PM

to kaldi-help

Please show us more context for the error, i.e. more of the output on the screen.

dan


joesm...@gmail.com

Apr 4, 2016, 5:07:58 AM

to kaldi-help, dpo...@gmail.com

The output is:

Checking dict/silence_phones.txt ...
--> reading dict/silence_phones.txt
--> dict/silence_phones.txt is OK

Checking dict/optional_silence.txt ...
--> reading dict/optional_silence.txt
--> dict/optional_silence.txt is OK

Checking dict/nonsilence_phones.txt ...
--> reading dict/nonsilence_phones.txt
--> dict/nonsilence_phones.txt is OK

Checking disjoint: silence_phones.txt, nonsilence_phones.txt
--> disjoint property is OK.

Checking dict/lexicon.txt
--> reading dict/lexicon.txt
--> dict/lexicon.txt is OK

Checking dict/lexiconp.txt
--> reading dict/lexiconp.txt
--> dict/lexiconp.txt is OK

Checking lexicon pair dict/lexicon.txt and dict/lexiconp.txt
--> lexicon pair dict/lexicon.txt and dict/lexiconp.txt match

Checking dict/extra_questions.txt ...
--> dict/extra_questions.txt is empty (this is OK)
--> SUCCESS [validating dictionary directory dict]

sym2int.pl: undefined symbol EY_B (in position 3)
FATAL: FstCompiler: Symbol "EY_B" is not mapped to any integer arc ilabel, symbol table = data/local/phones.txt, source = standard input, line = 4
ERROR: FstHeader::Read: Bad FST header: standard input

Ruoho Ruotsi

Apr 4, 2016, 1:35:44 PM

to kaldi-help, dpo...@gmail.com, joesm...@gmail.com

I think the formatting (delimiting chars?) of your lexicon has somehow gone sideways, such that the generated phones.txt is all wrong. I looked at your phone_map.txt & phones.txt and I see that each phone part has a unique ID. I'm pretty sure that is not expected by sym2int.pl.

<snip>

EY 114

_B 115

EY 116

_E 117

EY 118

<snip>

Whereas my phones.txt looks like

ey_B 69

ey_E 70

ey_I 71

ey_S 72

Your lexicon.txt looks like this, and I think those are tabs. When I did Wit's tutorial, I just used spaces.

<snip>

NINE N AY N

ONE W AH N

SEVEN S EH V AH N

SIX S IH K S

<snip>

I'm not as versed in the language modeling scripts, so I'll let you debug from here.

Daniel Povey

Apr 4, 2016, 2:29:41 PM

to Ruoho Ruotsi, kaldi-help, joesm...@gmail.com

When I look at your dict/phone_map.txt (which is probably generated by prepare_lang.sh), I see that it has ^M characters in it:

SIL^M SIL^M SIL^M_B SIL^M_E SIL^M_I SIL^M_S

Z^M Z^M_B Z^M_E Z^M_I Z^M_S

IY^M IY^M_B IY^M_E IY^M_I IY^M_S

That file seems to be generated by a perl script in prepare_lang.sh.

It's odd because ^M's don't seem to appear in your input files, although it's possible that they were removed afterward by some mysterious process.

It's hard for me to debug this without being on your system-- I'd guess that your perl version might be a bit messed up, e.g. you are using a Windows version of perl from cygwin and the binary/text mode is getting confused. What version of perl are you using?

Dan
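
Editor's note: the following is a minimal Python sketch of how a stray carriage return (shown as ^M) could lead to the symptom described above. It assumes the word-position variants are formed by appending _B, _E, _I and _S to each phone name read from the phone list. The function name expand_phone is hypothetical, and this is not the actual Perl code in prepare_lang.sh; it only illustrates why "EY^M_B" and "EY_B" end up being different symbols.

```python
# Illustrative only: this is NOT the actual logic of prepare_lang.sh.
SUFFIXES = ["_B", "_E", "_I", "_S"]

def expand_phone(line):
    """Strip only the newline, then append the word-position suffixes."""
    phone = line.rstrip("\n")  # a trailing "\r" from a CRLF file survives this
    return [phone + suffix for suffix in SUFFIXES]

unix_line = "EY\n"    # line from a file saved with Unix (LF) endings
dos_line = "EY\r\n"   # same line from a file saved with DOS (CRLF) endings

good = expand_phone(unix_line)
bad = expand_phone(dos_line)

print(good)  # ['EY_B', 'EY_E', 'EY_I', 'EY_S']
print(bad)   # ['EY\r_B', 'EY\r_E', 'EY\r_I', 'EY\r_S']
print("EY_B" in good, "EY_B" in bad)  # True False
```

Under these assumptions, a later lookup of EY_B in a table built from the CRLF file would fail, which is consistent with the "undefined symbol" message.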

joesm...@gmail.com

Apr 5, 2016, 9:56:47 AM

to kaldi-help, ruoho....@gmail.com, joesm...@gmail.com, dpo...@gmail.com

When I run perl -version I get
This is perl 5, version 18, subversion 2 (v5.18.2) built for x86_64-linux-gnu-thread-multi

In case this makes a difference my OS is Ubuntu 14.04.4 LTS

Daniel Povey

Apr 5, 2016, 12:55:04 PM

to joesm...@gmail.com, kaldi-help, Ruoho Ruotsi

Maybe there were ^M's in your input files but at some point along the way they were stripped out. I don't have time to debug this right now, but it has something to do with ^M's. Try to figure out where they are getting inserted.

Dan
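
Editor's note: one way to locate the ^M characters is to scan the input files as raw bytes, since some editors hide carriage returns. The sketch below is a minimal example, not part of Kaldi; the function name files_with_cr and the demo file names are assumptions made for illustration.

```python
import os
import tempfile

def files_with_cr(root):
    """Return sorted relative paths of files under root that contain a CR byte."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:  # binary mode, so "\r" is not translated away
                if b"\r" in f.read():
                    hits.append(os.path.relpath(path, root))
    return sorted(hits)

# Demo on a throwaway directory; the file names are only examples.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "lexicon.txt"), "wb") as f:
        f.write(b"ONE W AH N\n")   # LF endings only
    with open(os.path.join(demo_dir, "nonsilence_phones.txt"), "wb") as f:
        f.write(b"EY\r\nAH\r\n")   # CRLF endings
    demo_hits = files_with_cr(demo_dir)

print(demo_hits)  # ['nonsilence_phones.txt']
```

On a real setup the same function could be pointed at the dictionary directory passed to prepare_lang.sh, for example files_with_cr("data/local/dict").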

ozi samur

Apr 24, 2016, 1:05:11 AM

to kaldi-help, joesm...@gmail.com, ruoho....@gmail.com, dpo...@gmail.com

Is there any solution you can offer? I am getting EXACTLY the same error. As far as I know, there is no restriction on using UTF-8 characters in lexicon.txt, corpus.txt, etc. Am I wrong?

I see the ^M characters in data/local/lang/phones_map when I look at it in vi, but I checked all the input files and there are no ^M characters in them. What should I do?

The error I am getting is below:

**Creating data/local/dict/lexiconp.txt from data/local/dict/lexicon.txt

sym2int.pl: undefined symbol A_B (in position 3)

FATAL: FstCompiler: Symbol "A_B" is not mapped to any integer arc ilabel, symbol table = data/lang/phones.txt, source = standard input, line = 8

ERROR: FstHeader::Read: Bad FST header: standard input


Daniel Povey

Apr 24, 2016, 1:08:30 AM

to ozi samur, kaldi-help, joesm...@gmail.com, Ruoho Ruotsi

It's hard for me to comment on this without knowing precisely what
commands you were running. What was the first command that generated
a file with ^M's in it? Please be specific, with URLs, file
locations, etc.
Dan

ozi samur

Apr 24, 2016, 1:21:16 AM

to kaldi-help, ozis...@gmail.com, joesm...@gmail.com, ruoho....@gmail.com, dpo...@gmail.com

I am getting this error when run.sh tries to run prepare_lang.sh. My perl version is:

This is perl 5, version 20, subversion 2 (v5.20.2) built for x86_64-linux-gnu-thread-multi

(with 40 registered patches, see perl -V for more detail)

My locale (LC_ALL) is set to C, and I am using UTF-8 characters in the lexicon, corpus, etc. Does that cause the problem?


Daniel Povey

Apr 24, 2016, 1:24:15 AM

to ozi samur, kaldi-help, joesm...@gmail.com, Ruoho Ruotsi

You haven't said what run.sh it was, but no matter.
The problem is not UTF-8.
Almost certainly you have ^M characters in your input, but vi is going
to some kind of dos mode and not displaying them.
Search for what program added those ^M characters, and kill it with fire.
You can use dos2unix to strip out the ^M characters.
Dan
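
Editor's note: where dos2unix is not installed, the same kind of cleanup can be approximated in a few lines of Python. This is a minimal sketch that only handles the common CRLF case and does not reproduce every dos2unix behaviour; the function names strip_cr and convert_file are hypothetical. It works on raw bytes, so UTF-8 text in the lexicon is left untouched.

```python
def strip_cr(data: bytes) -> bytes:
    """Convert CRLF line endings to LF, leaving all other bytes unchanged."""
    return data.replace(b"\r\n", b"\n")

def convert_file(path):
    """Rewrite the file at path with LF endings; return True if it changed."""
    with open(path, "rb") as f:
        original = f.read()
    converted = strip_cr(original)
    if converted != original:
        with open(path, "wb") as f:
            f.write(converted)
        return True
    return False

# Demo: a CRLF-terminated lexicon line containing non-ASCII characters.
sample = "ÇOK ç o k\r\n".encode("utf-8")
cleaned = strip_cr(sample)
print(cleaned.decode("utf-8") == "ÇOK ç o k\n")  # True
```

Calling convert_file on each dictionary input file before running prepare_lang.sh would be one way to apply it.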

ozi samur

Apr 24, 2016, 1:27:18 AM

to kaldi-help, ozis...@gmail.com, joesm...@gmail.com, ruoho....@gmail.com, dpo...@gmail.com

The error I am getting is here :

**Creating data/local/dict/lexiconp.txt from data/local/dict/lexicon.txt

sym2int.pl: undefined symbol A_B (in position 3)

FATAL: FstCompiler: Symbol "A_B" is not mapped to any integer arc ilabel, symbol table = data/lang/phones.txt, source = standard input, line = 8

ERROR: FstHeader::Read: Bad FST header: standard input


ozi samur

Apr 24, 2016, 3:06:46 AM

to kaldi-help, ozis...@gmail.com, joesm...@gmail.com, ruoho....@gmail.com, dpo...@gmail.com

After I applied dos2unix to nonsilence_phones, silence_phones, and optional_silence_phones, the error went away.

Thanks Dan.

On Sunday, April 24, 2016 at 09:19:50 UTC+3, ozi samur wrote:

I tried converting with dos2unix. It changes some of the UTF-8 characters, but I am still getting the same error.
My lexicon.txt file, converted with dos2unix, is attached. What do you think?
