Can't build a language model for Amharic language data


Habush Samireh

Nov 21, 2016, 8:47:30 AM
to kaldi-help

I'm working on a project that requires voice recognition for Amharic. For the language data required to train the ASR, I use the data available in this GitHub repo. The repo provides some scripts for training with the data. However, when I try to build the language model using the provided script, I get the following error:


The script I run is the following:

$KALDI_ROOT/tools/srilm/bin/i686-m64/ngram-count -order 5 -text lang/dict/lexicon.txt -lm lm/amharic.train.lm.data.arpa -unk -kndiscount1 -kndiscount2 -kndiscount3 -kndiscount4 -kndiscount5 -gt1min 1 -gt2min 1 -gt3min 1 -gt4min 1 -gt5min 1

#convert to FST format for Kaldi
cat lm/amharic.train.lm.data.arpa | $KALDI_ROOT/egs/wsj/s5/utils/find_arpa_oovs.pl lang/words.txt  > lang/oovs.txt

cat lm/amharic.train.lm.data.arpa | grep -v '<s> <s>' | grep -v '</s> <s>' | grep -v '</s> </s>' | arpa2fst - | fstprint | $KALDI_ROOT/egs/wsj/s5/utils/remove_oovs.pl lang/oovs.txt | $KALDI_ROOT/egs/wsj/s5/utils/eps2disambig.pl | $KALDI_ROOT/egs/wsj/s5/utils/s2eps.pl | fstcompile --isymbols=lang/words.txt --osymbols=lang/words.txt  --keep_isymbols=false --keep_osymbols=false | fstrmepsilon > lm/G.fst

#add fst sort arc tools/openfst/bin/arcsort to solve the problem of "ERROR: data/lang/G.fst is not ilabel sorted"
fstarcsort  --sort_type=ilabel lm/G.fst lm/newG.fst
mv lm/newG.fst lang/
mv lang/newG.fst lang/G.fst
#utils/validate_lang.pl lang



Could you guys please take a look at the repo and tell me what I am missing here? Thanks in advance

Daniel Povey

Nov 21, 2016, 2:14:24 PM
to kaldi-help
Firstly, it looks like your Kaldi repo is out of date.  But the real problem is probably that either the ARPA file got truncated at that line number, or maybe it had an unexpected blank line.
You can cat the file and pipe into
tail -n 7473990 | tail -n 10
to see the context

--
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Habush Samireh

Nov 22, 2016, 8:03:32 AM
to kaldi-help, dpo...@gmail.com
Here is the output after running the commands you suggested (I added line numbers for clarity):


As you can see, there is a space on line 9 and a '<s>' token on line 10. Could it be because of those lines? Should I edit or remove them?

Jan Trmal

Nov 22, 2016, 9:13:16 AM
to kaldi-help
Please refrain from spamming the list by asking the same question again and again.
Also, you didn't say if you have the latest Kaldi -- our guess is you do not, because the error information indicates a different line than in the latest version.
Update your kaldi and run the command again.
y.


Habush Samireh

Nov 22, 2016, 9:25:22 AM
to kaldi-help
I am sorry for repeatedly posting the same reply. It was by accident and it won't happen again. As for the Kaldi version, I downloaded Kaldi from the GitHub repo found here about two weeks ago. Isn't that the latest version?

Jan Trmal

Nov 22, 2016, 9:26:40 AM
to kaldi-help
No, there was an update to the arpa2fst code a couple of days ago.
y.


Habush Samireh

Nov 22, 2016, 9:30:00 AM
to kaldi-help
Okay. I'm downloading the latest version from the repo. I will run the commands and let you guys know how it went.

Daniel Povey

Nov 22, 2016, 4:38:40 PM
to Habush Samireh, kaldi-help
Sorry, that command was wrong; it should have been
head -n 7473990 | tail -n 10
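
For reference, the earlier `tail -n 7473990 | tail -n 10` only shows the last ten lines of the file, whereas `head -n N | tail -n 10` shows the ten lines that end at line N. A minimal sketch of the corrected command as a reusable helper follows; the function name `show_context` is made up for illustration and is not part of Kaldi.

```shell
# Print the COUNT lines of FILE that end at line LINE_NO (COUNT defaults to 10).
# Usage: show_context FILE LINE_NO [COUNT]
# The function body runs in a subshell, so its variables stay local.
show_context() (
  file=$1
  line_no=$2
  count=${3:-10}
  head -n "$line_no" "$file" | tail -n "$count"
)

# Example, using the ARPA path from the script earlier in the thread:
# show_context lm/amharic.train.lm.data.arpa 7473990 10
```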




Daniel Povey

Nov 22, 2016, 4:39:53 PM
to Habush Samireh, kaldi-help
... Actually, forget it... it's clear from your output that your ARPA file is truncated, which is what the Kaldi program was complaining about.  It should end with a blank line and then \end\.  But don't add it manually, it's missing other stuff too.  Possibly something went wrong in SRILM while it was being created.  Check for error messages earlier on, from SRILM.
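
A rough way to test for this kind of truncation is sketched below. It assumes the usual ARPA layout (a `\data\` header with `ngram N=COUNT` lines, one `\N-grams:` section per order, and a closing `\end\`). The function name `arpa_check` is made up for illustration. Passing the check does not prove the file is valid; it only catches a missing `\end\` marker and sections shorter than the header promises.

```shell
# Report a missing \end\ marker and any order whose number of entries
# differs from the count declared in the \data\ header.
# Exit status is 0 when nothing suspicious was found, 1 otherwise.
arpa_check() (
  awk '
    $1 == "ngram" && $2 ~ /^[0-9]+=[0-9]+$/ {   # header line: "ngram N=COUNT"
      split($2, kv, "=")
      expected[kv[1]] = kv[2]
      next
    }
    $1 ~ /^\\[0-9]+-grams:$/ {                  # section marker: "\N-grams:"
      order = substr($1, 2) + 0                 # "N-grams:" converts to N
      next
    }
    $1 == "\\end\\" { ended = 1; order = 0; next }
    order > 0 && NF > 0 { seen[order]++ }       # one n-gram entry per line
    END {
      status = 0
      if (!ended) { print "missing \\end\\ marker"; status = 1 }
      for (n in expected)
        if (seen[n] + 0 != expected[n] + 0) {
          printf "order %d: header says %d, found %d\n", n, expected[n], seen[n] + 0
          status = 1
        }
      exit status
    }
  ' "$1"
)

# Example: arpa_check lm/amharic.train.lm.data.arpa
```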


Habush Samireh

Nov 23, 2016, 7:21:04 AM
to kaldi-help, hsam...@gmail.com, dpo...@gmail.com
After running ngram-count from srilm, I get the following error:

one of modified KneserNey discounts is negative
error in discount estimator for order 1

Also, I should mention that the 03_LM.sh script, which contains the commands that run the SRILM tools, has a line that passes a nonexistent file as an argument. The original line is this:
/home/melese/toolkit/srilm/bin/i686-m64/ngram-count -order 5 -text lm/amharic.lm.data.segmented -lm lm/amharic.train.lm.data.arpa -unk -kndiscount1 -kndiscount2 -kndiscount3 -kndiscount4 -kndiscount5 -gt1min 1 -gt2min 1 -gt3min 1 -gt4min 1 -gt5min 1
 

However, the file passed to the -text argument, amharic.lm.data.segmented, does not exist. To make it work, I replaced it with the lang/dict/lexicon.txt file. I have opened an issue in the git repository about the missing file.
Could that be the problem?

Laurent Besacier

Nov 23, 2016, 11:13:00 AM
to kaldi-help
Hello

I think you had a problem creating the language model for the following reason:
The file amharic.lm.data.segmented does not exist because it was too big to be transferred to the repo.
So there are two zip files instead.
You should first unzip these two .zip files, then concatenate them, rename the result to amharic.lm.data.segmented, and run the scripts again from this step.

It should be OK then.

We are going to patch the repo accordingly.

Best

L
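
The concatenation step described above might be sketched as follows. The thread does not give the names of the two zip files, so `part1.zip`, `part2.zip`, and the extracted `part1`/`part2` names in the usage comments are placeholders; `concat_parts` is a made-up helper, not something shipped with the repo.

```shell
# Concatenate already-unzipped corpus parts, in the order given, into OUTPUT.
# Fails if any part is missing or empty, so a bad extraction is caught early.
# Usage: concat_parts OUTPUT PART1 PART2 ...
concat_parts() (
  out=$1
  shift
  : > "$out"                                   # start from an empty output file
  for part in "$@"; do
    [ -s "$part" ] || { echo "missing or empty: $part" >&2; exit 1; }
    cat "$part" >> "$out"
  done
)

# Example with placeholder names:
# unzip part1.zip -d lm/
# unzip part2.zip -d lm/
# concat_parts lm/amharic.lm.data.segmented lm/part1 lm/part2
```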

Habush Samireh

Nov 27, 2016, 11:53:52 AM
to kaldi-help
I extracted the zip files and concatenated them into a single file. Then I passed that file to the -text option of ngram-count. However, I am getting the following error:

one of modified KneserNey discounts is negative
error in discount estimator for order 2

Jan Trmal

Nov 27, 2016, 12:37:58 PM
to kaldi-help
those errors are non-fatal, usually.
y.


Daniel Povey

Nov 27, 2016, 2:41:05 PM
to kaldi-help
If you get errors like that which are fatal, you have to use a different
smoothing method for the order in question, like Good-Turing. I think
you can, for instance, remove the -kndiscount1 option if you get a
fatal error like that for 1-grams, and it will use Good-Turing instead.
Dan
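
One way to apply this without hand-editing the long command line is to generate the discount flags. The sketch below assumes, as suggested above, that an order without a -kndiscountN flag falls back to Good-Turing; `kn_flags` is a made-up helper name.

```shell
# Print -kndiscountN flags for orders 1..MAX_ORDER, leaving out every order
# listed in SKIP (a space-separated string such as "1" or "1 2").
# Usage: kn_flags MAX_ORDER [SKIP]
kn_flags() (
  max_order=$1
  skip=" ${2:-} "                     # pad with spaces for whole-word matching
  flags=""
  n=1
  while [ "$n" -le "$max_order" ]; do
    case "$skip" in
      *" $n "*) ;;                    # skipped order: emit no KN flag
      *) flags="$flags -kndiscount$n" ;;
    esac
    n=$((n + 1))
  done
  echo "${flags# }"                   # drop the leading space
)

# Example: order 5 with KN discounting disabled for 1-grams only
# ngram-count -order 5 -text TEXT -lm LM -unk $(kn_flags 5 1)
```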

Danijel Korzinek

Nov 27, 2016, 3:24:05 PM
to kaldi-help
KN smoothing doesn't work if you have very little data. Witten-Bell is recommended in those cases. Try adding the following to the command line:

-wbdiscount -gt1min 1 -gt2min 1 -gt3min 1
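
Put together with the rest of the command from the first post, this could look like the sketch below. It only prints the command so it can be inspected before running. It assumes that the -kndiscountN flags are dropped and that one -gtNmin flag is wanted per order, which goes slightly beyond the three shown above. `wb_command` is a made-up helper, and paths containing spaces are not handled.

```shell
# Print (do not run) an ngram-count command that uses Witten-Bell discounting.
# Usage: wb_command ORDER TEXT_FILE LM_FILE
wb_command() (
  order=$1
  text=$2
  lm=$3
  cmd="ngram-count -order $order -text $text -lm $lm -unk -wbdiscount"
  n=1
  while [ "$n" -le "$order" ]; do
    cmd="$cmd -gt${n}min 1"           # keep n-grams that occur at least once
    n=$((n + 1))
  done
  echo "$cmd"
)

# Example: wb_command 3 lm/amharic.lm.data.segmented lm/amharic.train.lm.data.arpa
```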

Habush Samireh

Nov 29, 2016, 6:20:57 AM
to kaldi-help
When I run it without the KN smoothing, the kernel kills ngram-count due to an out-of-memory error. Here is a screenshot of the last ten lines of the kernel log (found in /var/log/kern.log):




The hardware specification of the machine is:
  • CPU: 1 core (Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz)
  • RAM: 3.5GB
  • HDD: 50GB 
I am using this machine only for training Kaldi, i.e., no programs other than system processes are running. How can I resolve this?

Daniel Povey

Nov 29, 2016, 5:18:12 PM
to kaldi-help
Probably you just had too many other things running at the same time.  Kaldi is normally run on clusters of large machines, not on a single machine, and you need to watch how many processes you run at the same time if you don't have GridEngine installed.
Dan
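
To confirm which process the kernel killed, the kernel log can be searched directly. This is a minimal sketch that assumes the usual Linux OOM-killer messages ("Out of memory: ..." and "Killed process PID (name) ..."); the exact wording can differ between kernel versions, and `oom_kills` is a made-up helper name.

```shell
# List OOM-killer entries that mention PROGRAM in a kernel log file.
# Exit status is non-zero when no matching entry is found.
# Usage: oom_kills LOGFILE PROGRAM
oom_kills() (
  log=$1
  prog=$2
  grep -E "(Out of memory|Killed process).*\($prog\)" "$log"
)

# Example, using the log path mentioned above:
# oom_kills /var/log/kern.log ngram-count
# free -m    # shows how much memory is available before starting a job
```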



Habush Samireh

Dec 5, 2016, 12:25:05 PM
to kaldi-help, dpo...@gmail.com
After installing kaldi on another machine with more processing power and memory, I am facing a new error. The error is:
FstHeader::Read:Bad FST header: standard input
  Error while loading shared libraries: libkaldi-fstext.so..

I installed Kaldi the same way I did on the other machines, where I didn't encounter this error. What might be the cause?

Daniel Povey

Dec 5, 2016, 12:41:23 PM
to Habush Samireh, kaldi-help
Probably you moved the location of the Kaldi installation after you installed it.  The library search path is included in the binary.
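
A quick way to see which shared libraries a binary cannot find is to filter the output of `ldd`. The sketch below assumes a Linux system where `ldd` marks unresolved libraries with "=> not found"; `unresolved_libs` is a made-up helper name.

```shell
# Read `ldd` output on stdin and print the names of the libraries
# that the dynamic loader could not resolve.
unresolved_libs() {
  awk '/=> not found/ { print $1 }'
}

# Example:
# ldd "$(command -v arpa2fst)" | unresolved_libs
# readelf -d "$(command -v arpa2fst)" | grep -E 'RPATH|RUNPATH'   # embedded search path
```

If the embedded path points at the old installation directory, rebuilding Kaldi in its new location is the straightforward fix.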

Habush Samireh

Dec 13, 2016, 12:52:07 PM
to kaldi-help, hsam...@gmail.com, dpo...@gmail.com
Currently, I'm facing the following error while building the language model.

ERROR: VectorFst::Read: unexpected end of file: standard input

 
Here is the screenshot of the error:



How can I solve this?

Daniel Povey

Dec 13, 2016, 1:54:34 PM
to Habush Samireh, kaldi-help
That's a very generic error that happens when an FST was truncated in the middle.  Hard to debug with so little context.  You should probably modify the script to write the elements of that pipe to temporary files, which will make it easier to figure out what is happening.  Could be that some program was killed by the Linux OOM killer.

Dan
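
Splitting the pipe into separate stages could be sketched like this. Each stage reads one file and writes the next, and the run stops at the first stage that fails or writes nothing, which makes it easier to see where the FST was cut off. `run_stage` and the `tmp/` file names are made up for illustration.

```shell
# Run one pipeline stage: feed SRC to COMMAND on stdin, write stdout to DST,
# and fail if the command exits non-zero or DST ends up empty.
# Usage: run_stage SRC DST COMMAND [ARGS...]
run_stage() (
  src=$1
  dst=$2
  shift 2
  "$@" < "$src" > "$dst" || { echo "stage failed: $*" >&2; exit 1; }
  [ -s "$dst" ] || { echo "stage produced no output: $*" >&2; exit 1; }
)

# Example, following the first steps of the pipe from the original script:
# mkdir -p tmp
# run_stage lm/amharic.train.lm.data.arpa tmp/1.arpa grep -v '<s> <s>' || exit 1
# run_stage tmp/1.arpa tmp/2.fst arpa2fst - || exit 1
# run_stage tmp/2.fst tmp/3.txt fstprint || exit 1
```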
