Babel Recipe (Cantonese)


phenix...@gmail.com

May 30, 2018, 12:31:11 AM
to kaldi-help
Hello, 

I was working on Babel Cantonese following the Kaldi recipe and have some questions, as below:

    From the result file "results.101-cantonese-ful...@jhu.edu.2016-02-18T121522-0500"
    under folder "kaldi/egs/babel/s5d/results/"
    It is found that the best result is 37% CER, with the tri6_nnet_mpe setting, as below:

        # STT Task performance (CER), evaluated on 2016-02-19T22:47:09-0500
        %WER 37.0 | 10001 104181 | 65.6 25.3 9.1 2.6 37.0 72.6 | -0.301 | exp/tri6_nnet_mpe/decode_dev10h.pem_epoch4/score_10/dev10h.pem.char.ctm.sys

    As checked in the script run-1-main.sh, I see that the stm file points to the IndusDB (per the comments), which does not come with the Babel Cantonese corpus (LDC2016S02).
    So I would like some more details on the train set, the dev set, and the corresponding transcription/stm files used.

    Based on my checking, the info from the corpus is as below.
    Dev set:
          # of speakers: 120
          # of utterances (excluding empty/silence-only utterances): 10068
          # of characters (excluding <hes>, <unk>, and silence utterances): 120395
          total duration (excluding silence utterances): 17.75 hrs

    Train set:
          # of speakers: 952
          # of utterances (excluding empty/silence-only utterances): 79716
          # of characters (excluding <hes>, <unk>, and silence utterances): 972284
          total duration (excluding silence utterances): 140.64 hrs
          silence duration (from silence utterances): 2246.11 seconds
 
     This shows that the utterance and character counts of the dev set (10068, 120395) do not match the reference result (10001, 104181).
     If I don't have the exact list, would it be possible to get a similar result by using the same amount of data for the training and dev sets (picked randomly),
     with the transcriptions that come with the corpus?
     Can you share the amount of data used in training and testing for this purpose?
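For reference, counts like these can be reproduced from a Kaldi-style "text" file (one line per utterance: "<utt-id> <transcript>"). This is a sketch over a tiny stand-in file; the real file would be something like data/dev10h.pem/text, and the tag handling here is an assumption:

```shell
# Sketch: utterance/character counts from a Kaldi-style "text" file.
# A tiny stand-in file is used; the <hes>/<unk> stripping is an assumption.
text_file=$(mktemp)
cat > "$text_file" <<'EOF'
utt1 a b
utt2 <hes> c
EOF
num_utts=$(wc -l < "$text_file")
num_utts=$((num_utts))   # normalize any wc padding
# drop the utterance-id column, strip non-lexical tags and whitespace
num_chars=$(cut -d' ' -f2- "$text_file" | sed 's/<[a-z]*>//g' | tr -d ' \n' | wc -m)
num_chars=$((num_chars))
echo "utts=$num_utts chars=$num_chars"
rm -f "$text_file"
```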
Thanks a lot.
Phenix, 20180530
  

Jan Trmal

May 30, 2018, 12:38:16 AM
to kaldi-help
I'll check, but do not expect a response today; I'm traveling. The utterances that are missing might have been removed because they were too short or contained only noise or something like that.
Y.  


--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
To post to this group, send email to kaldi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/7f9a748e-b523-426d-a281-d947d141597c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

phenix...@gmail.com

May 30, 2018, 2:24:15 AM
to kaldi-help
Hi Yenda, 

Thanks a lot for your help ... :)

Thanks.
Best rgds,
Phenix, 20180530



Jan Trmal

May 30, 2018, 2:27:22 AM
to kaldi-help
It just occurred to me that the results are very old. The utterance counts aside, I'd say don't waste time reproducing those numbers (have a look at the local/chain scripts instead).
Y. 


Jan Trmal

May 30, 2018, 7:17:45 PM
to kaldi-help
The best CER I got (on the chain system) is 29.4% (scored with the stm):
| Sum/Avg  |10001    104181 | 72.8    19.7     7.5     2.2     29.4    69.9  | -0.130  |
The utterance and character counts seem consistent with the old results, which is good news, I guess...

The aforementioned CER from SCTK (with the stm) corresponds to the CER
%WER 34.14 [ 41719 / 122214, 6541 ins, 8931 del, 26247 sub ] exp/chain_cleaned_pitch/tdnn_flstm_bab8_sp_bi/decode_dev10h.pem//cer_9_0.0
obtained using the Kaldi scoring scripts
(10077 utterances in the decoded output, 10068 utterances with non-empty (non-silence) transcripts in the reference, and 122214 characters).

The stm contains information that some segments should be skipped during scoring; the lines look like this:
BABEL_BP_101_98675_20111117_190458_outLine 1 98675_B 599.304 600.100 IGNORE_TIME_SEGMENT_IN_SCORING
In total, there are 1455 such segments. Looking at the transcription, they seem to match the lines that contain '(())' (i.e. unintelligible, IIRC), but I am not sure whether that is the only case or whether some other lines are removed as well. Also, please note that the segments probably do not correlate well with the utterances.
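Counting those skip markers is a one-liner; a sketch over a two-line stand-in stm (the real stm path depends on your IndusDB layout):

```shell
# Sketch: count segments excluded from scoring in an stm file.
# A tiny stand-in file illustrates the format.
stm=$(mktemp)
cat > "$stm" <<'EOF'
BABEL_BP_101_98675_20111117_190458_outLine 1 98675_B 599.304 600.100 IGNORE_TIME_SEGMENT_IN_SCORING
BABEL_BP_101_98675_20111117_190458_outLine 1 98675_B 601.000 603.200 some transcript
EOF
n_ignored=$(grep -c 'IGNORE_TIME_SEGMENT_IN_SCORING' "$stm")
echo "ignored segments: $n_ignored"
rm -f "$stm"
```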

y.




phenix...@gmail.com

May 31, 2018, 5:27:16 AM
to kaldi-help
Hi Yenda, 

Thanks a lot for your help.
However, I am still not able to get similar utterance/character counts.

1. Could you suggest which script you are using to generate the stm file for the train/dev set from the transcriptions (.txt files)?
    (The script kaldi/egs/babel/s5c/local/prepare_stm.sh doesn't generate an stm file with the right utterance/character counts.)
    Or could you share the stm file you are using for comparison?

2. Under which settings was the CER of 29.4% generated?
    Is it published somewhere?

Thanks.
Best rgds,
Phenix, 20180531

Jan Trmal

May 31, 2018, 10:17:29 AM
to kaldi-help
If you are asking about the original stm, we got it from IARPA (we were Babel performers).
If you are asking about the Kaldi scoring scripts, you don't need any stm; the file data/dev10h.pem/text is used as the reference (plus there are filters in local/wer_output_filter).
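A rough sketch of that scoring path: both the reference and the hypothesis are piped through the same output filter before alignment. The filter below is a stand-in for local/wer_output_filter (the real one is recipe-specific); it drops non-lexical tags and squeezes whitespace:

```shell
# Sketch: stand-in for local/wer_output_filter (an assumption; the real
# filter is recipe-specific). Drops <hes>/<unk> and squeezes spaces.
filter() { sed -e 's/<hes>//g' -e 's/<unk>//g' -e 's/  */ /g'; }
ref_line="utt1 good <hes> morning"
filtered=$(printf '%s\n' "$ref_line" | filter)
echo "$filtered"
```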
y.

phenix...@gmail.com

Jun 1, 2018, 8:29:14 PM
to kaldi-help
Hi Yenda, 

Thanks a lot for your help and the info.

And one last thing: for the setup "exp/chain_cleaned_pitch/tdnn_flstm_bab8_sp_bi",
was it generated by any of the Kaldi recipes? If yes, where is the recipe/script located?
In "kaldi/egs/babel/s5d/local/chain/tuning", the latest script I see is "run_tdnn_lstm_bab7xxx.sh", and I found nothing like bab8.

Thanks a lot, and sorry for any trouble.

Best rgds,
Phenix, 20180602

Jan Trmal

Jun 4, 2018, 7:06:57 PM
to kaldi-help
I will have to check and get back to you tomorrow.
y.

Jan Trmal

Jun 5, 2018, 10:03:11 AM
to kaldi-help
Hi, I have created a PR here: https://github.com/kaldi-asr/kaldi/pull/2476
It contains the changes and the missing bab8 recipe.
BTW, Cantonese has significantly more data than the other Babel languages, so you might be able to tune it to perform (much) better than the official recipes do -- let me know if that is the case.
y.

phenix...@gmail.com

Jun 5, 2018, 11:12:51 PM
to kaldi-help
Hi Yenda, 

Thanks a lot for your help and the info.
I will update here if I get any better performance ... :)

Thanks.
Best rgds,
Phenix, 20180606

bl2...@columbia.edu

Nov 16, 2018, 2:18:12 PM
to kaldi-help
Hi all,

I am currently working with the 101 Cantonese dataset, and it seems my distribution is formatted differently. The folder name is
101-Delivery-Cantonese-v0.4c
Does anyone know anything about this distribution?

And is it possible for me to run the Babel Kaldi recipe on my dataset? From my understanding, the given conf file depends on a certain machine setup (does it have to be a JHU setup?). I don't have /export/babel/data/ or IndusDB, and I was wondering how I could set it up.

Thanks!
Bryan


Jan Trmal

Nov 16, 2018, 2:28:45 PM
to kaldi...@googlegroups.com
You won't be able to get IndusDB, unfortunately, unless you (your university) were a part of the original Babel program.
As for the other question:
./local/nist_eval/create_new_language_configs.FLP.sh \
  --language "101-Delivery-Cantonese-v0.4c" \
  --corpus "/path/to/the/language/directory" \
  --indus "/export/babel/data/scoring/IndusDB"
The script should be able to handle a non-existing IndusDB (but you won't be able to run KWS and produce official scores).
If not, you will have to modify/fix the script, or generate the config file by hand -- it shouldn't be that difficult.
y


bl2...@columbia.edu

Nov 17, 2018, 11:05:08 AM
to kaldi-help
Hi Yenda,

Thanks for that command. I managed to run the script by commenting out all the IndusDB-related lines.

I tried running run-1-main.sh, but I get an error in path.sh:
The file /export/babel/data/software/env.sh is not present -> Exit!

How do I get the BABEL software?

Jan Trmal

Nov 17, 2018, 11:07:51 AM
to kaldi...@googlegroups.com
I think that relates only to F4DE, which is used for scoring of the keyword search; you can safely comment it out, I think.
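One way to make that check non-fatal in path.sh is to guard it instead of exiting hard; a sketch (it assumes env.sh is only needed for the F4DE/KWS scoring, as suggested above):

```shell
# Sketch: guard the Babel-specific software check instead of exiting.
# Assumption: env.sh is only needed for F4DE / KWS scoring.
if [ -f /export/babel/data/software/env.sh ]; then
  . /export/babel/data/software/env.sh
  have_env=true
else
  echo "env.sh not found; continuing without KWS scoring support" >&2
  have_env=false
fi
```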
Y.

bl2...@columbia.edu

Nov 27, 2018, 9:36:31 PM
to kaldi-help
 Hi Yenda,

I was finally able to complete the execution of s5d/run-1-main.sh (using only 8 CPUs, the feature extraction took many days). I notice that the s5d directory does not have scripts like run-2a-nnet.sh, run-3-bnf-system.sh, and run-4-test.sh, only one for segmentation and one for anydecode. Will running the files in s5d give me the latest result of 29.4%, or should I use s5c?

Jan Trmal

Nov 27, 2018, 9:43:27 PM
to kaldi...@googlegroups.com
Hi, the scripts that are missing wouldn't give you the best results; that is the reason we deleted them. For the best-performing system you'd need to look into local/chain (probably run_tdnn_lstm.sh, but you might need to try the scripts in local/chain/tuning as well). So s5d is definitely the right recipe to work with.
And hopefully you have some GPUs -- without them, I don't think the training will finish in a reasonable time.
HTH
y.



bl2...@columbia.edu

Nov 27, 2018, 11:37:19 PM
to kaldi-help
Segmentation completed pretty quickly. If I run anydecode (run-4), can I run the local/chain scripts with a bit of effort?

I am using a GCP VM, so I can provision a P100.

bl2...@columbia.edu

Nov 28, 2018, 12:38:54 AM
to kaldi-help
run-4-anydecode.sh errors out with the status:
"declare -a kwsets='()'
./local/search/run_search.sh: line 40: kwsets[@]: unbound variable"

What should I do?

Jan Trmal

Nov 28, 2018, 9:53:52 AM
to kaldi...@googlegroups.com
There should be a parameter like "skip_kws" or something like that in run-4-anydecode.sh -- set that to true.
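The "unbound variable" message itself comes from bash's `set -u` interacting with an empty array; a minimal illustration (not the recipe's actual code) of the failure mode and a guard for it:

```shell
# Sketch: under "set -u", older bash (pre-4.4) treats expansion of an
# empty array as an unbound variable -- which is what run_search.sh hits
# when no keyword sets are configured. Guarding the expansion avoids it.
set -u
kwsets=()
n=0
if [ "${#kwsets[@]}" -gt 0 ]; then   # length expansion is safe even when empty
  for s in "${kwsets[@]}"; do n=$((n+1)); done
fi
echo "keyword sets processed: $n"
```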
y.

bl2...@columbia.edu

Nov 28, 2018, 7:10:20 PM
to kaldi-help
Thanks again, Yenda! Another question: should I run run-4-anydecode.sh first, or the chain models?

Jan Trmal

Nov 28, 2018, 7:12:08 PM
to kaldi...@googlegroups.com
I think I wouldn't care about decoding anything else than the chain system.
So I think you can ignore the decoding for now.
y.

bl2...@columbia.edu

Dec 12, 2018, 1:11:23 AM
to kaldi-help
I am unable to run the script, as I don't have a train_cleaned directory. I looked at run-1, -2, and -4 (which I ran as stated above), but don't see where this directory gets created. How do I get the cleaned dirs?

Jan Trmal

Dec 12, 2018, 4:29:29 AM
to kaldi...@googlegroups.com
There is a script in local called run_cleanup.sh or something like that.
Y.

bl2...@columbia.edu

Dec 15, 2018, 8:19:57 PM
to kaldi-help
Last question: I successfully ran the chain script and decoded it with run-4. How do I calculate the WER and CER?

bl2...@columbia.edu

Dec 16, 2018, 1:40:33 AM
to kaldi-help
I forgot to set run_stt=true, but on re-running run-4-anydecode.sh, I see score files generated.

So in the decode_dev10h.pem/ folder, there are score_8 through score_12. Each has several files that list WER and CER. Is there a script/way to see the WER and CER for all 10 hours, instead of these subsets?

Daniel Povey

Dec 16, 2018, 1:31:52 PM
to kaldi-help
Those aren't subsets; they are different LM scales. You'd normally do something like

grep Sum exp/your_dir/score_*/*ys | utils/best_wer.sh

Usually we have RESULTS files that make it clear what to do, but in the BABEL setup I see the commands to do this aren't in the results files.
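What utils/best_wer.sh boils down to -- picking the line with the lowest %WER among the per-LM-scale .sys files -- can be sketched with stand-in lines (file names and numbers here are illustrative; for CER you would grep the *.char.ctm.sys files, as in the earlier results):

```shell
# Sketch: pick the line with the lowest %WER, as utils/best_wer.sh does.
# The input lines below are illustrative stand-ins for "grep Sum" output.
best=$(printf '%s\n' \
  'score_9/dev10h.pem.char.ctm.sys:%WER 30.1 ...' \
  'score_10/dev10h.pem.char.ctm.sys:%WER 29.4 ...' \
  'score_11/dev10h.pem.char.ctm.sys:%WER 29.8 ...' \
  | sort -k2,2n | head -n 1)   # sort numerically on the %WER value
echo "$best"
```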
Dan
