--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
steps/online/nnet2/train_ivector_extractor.sh ...
utils/data/modify_speaker_info.sh --utts-per-spk-max 2 ... #for more diversity
steps/online/nnet2/extract_ivectors_online.sh ... splitted_speakers...
steps/cleanup/make_biased_lm_graphs.sh: creating utterance-group-specific decoding graphs with biased LMs
steps/cleanup/segment_long_utterances.sh: Decoding with biased language models...
steps/cleanup/decode_fmllr_segmentation.sh --beam 15.0 --lattice-beam 1.0 --nj 20 --cmd run.pl --mem 20G --mem 4G --skip-scoring true --allow-partial false exp/segment_long_utts_1e_train/graphs_uniform_seg exp/segment_long_utts_1e_train/train_uniform_seg exp/segment_long_utts_1e_train/lats
filter_scps.pl: warning: some input lines were output to multiple files [OK if splitting per utt]
steps/cleanup/decode_segmentation.sh --scoring-opts --num-threads 1 --skip-scoring true --acwt 0.083333 --nj 20 --cmd run.pl --mem 20G --mem 4G --beam 10.0 --model exp/segment_long_utts_1e_train/final.alimdl --max-active 2000 exp/segment_long_utts_1e_train/graphs_uniform_seg exp/segment_long_utts_1e_train/train_uniform_seg exp/segment_long_utts_1e_train/lats.si
steps/cleanup/decode_segmentation.sh: feature type is lda
steps/cleanup/decode_fmllr_segmentation.sh: feature type is lda
steps/cleanup/decode_fmllr_segmentation.sh: getting first-pass fMLLR transforms.
steps/cleanup/decode_fmllr_segmentation.sh: doing main lattice generation phase
steps/cleanup/decode_fmllr_segmentation.sh: estimating fMLLR transforms a second time.
steps/cleanup/decode_fmllr_segmentation.sh: doing a final pass of acoustic rescoring.
steps/cleanup/internal/get_ctm.sh --lmwt 10 --cmd run.pl --mem 20G --mem 4G --print-silence true exp/segment_long_utts_1e_train/train_uniform_seg data/langp exp/segment_long_utts_1e_train/lats
steps/cleanup/segment_long_utterances.sh: using default values of non-scored words...
run.pl: 1 / 20 failed, log is in exp/segment_long_utts_1e_train/lats/log/get_ctm_edits.*.log
# steps/cleanup/internal/stitch_documents.py --query2docs=exp/segment_long_utts_1e_train/query_docs/split20/relevant_docs.3.txt --input-documents=exp/segment_long_utts_1e_train/docs/split20/docs.3.txt --output-documents=- | steps/cleanup/internal/align_ctm_ref.py --eps-symbol="<eps>" --oov-word='<unk>' --symbol-table=data/langp/words.txt --hyp-format=CTM --align-full-hyp=false --hyp=exp/segment_long_utts_1e_train/lats/score_10/train_uniform_seg.ctm.3 --ref=- --output=exp/segment_long_utts_1e_train/lats/score_10/train_uniform_seg.ctm_edits.3
# Started at Tue Sep 5 19:43:52 CEST 2017
#
2017-09-05 19:44:09,006 [steps/cleanup/internal/align_ctm_ref.py:599 - main - ERROR ] Failed to align ref and hypotheses; got exception
Traceback (most recent call last):
  File "steps/cleanup/internal/align_ctm_ref.py", line 596, in main
    run(args)
  File "steps/cleanup/internal/align_ctm_ref.py", line 540, in run
    for reco, ref_text in read_text(args.ref_in_file):
  File "steps/cleanup/internal/align_ctm_ref.py", line 128, in read_text
    "".format(line, text_file.name))
RuntimeError: Did not get enough columns; line epspk1219-epf18bc63a.1-000000-003000 in <stdin>
2017-09-05 19:44:09,047 [steps/cleanup/internal/stitch_documents.py:136 - run - ERROR ] Error processing line epspk1230-ep18a49f30.1-010000-012200 epspk1230-ep18a49f30.1,1.00,1.00 in file exp/segment_long_utts_1e_train/query_docs/split20/relevant_docs.3.txt
2017-09-05 19:44:09,047 [steps/cleanup/internal/stitch_documents.py:147 - main - ERROR ] Failed to stictch document; got error
Traceback (most recent call last):
  File "steps/cleanup/internal/stitch_documents.py", line 144, in main
    run(args)
  File "steps/cleanup/internal/stitch_documents.py", line 133, in run
    file=args.output_documents)
IOError: [Errno 32] Broken pipe
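The RuntimeError above reports a reference line that carries only an ID and no words. One way to look for such lines before rerunning is sketched below. It assumes the file being checked uses the usual "<id> <word> <word> ..." layout; the function name `find_short_lines` is made up for this illustration and is not part of Kaldi.

```python
import io


def find_short_lines(lines, min_columns=2):
    """Return (line_number, line) pairs that have fewer than
    min_columns whitespace-separated fields."""
    bad = []
    for number, line in enumerate(lines, start=1):
        if len(line.split()) < min_columns:
            bad.append((number, line.rstrip("\n")))
    return bad


# Toy input: the first line has an ID but no text, like the one in the error.
sample = io.StringIO(
    "epspk1219-epf18bc63a.1-000000-003000\n"
    "utt-b hello world\n"
)
short = find_short_lines(sample)
for number, line in short:
    print("line %d has no text: %s" % (number, line))
```

To check a real file, pass an open file handle instead of the toy `io.StringIO` object. Whether empty documents are actually the root cause here is an assumption; the Broken pipe from stitch_documents.py looks like a consequence of the first script in the pipe exiting early.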
Hi Dan,
I finished running the recipe on data from a TV show (the text file is derived from subtitles), and the results look generally OK. However, a notable number of segments include a few unspoken words, omit spoken words (usually at the beginning or end of the segment, but also in the middle), or end before the last word is fully spoken. There are also a few segments that are quite inaccurate: they do not match the spoken audio at all, or they are missing a notable number of words.
My question is, is this level of quality good enough to include this new material for training?
steps/cleanup/segment_long_utterances.sh: using default values of non-scored words...
2019-05-29 06:19:28,145 [steps/cleanup/internal/get_non_scored_words.py:98 - read_lang - ERROR ] problem reading file data/lang//words.txt.int
Traceback (most recent call last):
  File "steps/cleanup/internal/get_non_scored_words.py", line 108, in <module>
    read_lang(args.lang)
  File "steps/cleanup/internal/get_non_scored_words.py", line 93, in read_lang
    for line in open(lang_dir + '/words.txt', encoding='utf-8').readlines():
  File "/usr/local/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 6792: invalid continuation byte
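The traceback says words.txt contains a byte (0xc9) that is not valid UTF-8 at that position. A minimal sketch for locating the offending lines follows. The helper name `find_invalid_utf8` is hypothetical, and the sample bytes are made up for illustration. For reference, 0xc9 is "É" in Latin-1, so a Latin-1 encoded entry is one possible cause, but that is only a guess.

```python
def find_invalid_utf8(raw_bytes):
    """Return (line_number, byte_offset_in_line, raw_line) for every
    line of raw_bytes that cannot be decoded as UTF-8."""
    bad = []
    for number, raw_line in enumerate(raw_bytes.split(b"\n"), start=1):
        try:
            raw_line.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the offset of the first undecodable byte in the line
            bad.append((number, err.start, raw_line))
    return bad


# Toy input: the second line contains a lone 0xc9 byte.
sample = b"<eps> 0\ncaf\xc9 1\nhello 2\n"
bad_lines = find_invalid_utf8(sample)
for number, offset, raw_line in bad_lines:
    print("line %d, byte %d: %r" % (number, offset, raw_line))
```

On a real lang directory the input would be read in binary mode, for example `open("data/lang/words.txt", "rb").read()`, so that Python does not try to decode it first.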
utils/data/get_utt2dur.sh: computed data/train_long_segmented/workdir/train_long_uniform_seg.temp/utt2dur
utils/data/modify_speaker_info.sh: copied data from data/train_long_segmented/workdir/train_long_uniform_seg.temp to data/train_long_segmented/workdir/train_long_uniform_seg, number of speakers changed from 1 to 2
utils/validate_data_dir.sh: Successfully validated data-directory data/train_long_segmented/workdir/train_long_uniform_seg
fix_data_dir.sh: kept all 2 utterances.
fix_data_dir.sh: old files are kept in data/train_long_segmented/workdir/train_long_uniform_seg/.backup
steps/compute_cmvn_stats.sh data/train_long_segmented/workdir/train_long_uniform_seg/
Succeeded creating CMVN stats for train_long_uniform_seg
steps/cleanup/segment_long_utterances.sh: Stage 3 (Building biased-language-model decoding graphs)
steps/cleanup/make_biased_lm_graphs.sh --nj 1 --cmd run.pl data/train_long//text data/lang/ data/train_long_segmented/workdir data/train_long_segmented/workdir/graphs
---------------------
data/train_long//text
data/lang/
sym2int.pl: replacing i with 696
sym2int.pl: replacing would with 696
sym2int.pl: replacing like with 696
sym2int.pl: replacing to with 696
sym2int.pl: replacing have with 696
sym2int.pl: replacing a with 696
sym2int.pl: replacing corrected with 696
sym2int.pl: replacing statement with 696
sym2int.pl: replacing from with 696
sym2int.pl: replacing aarp with 696
sym2int.pl: replacing concerning with 696
sym2int.pl: replacing my with 696
sym2int.pl: replacing prescription with 696
sym2int.pl: replacing drug with 696
sym2int.pl: replacing summary with 696
sym2int.pl: replacing that with 696
sym2int.pl: replacing is with 696
sym2int.pl: replacing sent with 696
sym2int.pl: replacing to with 696
sym2int.pl: replacing me with 696
sym2int.pl: not warning for OOVs any more times
** Replaced 137 instances of OOVs with 696
steps/cleanup/make_biased_lm_graphs.sh: creating utterance-group-specific decoding graphs with biased LMs
steps/cleanup/segment_long_utterances.sh: Decoding with biased language models...
The language is English.
I will check for non-UTF-8 characters, but isn't the replacement of all words by SIL an issue?
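To put a number on the replacement issue, a rough sketch like the one below counts how many transcript tokens are missing from words.txt. In the log above, very common words ("i", "to", "a") are being mapped to 696. One possible explanation, which is an assumption and not confirmed anywhere in this thread, is a casing mismatch between the transcript and the lexicon, so the sketch also counts OOVs that would disappear after upper-casing. The function names and the toy data are made up for illustration.

```python
def load_vocab(words_txt_lines):
    """words.txt lines look like '<word> <integer-id>'; return the set of words."""
    vocab = set()
    for line in words_txt_lines:
        fields = line.split()
        if fields:
            vocab.add(fields[0])
    return vocab


def oov_report(text_lines, vocab):
    """Count tokens, OOV tokens, and OOV tokens whose upper-case form is in vocab.
    Each text line is '<utterance-id> <word> <word> ...'."""
    total = oov = case_only = 0
    for line in text_lines:
        for word in line.split()[1:]:  # skip the utterance id
            total += 1
            if word not in vocab:
                oov += 1
                if word.upper() in vocab:
                    case_only += 1
    return {"tokens": total, "oov": oov, "oov_fixed_by_uppercasing": case_only}


# Toy data: an upper-case lexicon and a lower-case transcript.
words = ["<eps> 0", "I 1", "WOULD 2", "LIKE 3", "<unk> 4"]
text = ["utt1 i would like aarp"]
report = oov_report(text, load_vocab(words))
print(report)
```

If most of the OOVs fall into the last bucket, normalizing the case of the transcript to match the lexicon would be the first thing to try. If they do not, the words are genuinely missing from the lexicon.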