How to remove bad transcriptions more efficiently?


Cheng-Hung Hsueh

Oct 25, 2016, 10:49:14 AM
to kaldi-help
I have 150 hours of recordings, but I get 86% WER with a SAT model.
%WER 86.13 [ 18940 / 21991, 1328 ins, 5352 del, 12260 sub ]
After manually removing 1 hour of badly transcribed data, the WER dropped by 2%.
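
For reference, the counts in the %WER line above can be checked with a few lines of Python. This is only a sketch and not part of Kaldi: `parse_wer_line` is a hypothetical helper, and its regular expression assumes exactly the line layout shown above.

```python
import re

# Hypothetical helper: pulls the counts out of a Kaldi-style "%WER" summary line.
WER_PATTERN = re.compile(
    r"%WER\s+([\d.]+)\s+\[\s*(\d+)\s*/\s*(\d+),"
    r"\s*(\d+)\s+ins,\s*(\d+)\s+del,\s*(\d+)\s+sub\s*\]"
)

def parse_wer_line(line):
    match = WER_PATTERN.search(line)
    if match is None:
        raise ValueError("not a %WER line: " + line)
    wer, errors, ref_words, ins, dele, sub = match.groups()
    return {
        "wer": float(wer),
        "errors": int(errors),
        "ref_words": int(ref_words),
        "ins": int(ins),
        "del": int(dele),
        "sub": int(sub),
    }

stats = parse_wer_line("%WER 86.13 [ 18940 / 21991, 1328 ins, 5352 del, 12260 sub ]")
# WER = (insertions + deletions + substitutions) / reference words
recomputed = 100.0 * (stats["ins"] + stats["del"] + stats["sub"]) / stats["ref_words"]
print("reported %.2f, recomputed %.2f" % (stats["wer"], recomputed))
```

Substitutions account for 12260 of the 18940 errors, far more than insertions or deletions.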

The most common problems with the bad transcriptions are extra or missing words (`more or less words`) and `background music or noise`.
I tried removing the transcriptions that have a WARN in log/acc.* and log/align.*, but that did not work.

Of the transcriptions without a WARN, 20% are bad.
Of the transcriptions with a WARN, 40% are bad.

I checked the transcriptions manually, one by one.
How can I detect transcription errors more efficiently?

Daniel Povey

Oct 25, 2016, 2:53:24 PM
to kaldi-help
steps/cleanup/clean_and_segment_data.sh

--
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Cheng-Hung Hsueh

Oct 26, 2016, 8:35:52 AM
to kaldi-help, dpo...@gmail.com
I ran it and got the error `Wide character in die at utils/scoring/wer_per_utt_details.pl line 91, <STDIN> line 236333.`
My transcripts contain semicolons (`;`).

wer_per_utt_details.pl uses the semicolon as its separator.
Is that documented?

On Wednesday, October 26, 2016 at 2:53:24 AM UTC+8, Dan Povey wrote:
steps/cleanup/clean_and_segment_data.sh


Jan Trmal

Oct 26, 2016, 10:33:35 AM
to kaldi-help, Dan Povey
You can add the parameter --separator "@" to align-text and wer_per_utt_details.pl to use '@' as the separator
y.


Daniel Povey

Oct 26, 2016, 2:43:49 PM
to Jan Trmal, kaldi-help
Actually we're going to push a fix soon so that it doesn't matter if the sentences contain that separator.

Dan

Jan Trmal

Oct 26, 2016, 10:12:16 PM
to kaldi-help
It's been fixed. Pull the changes from Kaldi master -- it should work automatically, even without adding the parameter.
Let us know if it's not working.
y.

Ihc

Oct 27, 2016, 9:21:05 AM
to kaldi...@googlegroups.com
Thanks for fixing it so fast. :)

I will retrain the model again.




--
╯︵︵︵︵︵︵︵︵︵︵╰            | ̄ ̄ ̄ ̄| | ̄ ̄ ̄ ̄|
/           \            | ((oo) | | ((oo) |
/︵︵︵︵︵︵︵︵︵︵︵\            |________| |________| 
|           |            |      |
| /\___/\   /\__/\  |            |      |
|  . .   . .   |    /\__/\  /\___/\ |  /\___/\ |
|  (( oo) (oo ))   |    ˙(oo)˙ ˋ(°oo ° )ノ   ˋ(°oo ° )ノ

Cheng-Hung Hsueh

Oct 28, 2016, 6:39:13 PM
to kaldi-help
I got the following error in stage 5 of `steps/cleanup/lattice_oracle_align.sh`:
  
get_ctm_edits.py: could not make sense of edits line:
utter_name word1 word1 ; word2 word2 ; 

There are two copies of `get_ctm_edits.py`:

steps/cleanup/internal/get_ctm_edits.py

steps/cleanup/get_ctm_edits.py

The latter is the one that gets called.

I will open a PR that uses a regular expression on line 282 and keeps compatibility.



Line 282         edits_array = [ x.split() for x in edits_fields.split(";") ]
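
To illustrate why splitting on `;` is fragile, here is a small standalone sketch. It only reuses the expression from line 282; the function name `split_edits_naive` and the sample strings are made up for illustration and are not taken from the script or from my data.

```python
def split_edits_naive(edits_fields):
    # Same expression as line 282: split on ";" first, then on whitespace.
    return [x.split() for x in edits_fields.split(";")]

# Ordinary words: every entry is a [ref, hyp] pair.
clean = "word1 word1 ; word2 word2"
print(split_edits_naive(clean))

# Here the second pair is the literal token ";" aligned with itself, so the
# text has four semicolons and only two of them are real separators.
with_semicolon = "word1 word1 ; ; ; ; word2 word2"
print(split_edits_naive(with_semicolon))
```

In the second case the ";" pair is lost and empty entries appear instead, so the result can no longer be read as a list of two-word pairs.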


On Thursday, October 27, 2016 at 10:12:16 AM UTC+8, Yenda wrote:

Daniel Povey

Oct 28, 2016, 6:52:42 PM
to kaldi-help
Can you please copy get_ctm_edits.py to internal/get_ctm_edits.py, update the script to use the one in internal/, and remove get_ctm_edits.py?

It was not a good design to use ; for a separator there.  We have encountered this issue elsewhere.

Instead I recommend that you split on space (line.split()) then just verify that every third field is ';'.  

Please make a PR when you have got it working.
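
A rough sketch of the positional check described above, to make the idea concrete. This is not the code in get_ctm_edits.py: the function name `parse_edits_line`, the `separator` parameter, and the tolerance for a trailing separator are all assumptions made for this illustration.

```python
def parse_edits_line(line, separator=";"):
    """Parse '<utt-id> ref1 hyp1 ; ref2 hyp2 ; ...' by field position.

    Because only every third field is checked against the separator, a word
    that happens to equal the separator is still parsed correctly.
    """
    fields = line.split()
    if not fields:
        raise ValueError("empty edits line")
    utt_id, rest = fields[0], fields[1:]
    # Assumption: the last pair may or may not be followed by a separator.
    if len(rest) % 3 == 2:
        rest = rest + [separator]
    if len(rest) % 3 != 0:
        raise ValueError("could not make sense of edits line: " + line)
    pairs = []
    for i in range(0, len(rest), 3):
        ref_word, hyp_word, sep = rest[i:i + 3]
        if sep != separator:
            raise ValueError("expected '%s' as field %d of: %s" % (separator, i + 3, line))
        pairs.append((ref_word, hyp_word))
    return utt_id, pairs

print(parse_edits_line("utter_name word1 word1 ; word2 word2 ;"))
print(parse_edits_line("utter_name ; ; ; word2 word2"))
```

The second call shows a line in which the first ref/hyp pair is the `;` token itself; it still comes back as a pair because only the separator positions are checked.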
