What is the accuracy of your ASR with Kaldi using this speech?

Sabr Tasbolatov

Oct 28, 2016, 7:05:01 AM
to kaldi-help
With YouTube's auto-generated English subtitles, the transcript looks almost perfect to me: about 9 out of 10 words were right, and even at the UK Prime Minister's fluent speaking rate there was no significant latency.

I'm just curious: how far can Kaldi go with online ASR decoding?

Thanks

Daniel Povey

Oct 28, 2016, 3:49:31 PM
to kaldi-help
Those are not automatically generated subtitles. Notice that they say who is speaking, e.g. 'The Prime Minister: ...'. They were generated by a human.

You'd probably only get reasonably accurate subtitles (e.g. >90% accurate) if the acoustic model was trained on British English speech and the language model contained suitable data (e.g. parliamentary debates). This is true of any ASR system, not just Kaldi.
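
For anyone who wants to put a number on "accurate": the usual measure is word error rate (WER), i.e. the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The following is only a minimal sketch of that standard definition; the function name is made up for illustration, and it does no text normalisation (case, punctuation), which a real scoring tool would do.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (sub + del + ins) divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

With this definition, "9 out of 10 words right" with one substituted word corresponds to a WER of 0.1. Note that insertions also count as errors, so WER can exceed 1.0.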

Dan


--
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Danijel Korzinek

Oct 29, 2016, 12:59:10 PM
to kaldi-help, dpo...@gmail.com
You can actually turn on either the manual or the auto-generated transcription in that example. The auto-generated one seems genuine, especially since it stops transcribing when more than one person talks (yells) at once. However, we have no way of knowing whether the auto-generated subtitles use the manual ones for tuning. That would give the system a huge advantage and make it completely useless as a benchmark.

I suggest you upload a file of something that doesn't exist anywhere on YouTube and check that out.

Tony Robinson

Oct 31, 2016, 12:37:26 PM
to kaldi...@googlegroups.com
I'm interested in this sort of thing. I've just done some googling: if you want the video and both sets of subtitles, the command you need is:

$ youtube-dl --write-sub --write-auto-sub 'https://www.youtube.com/watch?v=-l03NVQf-D8'

(you may need to run apt-get install youtube-dl first, or get it from https://github.com/rg3/youtube-dl)

Regarding real time and latency: the reason everything appears with no significant latency is that either the ASR (in the case of automatic subtitles) or the alignment (in the case of manual transcription) is done in advance. If you have a look at the downloads you can see the formats used. These are then stitched together after the ASR/alignment to give no significant latency.
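
To look at the timing information in the downloaded subtitles programmatically, something like the following could serve as a starting point. It is a rough sketch that assumes the files are WebVTT with hh:mm:ss.mmm timestamps; the function names are made up for illustration, and it does not strip any inline markup that may appear inside cue text.

```python
import re

# Matches a cue timing line such as "00:00:01.000 --> 00:00:03.500".
CUE_TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(\d{2}):(\d{2}):(\d{2})\.(\d{3})"
)


def _to_seconds(hours, minutes, seconds, millis):
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000.0


def parse_vtt_cues(vtt_text):
    """Return a list of (start_seconds, end_seconds, text) tuples."""
    cues = []
    # Cues are separated by blank lines; blocks without a timing line
    # (e.g. the "WEBVTT" header) are skipped.
    for block in re.split(r"\n\s*\n", vtt_text.strip()):
        lines = block.splitlines()
        for index, line in enumerate(lines):
            match = CUE_TIMING.match(line.strip())
            if match:
                groups = match.groups()
                text = " ".join(l.strip() for l in lines[index + 1:] if l.strip())
                cues.append((_to_seconds(*groups[:4]), _to_seconds(*groups[4:]), text))
                break
    return cues
```

Each cue carries its own start and end time, which is what lets the player show text in sync with the audio regardless of how long the ASR or alignment took to produce.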

Kaldi can do real-time recognition, but the latency is in the order of 10s or so. Many broadcasts aren't live, so this isn't a problem. Here in the UK you can tell whether the subtitles (closed captions) are live or not: the live ones have the latency and are less accurate. The live ones are done by respeaking into a speaker-specific ASR system; I really don't know whether this has longer or shorter latency than sending the audio directly to an online speaker-independent ASR system.


Tony
--
Speechmatics is a trading name of Cantab Research Limited
We are hiring: www.speechmatics.com/careers
Dr A J Robinson, Founder, Cantab Research Ltd
Phone direct: 01223 794096, office: 01223 794497
Company reg no GB 05697423, VAT reg no 925606030
51 Canterbury Street, Cambridge, CB4 3QG, UK

Daniel Povey

Oct 31, 2016, 4:50:28 PM
to kaldi-help
On Mon, Oct 31, 2016 at 12:37 PM, Tony Robinson <to...@speechmatics.com> wrote:
Kaldi can do real time recognition, but the latency is in the order of 10s or so.  


Just a note: the latency of real-time recognition in Kaldi depends on a lot of factors, including the model type. I'm talking about the online2-nnet{2,3} setups here. The way it's designed, if you can get your system to run in real time (meaning it doesn't take longer than real time to run), then it processes the data as it comes in and there should be no significant latency at the end of the utterance (e.g. less than 0.1 second). This assumes that you can figure out where the end of the utterance is (e.g. you reached the end of the grammar, you saw enough silence, or someone pressed a button saying they were done).
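
A toy illustration of why running in real time implies low latency at the end of the utterance: audio arrives in fixed-size chunks, and a chunk can be decoded only once it has fully arrived and the decoder has finished the previous chunk. This is a simplified model, not Kaldi code; the function name, chunk size, and per-chunk costs are all hypothetical.

```python
def end_of_utterance_latency(chunk_seconds, processing_seconds):
    """Seconds between the end of the audio and the end of decoding.

    chunk_seconds: duration of each audio chunk.
    processing_seconds: decoding cost of each chunk, in arrival order.
    """
    finish = 0.0
    for index, cost in enumerate(processing_seconds):
        arrival = (index + 1) * chunk_seconds  # time at which this chunk is complete
        start = max(arrival, finish)           # wait for both the audio and the decoder
        finish = start + cost
    audio_end = len(processing_seconds) * chunk_seconds
    return finish - audio_end
```

In this model, if every chunk is decoded faster than it arrives, the final latency is just the cost of the last chunk. If decoding is slower than real time, the backlog grows with the length of the utterance.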

If you run this same system for captioning, you can get the best path in real time with essentially no latency (again, assuming the system is small enough to be run in real time, e.g. a 'chain' system). However, the best path may change, including deleting or changing past words, so you'd need the ability to go back and erase previously emitted words; the system does not have the capability to work out when words become "inevitable" in that they can't be affected by later context. The original 'online' setup (src/online/) did this, but I felt it was a feature that was very specific to certain applications, and that design made it impossible to get lattices, for instance, so the online2 design doesn't include that feature. (However, you could get that feature back by messing about deep enough in the code.)
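
On the display side, "go back and erase previously emitted words" can be sketched as follows: compare the words currently shown with the latest best path, keep the common prefix, erase what follows it, and append the rest. This is purely illustrative and not part of Kaldi; the function name and the word-list representation are assumptions.

```python
def caption_update(shown_words, best_path_words):
    """Return (number_of_words_to_erase, words_to_append).

    shown_words: words currently on screen.
    best_path_words: words of the latest best path from the decoder.
    """
    common = 0
    for old, new in zip(shown_words, best_path_words):
        if old != new:
            break
        common += 1
    # Everything after the common prefix is stale and must be replaced.
    return len(shown_words) - common, list(best_path_words[common:])
```

A caption client would call this each time a new partial result arrives, erase that many trailing words, and then append the new ones.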


 


Wonkyum Lee

Oct 31, 2016, 5:11:45 PM
to kaldi-help, dpo...@gmail.com
One example of a real-time Kaldi recognizer is at https://api.gridspace.com/scripts/try

