Transformer model for speech recognition


Tomasz Latkowski

Aug 16, 2018, 9:04:16 AM
to tensor2tensor
Hi!

I have a question regarding the Transformer model trained for the speech recognition problem. I am currently testing several ASR models and I was wondering how ASR based on the Transformer architecture performs in comparison to other architectures, for example DeepSpeech. Is there any article/paper comparing Transformer-based ASR with other models?

BR,
Tomasz

Lukasz Kaiser

Aug 16, 2018, 6:15:51 PM
to tlatk...@gmail.com, tensor2tensor
Hi!

> I have a question regarding the Transformer model trained for the speech recognition problem. I am currently testing several ASR models and I was wondering how ASR based on the Transformer architecture performs in comparison to other architectures, for example DeepSpeech. Is there any article/paper comparing Transformer-based ASR with other models?

For us, the ASR Transformer yields very good results on Librispeech
even without an LM (about 7.X% WER). We believe it's very good, but
there are few publications on Librispeech without an LM, so it's hard to
compare (the best with an LM seem to go down to 5.X%). It'd be great if you
wanted to take it on!

Lukasz

Tomasz Latkowski

Aug 20, 2018, 4:43:50 AM
to tensor2tensor
Lukasz, thank you for your fast answer!
I believe the Transformer model can achieve performance similar to much more complex and heavier models, such as those based on RNN+CTC+LM. I'm going to carry on research along that path.

BR,
Tomasz

Tomasz Latkowski

Aug 28, 2018, 9:34:24 AM
to tensor2tensor
@Lukasz,

One additional question: do you remember how long (approximately) it took to train the Transformer model on the LibriSpeech dataset using a Cloud TPU?

Thanks!

Lukasz Kaiser

Aug 29, 2018, 12:12:04 PM
to Tomasz Latkowski, tensor2tensor
> One additional question: do you remember how long (approximately) it took to train the Transformer model on the LibriSpeech dataset using a Cloud TPU?

We trained first on short sequences, then on all of them, to speed up
the process -- as described in the tutorial:
https://tensorflow.github.io/tensor2tensor/tutorials/asr_with_transformer.html

With this speedup method you should get reasonable results in about 12
hours (~8% WER) and really good (~7% WER) in 40h or so.
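Concretely, one way to realize that two-stage schedule with t2t-trainer is to cap max_length for a first run and then resume from the same output_dir without the cap. The step counts and the max_length value below are illustrative assumptions, not the exact recipe -- see the linked tutorial for that:

```shell
# Stage 1: train on short utterances only by capping max_length
# (the cap value here is an assumption; the tutorial gives the real one).
t2t-trainer \
  --model=transformer \
  --hparams_set=transformer_librispeech \
  --problem=librispeech_clean \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR \
  --hparams="max_length=100000" \
  --train_steps=100000

# Stage 2: resume from the same checkpoint directory and train on
# full-length utterances (no max_length override).
t2t-trainer \
  --model=transformer \
  --hparams_set=transformer_librispeech \
  --problem=librispeech_clean \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR \
  --train_steps=500000
```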

Give it a try and let us know if bugs have crept in or if there are any
problems reproducing!

Lukasz

Abdul Rafay Khalid

Sep 13, 2018, 2:02:50 PM
to tensor2tensor
Hi Lukasz.
Thanks for developing the Tensor2Tensor codebase. I am trying to reproduce the results on librispeech clean. I do not have access to a TPU, but I do have a multi-GPU machine. I've run about 500,000 steps in the truncated-utterance mode with the following settings and am seeing a WER of 40%.
t2t-trainer \
  --model=transformer \
  --hparams_set=transformer_librispeech \
  --problem=librispeech_clean \
  --train_steps=500000 \
  --eval_steps=3 \
  --local_eval_frequency=100
Can you give me some idea of the number of steps I need to run, both with truncated and with complete utterances, to get similar results? Thanks :)

刘奎

Sep 16, 2018, 10:50:21 PM
to tensor2tensor
Hi Khalid,
    Where is "https://tensorflow.github.io/tensor2tensor/tutorials/asr_with_transformer.html"? I could not find it. Could you send me a copy? I am a beginner at t2t.
BR
Zack

On Friday, September 14, 2018 at 2:02:50 AM UTC+8, Abdul Rafay Khalid wrote:

Ryan Sepassi

Sep 17, 2018, 12:09:06 PM
to liukui19...@gmail.com, tensor2tensor


Abdul Rafay Khalid

Sep 17, 2018, 1:16:37 PM
to tensor2tensor
Hi Ryan
What WER do you see after running the TPU example?

刘奎

Sep 19, 2018, 10:03:10 AM
to tensor2tensor
Got it! Thanks very much!

On Tuesday, September 18, 2018 at 12:09:06 AM UTC+8, Ryan Sepassi wrote:

Lam Dang

Sep 27, 2018, 1:43:20 AM
to tensor2tensor
Hi Khalid,

I ran into the same issue as you, using hparams_set=transformer_librispeech_v2.

The issue is that when running on TPU, batch_size=16 means 1 batch = 16 utterances, while by default it uses dynamic batching, and I think it then uses only 1 example per step.
In order to have a real batch size of 16, you need to add use_fixed_batch_size=True to force it to use 16 samples.
Some numbers from my experiments on a GTX 1080 Ti:
without use_fixed_batch_size=True: steps/sec = 10+, accuracy (1e6 steps) = 65%, GPU utilization ~ 50%
with use_fixed_batch_size=True: steps/sec = 1.5, accuracy (170K steps so far) = 89%, GPU utilization ~ 90%

Hope this helps.
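In case it helps anyone following along, those overrides can be passed on the command line via the generic --hparams flag. The invocation below just mirrors the settings discussed in this thread; the step count is an example, not a recommendation:

```shell
# Force a real batch of 16 utterances per step instead of dynamic batching.
t2t-trainer \
  --model=transformer \
  --hparams_set=transformer_librispeech_v2 \
  --problem=librispeech_clean \
  --hparams="batch_size=16,use_fixed_batch_size=True" \
  --train_steps=500000
```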

Abdul Rafay Khalid

Sep 27, 2018, 2:12:42 PM
to tensor2tensor
Thanks a lot Lam. I'm going to try this out and let you know how my experiment goes. I wasn't aware of the dynamic batching scheme. Thanks for pointing that out to me. I'll take a look.

Surendra Reddy

Oct 18, 2018, 4:38:59 AM
to tensor2tensor
Hi Lukasz,

What method did you use for WER?
Is the WER metric captured in the ASR Transformer tutorial? If yes, could you give me a pointer to the code implementation?

Regards,
Surendra
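For reference while waiting on a pointer to the t2t code: WER is conventionally computed as the word-level Levenshtein (edit) distance between reference and hypothesis, divided by the number of reference words. A minimal self-contained sketch (this helper is illustrative, not the t2t implementation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, wer("the cat sat on the mat", "the cat sit on mat") is 2/6: one substitution (sat → sit) plus one deletion ("the"), over six reference words.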