Question about transformer LM

Wenjie Peng

Mar 29, 2021, 11:12:33 PM

to kaldi-help

Hi,

I am using the latest PyTorch-based neural LM for rescoring, with a 4-gram LM for the first pass and a Transformer LM for the second pass. I expected the Transformer LM to improve performance significantly, but the actual refinement is limited.

After going through the paper, I found that the lattice size and the N-best list size are important for WER, so I have been tuning these two parameters, but I only get marginal improvements. I also noticed that the default lr scheduler in Kaldi has no warm-up stage, so I am wondering why Kaldi uses a warm-up-free strategy to train the Transformer LM.

Thanks in advance!

keli...@gmail.com

Mar 30, 2021, 12:59:38 AM

to kaldi-help

Thanks for your question. I assume you're talking about N-best rescoring with the PyTorch-based nnlm? If you compare the Transformer LM with the LSTM LM, the perplexities of the two models on SWBD are indeed similar, and thus the WERs are not very different. This may be because 1) we did not tune the Transformer LM very hard, and 2) the training data on SWBD, even with Fisher, is not large enough for Transformers to easily outperform LSTMs. Compared with the 4-gram LM, however, the Transformer LM improves WER from 12.8% to 10.8% with N-best rescoring, which I think is what we usually expect from a neural LM. So could you explain a bit more about 'actual refinement is limited'?

As for the lr scheduler and the warm-up trick, it is a good question. When we first experimented with the PyTorch-based Transformer LM, we observed that SGD gave better performance than Adam. So we did not experiment with the warm-up trick and the original lr scheduler from the 'Attention Is All You Need' paper. We're now modifying the code to match the commonly adopted lr scheduler for Transformers and will update the code and results if we get significantly better results than with SGD.
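For reference, the warm-up schedule from the 'Attention Is All You Need' paper can be sketched as below. This is a minimal illustration of that formula only, not the scheduler used in Kaldi's scripts; the function name `noam_lr` is made up here, and the defaults (`d_model=512`, `warmup_steps=4000`) are the paper's base-model values rather than tuned settings for any Kaldi recipe.

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at optimizer step `step` (1-based).

    Linear warm-up for `warmup_steps` steps, then inverse-square-root decay.
    The defaults are the base-model values from the paper, not Kaldi settings.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


if __name__ == "__main__":
    # The rate rises until step == warmup_steps and decays afterwards.
    for step in (1, 1000, 4000, 16000):
        print(f"step {step:>6}: lr = {noam_lr(step):.6f}")
```

In the paper this schedule is paired with Adam, which is why it is usually discussed together with the choice of optimizer.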

Let me know whether this answers your questions.

Ke

wenjie-p

Mar 30, 2021, 8:57:27 AM

to kaldi-help

Thanks for the reply, which is very helpful! I am sorry that I forgot to mention the dataset: in my experiment it is the Common Voice German dataset rather than SWBD. Compared with the n-gram-based two-pass decoding, the PyTorch-based nnlm further decreases the WER by about 0.8% absolute, versus 2% absolute on SWBD. Since the Common Voice German dataset is about 600h while SWBD is about 300h, I had expected the Transformer nnlm to bring a significant improvement as well. That's why I said the 'actual refinement is limited'. Your mention of the Fisher data made me realize that the difference may be due to the different amounts of text data used to train the PyTorch-based nnlm. In addition, I have tried tuning the sizes of the N-best list and the lattice as the paper suggested: increasing `N` for the N-best list and decreasing `epsilon` for the lattice do indeed improve the WER.
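To make the role of `N` concrete, here is a minimal sketch of N-best rescoring. It is not Kaldi's implementation: the `Hypothesis` class, the `top_n` and `rescore` helpers, the log-domain interpolation with `nn_weight`, and all scores are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    words: str
    am_score: float   # acoustic log-likelihood (higher is better)
    ngram_lm: float   # first-pass n-gram LM log-probability
    neural_lm: float  # second-pass neural LM log-probability


def top_n(hyps, n):
    """Keep the n best hypotheses according to the first-pass scores."""
    ranked = sorted(hyps, key=lambda h: h.am_score + h.ngram_lm, reverse=True)
    return ranked[:n]


def rescore(nbest, lm_scale=1.0, nn_weight=0.8):
    """Pick the best hypothesis after interpolating the two LM scores.

    The interpolation scheme and weights are assumptions for this sketch.
    """
    def total(h):
        lm = nn_weight * h.neural_lm + (1.0 - nn_weight) * h.ngram_lm
        return h.am_score + lm_scale * lm
    return max(nbest, key=total)


# Hypothetical scores: "hyp B" wins the first pass, "hyp A" wins after rescoring.
hyps = [
    Hypothesis("hyp A", -100.0, -12.0, -9.0),
    Hypothesis("hyp B", -98.0, -13.0, -15.0),
    Hypothesis("hyp C", -100.2, -12.5, -11.0),
]

print(rescore(top_n(hyps, 1)).words)  # only the first-pass best is available
print(rescore(top_n(hyps, 3)).words)  # a larger N lets the neural LM change the result
```

With `N = 1` the neural LM can only confirm the first-pass result; a larger list gives it alternatives to promote, at the cost of more rescoring time.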

keli...@gmail.com

Mar 30, 2021, 10:35:15 AM

to kaldi-help

I see. Yes, I think the gain from neural LMs usually depends on the dataset and the amount of training data, given a specific rescoring method. Do you know how many words are in your training set? If it's large enough, I think two things may further improve the performance: 1) further tune the Transformer LM, e.g., increase the model size, modify the lr, etc.; 2) train a Transformer-XL LM and use it with Kaldi's rescoring pipeline. We found that, compared with the standard Transformer LM, Transformer-XL can further reduce WER on SWBD to 10.2% or 10.3%, depending on 'N' and 'epsilon'. However, the rescoring time increases significantly, so we haven't supported Transformer-XL yet.

Ke

wenjie-p

Mar 30, 2021, 10:22:14 PM

to kaldi-help

Thanks for the reply. The number of words in my training set is roughly 1/3 of SWBD, which suggests that we should reduce the model's complexity. Thus, I tried changing the number of hidden layers from 6 to 4. Surprisingly, the WER increased slightly. I am now considering increasing the model's complexity as you suggested, but I am confused about why we would need to increase the model's complexity on a smaller dataset. I think your suggestions are based on the premise of 'large enough', and I am not clear what quantity can be regarded as 'large enough' (e.g., 1/2, 1/3, 1/4 of SWBD). Should the language difference be taken into consideration when we estimate the size of the training text data?

Daniel Povey

Apr 1, 2021, 3:45:20 AM

to kaldi-help

Neural LMs are very data-hungry. 1/3 of swb is very little to train an LM on. In the swb recipe we use more data than just the Swb transcripts, for LM training.

keli...@gmail.com

Apr 1, 2021, 3:40:47 PM

to kaldi-help

Just to add a little to what Dan mentioned: we use both SWBD and Fisher transcripts to train the neural LMs. With only 1/3 of SWBD, you may want to decrease the model size as you did or increase the degree of regularization. Though the difficulty of language modeling depends on the language, I doubt that the language difference matters more than the data size in your case.

软件开发工作经验

Apr 1, 2021, 10:01:07 PM

to kaldi-help

This probably means you have phones that were unseen in training and were not shared with other phones in the roots file. You should modify your roots file as necessary to fix this (i.e. share that phone with a similar but seen phone on one line of the roots file). Be sure to regenerate roots.int from roots.txt, if using s5 scripts. To work out the phone, search for pdf-id i in the output of show-transitions (for this model).

Below is the output of show-transitions:

Transition-state 689: phone = t hmm-state = 0 pdf = 227

Transition-id = 1385 p = 0.951219 [self-loop]

Transition-id = 1386 p = 0.0487805 [0 -> 1]

Transition-state 690: phone = t hmm-state = 1 pdf = 227

Transition-id = 1387 p = 0.866667 [self-loop]

Transition-id = 1388 p = 0.133333 [1 -> 2]

Transition-state 691: phone = t hmm-state = 2 pdf = 227

Transition-id = 1389 p = 0.857143 [self-loop]

Transition-id = 1390 p = 0.142857 [2 -> 3]

Transition-state 692: phone = ʧ hmm-state = 0 pdf = 228

Transition-id = 1391 p = 0.796954 [self-loop]

Transition-id = 1392 p = 0.203046 [0 -> 1]

Transition-state 693: phone = ʧ hmm-state = 1 pdf = 312

Transition-id = 1393 p = 0.837618 [self-loop]

Transition-id = 1394 p = 0.162382 [1 -> 2]

Transition-state 694: phone = ʧ hmm-state = 2 pdf = 308

Transition-id = 1395 p = 0.824561 [self-loop]

Transition-id = 1396 p = 0.175439 [2 -> 3]

Transition-state 695: phone = ʨ hmm-state = 0 pdf = 229

Transition-id = 1397 p = 0.75 [self-loop]

Transition-id = 1398 p = 0.25 [0 -> 1]

Transition-state 696: phone = ʨ hmm-state = 1 pdf = 229

Transition-id = 1399 p = 0.75 [self-loop]

Transition-id = 1400 p = 0.25 [1 -> 2]

Transition-state 697: phone = ʨ hmm-state = 2 pdf = 229

Transition-id = 1401 p = 0.75 [self-loop]

Transition-id = 1402 p = 0.25 [2 -> 3]

Transition-state 698: phone = td hmm-state = 0 pdf = 230

Transition-id = 1403 p = 0.75 [self-loop]

Transition-id = 1404 p = 0.25 [0 -> 1]

Transition-state 699: phone = td hmm-state = 1 pdf = 230

Transition-id = 1405 p = 0.75 [self-loop]

Transition-id = 1406 p = 0.25 [1 -> 2]

Transition-state 700: phone = td hmm-state = 2 pdf = 230

Transition-id = 1407 p = 0.75 [self-loop]

Transition-id = 1408 p = 0.25 [2 -> 3]

Transition-state 701: phone = tʤ hmm-state = 0 pdf = 231

Transition-id = 1409 p = 0.75 [self-loop]

Transition-id = 1410 p = 0.25 [0 -> 1]

Transition-state 702: phone = tʤ hmm-state = 1 pdf = 231

Transition-id = 1411 p = 0.75 [self-loop]

Transition-id = 1412 p = 0.25 [1 -> 2]

Transition-state 703: phone = tʤ hmm-state = 2 pdf = 231

Transition-id = 1413 p = 0.75 [self-loop]

Transition-id = 1414 p = 0.25 [2 -> 3]

Transition-state 704: phone = ʧd hmm-state = 0 pdf = 232

Transition-id = 1415 p = 0.75 [self-loop]

Transition-id = 1416 p = 0.25 [0 -> 1]

Transition-state 705: phone = ʧd hmm-state = 1 pdf = 232

Transition-id = 1417 p = 0.75 [self-loop]

Transition-id = 1418 p = 0.25 [1 -> 2]

Transition-state 706: phone = ʧd hmm-state = 2 pdf = 232

Transition-id = 1419 p = 0.75 [self-loop]

Transition-id = 1420 p = 0.25 [2 -> 3]

Transition-state 707: phone = ʧʤ hmm-state = 0 pdf = 233

Transition-id = 1421 p = 0.75 [self-loop]

Transition-id = 1422 p = 0.25 [0 -> 1]

Transition-state 708: phone = ʧʤ hmm-state = 1 pdf = 233

Transition-id = 1423 p = 0.75 [self-loop]

Transition-id = 1424 p = 0.25 [1 -> 2]

Transition-state 709: phone = ʧʤ hmm-state = 2 pdf = 233

Transition-id = 1425 p = 0.75 [self-loop]

Transition-id = 1426 p = 0.25 [2 -> 3]

yurii mytiai

Apr 5, 2021, 4:07:58 AM

to kaldi-help

Hi!
How much text data did you use for the LM training (I mean in MB)?
Thanks!

On Thursday, April 1, 2021 at 22:40:47 UTC+3, keli...@gmail.com wrote:

wenjie-p

Apr 5, 2021, 10:20:01 PM

to kaldi-help

Hi,

I have just checked the text data used for model training: my data is 27M while SWBD is 150M, and my data has 480402 utterances while SWBD has 3004158. This suggests that my data is roughly 1/6 of SWBD, which differs from my previous description of 1/3 because that number was for the data with 3-way SP.

I have been tuning the hyperparameters and found that increasing the dropout rate is much more effective than decreasing nlayers. Decreasing the batch size also helps. However, I am confused by the code in `steps/pytorchnn/data.py` below:
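The "roughly 1/6" estimate can be checked directly from the numbers quoted above. This is only the arithmetic; the dictionary keys are hypothetical names, and the unit of "M" is assumed to be megabytes.

```python
# Figures quoted in the post; the unit of "M" is assumed to be megabytes.
mine = {"text_size_m": 27, "utterances": 480402}
swbd = {"text_size_m": 150, "utterances": 3004158}

# How many times larger the SWBD training text is, per measure.
ratios = {key: swbd[key] / mine[key] for key in mine}
for key, ratio in ratios.items():
    print(f"{key}: SWBD is about {ratio:.1f}x larger")
```

Both measures land between 1/5 and 1/7 of SWBD, consistent with the 1/6 figure.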

```
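import os

def tokenize(self, path):
    """Tokenizes a text file."""
    assert os.path.exists(path)
    with open(path, 'r', encoding='utf-8') as f:
        all_ids = []
        for line in f:
            # '<s>' is appended to every line as the sentence-boundary symbol.
            words = line.split() + ['<s>']
            ids = []
            # (the rest of the method is not quoted in this post)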
```

It seems that Kaldi appends the 'beginning of sentence' token `<s>` at the end of each sentence. Two questions puzzle me:

1) Why append `<s>` at the end of every sentence rather than `</s>`?

2) Why add only `<s>`, rather than adding `<s>` and `</s>` at the beginning and end of each sentence respectively?

keli...@gmail.com

Apr 6, 2021, 11:19:31 AM

to kaldi-help

We follow the way the PyTorch word language model example processes data: https://github.com/pytorch/examples/blob/master/word_language_model/data.py

Since we did not observe much difference in either perplexity or WER between the two ways, we just keep using one symbol to represent the sentence boundary. Note that we do not train the model at the sentence level, so this preprocessing works fine.
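A small sketch of what "not training at the sentence level" means in practice: the sentences are concatenated into one token stream with a boundary symbol between them, and the stream is then cut into equal-length rows without regard to sentence boundaries. This is a simplified stand-in without PyTorch; the helper names, the toy vocabulary, and the row-based `batchify` are assumptions for illustration, not the code in `steps/pytorchnn/data.py`.

```python
def tokenize(lines, word2idx, boundary="<s>"):
    """Map each line to ids and append one boundary symbol per line."""
    ids = []
    for line in lines:
        for word in line.split() + [boundary]:
            ids.append(word2idx.get(word, word2idx["<unk>"]))
    return ids


def batchify(ids, batch_size):
    """Cut the stream into batch_size equal rows, dropping the remainder."""
    row_len = len(ids) // batch_size
    return [ids[i * row_len:(i + 1) * row_len] for i in range(batch_size)]


# Toy vocabulary and corpus (hypothetical).
vocab = {"<unk>": 0, "<s>": 1, "a": 2, "b": 3, "c": 4}
lines = ["a b", "c", "b a c"]

stream = tokenize(lines, vocab)
rows = batchify(stream, batch_size=2)
print(stream)
print(rows)
```

Because a row can start or end in the middle of a sentence, the boundary symbol only needs to mark where one sentence stops and the next begins, so a single symbol is enough for this setup.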

Ke

wenjie-p

Apr 6, 2021, 9:36:44 PM

to kaldi-help

Thanks for the reply, I will continue digging into this new feature.

wenjie-p

Apr 6, 2021, 9:47:42 PM

to kaldi-help

BTW, for my model training, after I added <s> and </s> at the beginning and end of each sentence respectively, I found a significant decrease in ppl (test ppl: 130 vs 82) but a slight increase in WER.

I think the current implementation in Kaldi (https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/steps/pytorchnn/data.py#L43) and the PyTorch example (https://github.com/pytorch/examples/blob/master/word_language_model/data.py#L41) are slightly different: the former appends <s> at the end of every sentence, while the latter appends an end-of-sentence token (`<eos>` in that example). Since this token is only used to separate sentences, I did not find a significant WER improvement from using </s> at the end of every sentence instead.
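One possible contributor to the perplexity drop reported above (an assumption, not something verified in this thread) is that per-token perplexity is averaged over all tokens, boundary symbols included. With both <s> and </s> in the stream, every <s> directly follows a </s> and is almost trivially predictable, so the average can fall even if the words themselves are predicted no better. The sketch below uses made-up probabilities to show the effect; it does not reproduce the 130 vs 82 figures.

```python
import math


def perplexity(logprobs):
    """Per-token perplexity from natural-log token probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))


# Hypothetical model probabilities for one five-word sentence.
word_logprobs = [math.log(0.01)] * 5   # the actual words
end_logprob = math.log(0.2)            # sentence-final boundary symbol
start_logprob = math.log(0.99)         # <s> right after </s>: nearly certain

one_symbol = word_logprobs + [end_logprob]
two_symbols = [start_logprob] + word_logprobs + [end_logprob]

ppl_one = perplexity(one_symbol)
ppl_two = perplexity(two_symbols)
print(f"one boundary symbol:  ppl = {ppl_one:.1f}")
print(f"two boundary symbols: ppl = {ppl_two:.1f}")
```

The word probabilities are identical in both cases, which is consistent with a lower perplexity that does not translate into a lower WER; perplexities computed over different token sets are not directly comparable.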
