kaldi-rnnlm performance on longer utterances


David

Feb 9, 2018, 10:35:30 AM
to kaldi-help

  I've been looking at the script (https://github.com/kaldi-asr/kaldi/blob/master/egs/swbd/s5c/local/rnnlm/tuning/run_tdnn_lstm_1e.sh) for training a Kaldi RNNLM, to examine its performance on my data. Thanks to all who have contributed to its release.

  I cannot run the script directly as I don't have the resources to do so, but I have deviated from it minimally and am training on Fisher transcripts. There were no signs of problems in training, with a final train/dev perplexity of 41.8 / 46.6 after 40 iterations (output at end of email).

  For testing, I've based things on the compute_sentence_scores.sh script. My problem is that as utterance lengths go beyond a few tens of words, performance drops off markedly. I first noticed this when scoring a single 1000-word utterance, whose perplexity was noticeably poor after, say, the first 30 words.

  To demonstrate with a toy example, take a test file with contents:
--------------------------------------
test1 Where are you calling from
--------------------------------------

This gives an average word log prob of -1.65.
Duplicating the phrase 5 times gives the contents below:
--------------------------------------
test1 Where are you calling from Where are you calling from Where are you calling from Where are you calling from Where are you calling from
--------------------------------------

Original          avg log prob per word  -1.65   (perplexity 5)
5 *  duplication  avg log prob per word  -1.28   (perplexity 4)
10 * duplication  avg log prob per word  -2.55   (perplexity 13)
20 * duplication  avg log prob per word  -4.97   (perplexity 144)
40 * duplication  avg log prob per word  -6.36   (perplexity 580)
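For reference, the perplexity column follows from the average word log probs as perplexity = exp(-avg log prob), assuming natural logs (the base is inferred from the numbers above rather than stated anywhere):

```python
import math

def perplexity(avg_log_prob):
    # Per-word perplexity from the average per-word natural-log probability.
    return math.exp(-avg_log_prob)

for lp in (-1.65, -1.28, -2.55, -4.97, -6.36):
    print(f"avg log prob {lp:6.2f}  ->  perplexity {perplexity(lp):7.1f}")
```

The small differences from the table (e.g. 578 vs 580) are just rounding of the reported log probs.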

Am I missing something here, or are there any workarounds, for instance some mechanism to reset the LSTM state periodically, to alleviate this?

Any advice would be greatly appreciated, and thanks again to all Kaldi contributors.

David

----------------------------------------
Example training data
---------------------
<s> Hello my name Hello how are you doing 
<s> Hi my name is Zelda 
<s> My name's Monique 
<s> Hi Monique 
<s> How're you doing 
<s> I'm fine 
<s> And you 
<s> Pretty good 
<s> Good 

Output of RNNLM build
---------------------
rnnlm/train_rnnlm.sh: best iteration (out of 40) was 38, linking it to final iteration.
rnnlm/train_rnnlm.sh: train/dev perplexity was 41.8 / 46.6.
Train objf: -5.20 -4.43 -4.25 -4.16 -4.10 -4.06 -4.02 -3.99 -3.97 -3.95 -3.93 -3.92 -3.90 -3.89 -3.88 -3.87 -3.85 -3.85 -3.84 -3.83 -3.82 -3.81 -3.80 -3.80 -3.79 -3.78 -3.78 -3.77 -3.76 -3.76 -3.75 -3.75 -3.74 -3.74 -3.73 -3.73 -3.72 -3.72 -3.71 -3.71
Dev objf:   -11.06 -11.06 -4.65 -4.34 -4.22 -4.15 -4.11 -4.07 -4.05 -4.02 -4.00 -3.98 -3.97 -3.96 -3.95 -3.94 -3.93 -3.92 -3.92 -3.91 -3.90 -3.90 -3.89 -3.89 -3.88 -3.88 -3.88 -3.87 -3.87 -3.86 -3.86 -3.86 -3.85 -3.85 -3.85 -3.85 -3.85 -3.85 -3.84 -3.84 -3.84

Daniel Povey

Feb 9, 2018, 2:06:35 PM
to kaldi-help
That's interesting.
Try adding "decay-time=10" to the xconfig lines that start with 'fast-lstmp-layer', and see if it helps.  You can also try different values like 5 and 10; hopefully there will be a value that resolves the long-utterance problem while not degrading the results.
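For concreteness, the edit would go on the fast-lstmp-layer lines in the recipe's xconfig; a sketch of what such a line might look like (the layer name and dimensions here are illustrative, loosely based on the swbd 1e recipe, so check them against your actual config):

```
fast-lstmp-layer name=lstm1 cell-dim=1024 recurrent-projection-dim=256 non-recurrent-projection-dim=256 decay-time=10
```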

Dan


--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+unsubscribe@googlegroups.com.
To post to this group, send email to kaldi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/6ee2e8f4-b669-4a07-95a9-a582ef9cc00b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

David

Feb 15, 2018, 1:33:46 PM
to kaldi-help

Hi Dan,
  Thanks for the swift response and suggestions. I had a look at this in the background, and here are some perplexity numbers from testing on a single 900-word utterance drawn from a different domain (no splits or sentence breaks of any kind). The kaldi-rnnlm numbers are computed using the compute_sentence_scores.sh script.

Ngram (3g)       275
No decay-time   8000
decay-time 5     234
decay-time 10    308

Given that all three systems reported very sensible perplexities at the end of the training process, I tried, as an alternative, the rnnlm-compute-prob binary used to calculate training perplexities.

No decay-time   208
decay-time 5    205
decay-time 10   206

Going back to compute_sentence_scores.sh, I repeated the experiment with each word treated as a separate utterance (each with exactly five words of preceding context):

No decay-time   212
decay-time 5    205
decay-time 10   211

So overall the decay-time reduction definitely helps, but it doesn't completely resolve the instability I'm seeing with unlimited prior context.

David

Daniel Povey

Feb 15, 2018, 1:41:28 PM
to kaldi-help, Ke Li
I asked Ke (cc'd) to look into this.

Something else that might help (that she was going to investigate, but it's great if you can look into it too) is to increase chunk_length=32 in train_rnnlm.sh to 64, e.g. via the option "--chunk-length=64".

If that exhausts GPU memory you may have to reduce the --num-chunk-per-minibatch option to rnnlm-get-egs; the default is 128, but I don't see that it's currently configurable at the script level. So reduce that to 64 if you run into memory problems.

Dan



Daniel Povey

Feb 15, 2018, 6:56:35 PM
to kaldi-help, Ke Li
Actually, looking at your numbers, it looks to me like 'decay-time=5' does resolve this problem, because it gives you the best numbers when using rnnlm-compute-prob, and (with that value) there is no degradation when computing the perplexities instead with compute_sentence_scores.sh.


David

Feb 16, 2018, 1:30:53 PM
to kaldi-help

Dan,
  Although decay-time 5 alleviates matters and doesn't obviously degrade performance, I still don't see that the long-span problem is completely resolved. I may not have explained my numbers sufficiently clearly. Here are the perplexity scores on 900+ words of text; all rnnlm results use the same rnnlm file (a new one with a decay time of 5 and chunk length of 64, as suggested).

                                     Perplexity
---------------------------------------------------------
1 ngram (trigram)                       275
---------------------------------------------------------
2 rnnlm-compute-prob                    205
---------------------------------------------------------
3 compute_sentence_scores.sh            240
4 compute_sentence_scores.sh-6g         205
5 compute_sentence_scores.sh-10g        203
---------------------------------------------------------

  Here, system 2, as I understand it, has its span limited to the chunk length, and systems 4 and 5 are hard-limited to 6 and 10 words of context respectively (using a rolling word buffer). Therefore only system 3 takes all 900+ words without a context reset.

  I don't have enough experience with RNNLMs to know whether very long contexts should help perplexity (subject to the usual caveat of diminishing returns), or whether contexts of hundreds of words can be expected to introduce instability.
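As an illustration of the rolling word buffer used for systems 4 and 5, a minimal reconstruction (the function name and interface here are made up for illustration; the actual test harness differs):

```python
def rolling_context_utterances(words, context):
    # One "utterance" per word, each carrying exactly `context` preceding
    # words, so the RNNLM state is rebuilt from scratch for every word and
    # the left context is hard-limited. When scoring, only the final word's
    # log prob of each utterance would be accumulated.
    utts = []
    for i, word in enumerate(words):
        utts.append(words[max(0, i - context):i] + [word])
    return utts

words = "where are you calling from where are you calling from".split()
for utt in rolling_context_utterances(words, 6)[-2:]:
    print(" ".join(utt))
```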

Thanks,

David

Daniel Povey

Feb 16, 2018, 3:53:10 PM
to kaldi-help
OK, it looks like there is still an issue.

I suppose we can view the RNNLM (or any recurrent neural net) as a chaotic system and maybe at some point in that sentence, it enters a "bad" attractor.  Presumably if you had trained on sentences like that, it would have trained itself so that attractor did not exist.

It would be interesting to see whether training on longer chunks helps that problem; however, I think as long as the chunk-size is significantly longer than decay-time, the training chunk size won't make much difference.

I suspect that this kind of thing depends in a hard-to-predict way on the neural net topology and how you trained it.  For instance, I was messing with a particular type of factorized LSTM recently (where the W_all matrix, which combines all 8 full matrices of the LSTM, is factorized into a product of two matrices, a tall one and a fat one, with the first of the two constrained to have orthonormal rows to prevent instability).  I found that while that setup worked well for small data, on bigger setups (like Switchboard) the performance was poor and we had some kind of bad behavior for longer chunks: the online decoding was worse than the regular decoding.


Dan



David

Feb 20, 2018, 1:15:04 PM
to kaldi-help


Dan,
  Thanks for your observations.

  Out of interest, whilst I have the figures available, here's a more complete version of the table I posted on Friday. All RNNLM results use the same model, with a decay-time of 5 and a training chunk-length of 64. System 3 scores all 900 words without any context reset, whilst systems 4+ are split during testing (using a rolling word buffer) to ensure a fixed left-context size.

                                      Perplexity
                                   900 words  600 words
----------------------------------------------------------------
1  ngram (trigram)                    275
----------------------------------------------------------------
2  rnnlm-compute-prob                 205        114
----------------------------------------------------------------
3  compute_sentence_scores.sh         240        139
----------------------------------------------------------------
4  compute_sentence_scores.sh-4g      219        121
5  compute_sentence_scores.sh-5g      210
6  compute_sentence_scores.sh-6g      205
7  compute_sentence_scores.sh-10g     203        109
8  compute_sentence_scores.sh-16g     206        113
9  compute_sentence_scores.sh-24g     216        119
10 compute_sentence_scores.sh-28g     242        131
11 compute_sentence_scores.sh-31g     347
12 compute_sentence_scores.sh-32g     478
13 compute_sentence_scores.sh-33g     631        364
14 compute_sentence_scores.sh-34g     564
15 compute_sentence_scores.sh-35g     370
16 compute_sentence_scores.sh-40g     232
----------------------------------------------------------------

For some reason, this system, which appears to train happily with a dev-test perplexity of 46 on Fisher transcripts, seems weirdly susceptible to left-context size. Ten words of context is good here (which happens to be the average length of each training utterance). Thirty-three words of context performs noticeably badly, after which performance improves again. I added an additional column for 600 words from a different domain, which exhibits the same pattern.

David

Daniel Povey

Feb 20, 2018, 6:05:24 PM
to kaldi-help
Try creating an additional copy of the training sentences to include in the training data, with appended-together groups of, say, 10 sentences.

If it works well we could start doing it as a matter of course.
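A sketch of that augmentation (a hypothetical helper, not an existing Kaldi script): keep the original corpus and append one extra line per consecutive group of 10 sentences:

```python
def augment_with_concatenations(sentences, group_size=10):
    # Original sentences plus one long "utterance" per group of
    # `group_size` consecutive sentences, joined into a single line.
    joined = [" ".join(sentences[i:i + group_size])
              for i in range(0, len(sentences), group_size)]
    return sentences + joined

corpus = [f"sentence number {n}" for n in range(25)]
augmented = augment_with_concatenations(corpus)
print(len(corpus), "->", len(augmented))
```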





David

Feb 20, 2018, 6:29:36 PM
to kaldi-help

I was also thinking of concatenating training utterances. However, treating it as an augmentation strategy by including both the singular and merged forms, as you suggest, would be better. I'll set this running before the weekend. Thanks.

Daniel Povey

Feb 20, 2018, 6:47:55 PM
to kaldi-help
One option you have when concatenating utterances: instead of just concatenating like
hello whats your name 
my name's dave
to
hello whats your name my name's dave
do instead:
hello whats your name </s> my name's dave

That is, separated by EOS symbols.  The option to do this was part of the plan from the start, but it was never actually implemented before (there might still be bugs).  IIRC I made sure that the validation script would accept this type of data.  The idea is that the model will predict the EOS, and can learn that, when seen as history, EOS basically behaves like a BOS symbol, except that the preceding context should not be ignored as it's part of the same stream of text.

Dan




François Hernandez

Feb 21, 2018, 6:01:18 AM
to kaldi-help
Hi,

We conducted similar experiments on an RNNLM model we trained on the tedlium LM data. It seems to be more robust.
See the results below. (Perplexity numbers are obtained from average negative log probs.)

TEST 1 - concatenation of the same utterance
1 utt (39 words) : PPL = 63.6
20 utts (780 words) : PPL = 64.6
40 utts (1560 words) : PPL = 66.3

TEST 2 - concatenation of 30 different utterances (933 words)
PPL based on average neg log probs of each utt : 85
PPL of concatenated utts : 128
But I believe this can also be explained by the fact that we're concatenating utts which are not necessarily linked in any way, so the context might not be ideal at the beginning of each part.
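For what it's worth, the two numbers also aggregate differently, independent of any context effect: averaging per-utterance average neg log probs weights each utterance equally, whereas the concatenated PPL weights each word equally. A toy sketch of just that aggregation difference (made-up numbers):

```python
import math

def ppl_over_words(neg_log_probs):
    # exp(total neg log prob / total word count): every word weighted equally.
    return math.exp(sum(neg_log_probs) / len(neg_log_probs))

def ppl_over_utt_averages(utts):
    # exp(mean of per-utterance average neg log probs): every utterance
    # weighted equally, so short utterances count as much as long ones.
    per_utt = [sum(u) / len(u) for u in utts]
    return math.exp(sum(per_utt) / len(per_utt))

# Toy numbers: a short easy utterance and a long harder one.
utts = [[2.0, 2.0], [5.0] * 8]
print(ppl_over_words([x for u in utts for x in u]))
print(ppl_over_utt_averages(utts))
```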

François




Jan Trmal

Feb 21, 2018, 6:45:05 AM
to kaldi-help
Do the lengths (or the statistics on the lengths) of the training sentences differ significantly for these two corpora?
For example, does tedlium include long sentences, or is the average length much higher?
y


François Hernandez

Feb 21, 2018, 7:07:55 AM
to kaldi-help
I don't know about the data the first tests were conducted on, but here are some stats for tedlium:
- 14,469,724 lines
- 249,941,072 words
--> around 17 words per line

Also, it contains some quite long utterances: around 5000 are more than 500 words. That's a small proportion of the whole data, but it may indeed help here.

David

Feb 23, 2018, 11:38:45 AM
to kaldi-help

  Here are the results from using data augmentation to increase the average utterance length. As before, the first row is a simple ngram, and the second row ("whole") treats all 900 words as a single utterance with no context resets. The remaining rows use a rotating buffer of words to ensure a fixed left-context size.

  There are two augmented columns here: 80 iterations, which is the final iteration (results collated whilst dev set log probs were still being calculated), and 40 iterations, which turned out better. The 80-iteration augmented network, counter-intuitively, seemed worse with longer contexts and better than the unaugmented system for shorter contexts. The 40-iteration (half-way) network was the best of all and is clearly more robust across all context lengths. So it looks as if overfitting is a factor as far as perplexities on this non-Fisher data are concerned, though possibly not the whole story. Notes on the topology of the model build are at the end of this post. The second system, which uses </s> conjunctions between appended utterances, is underway.

  I guess my best strategy is to incorporate the extra data I ultimately intended adding, rather than trying to get things right on Fisher alone first of all.

---------------------------Perplexity---------------------------
                 Non-augmented  Aug. (80 iters)  Aug. (40 iters)
-----------------------------------------------------------------
ngram (trigram)       275
-----------------------------------------------------------------
whole                 240             282              216
-----------------------------------------------------------------
4g                    219             213              205
5g                    210             205              199
6g                    205             205              192
10g                   203             196              186
16g                   206             190              178
24g                   216             187              177
28g                   242             186              176
33g                   478             195              182
40g                   232             249              216
-----------------------------------------------------------------

Some notes on the build: about 18M words of Fisher, with an average of ten words per utterance before augmentation and around twenty after. Default learning-rate settings and the topology specified below. Also uses a 64-word chunk length and an aggressive decay time of 5.

embedding_dim=1024
lstm_rpd=256
lstm_nrpd=256


rnnlm/train_rnnlm_chunklen64.sh: best iteration (out of 80) was 77, linking it to final iteration.
rnnlm/train_rnnlm_chunklen64.sh: train/dev perplexity was 41.6 / 50.8.
Train objf: -5.37 -4.62 -4.43 -4.33 -4.26 -4.22 -4.18 -4.15 -4.13 -4.11 -4.09 -4.07 -4.05 -4.04 -4.03 -4.02 -4.01 -4.00 -3.99 -3.98 -3.96 -3.95 -3.95 -3.94 -3.94 -3.93 -3.92 -3.92 -3.91 -3.90 -3.90 -3.89 -3.89 -3.88 -3.88 -3.87 -3.86 -3.86 -3.85 -3.85 -3.85 -3.84 -3.84 -3.83 -3.83 -3.82 -3.82 -3.81 -3.81 -3.81 -3.80 -3.80 -3.79 -3.79 -3.79 -3.78 -3.78 -3.78 -3.77 -3.77 -3.76 -3.76 -3.76 -3.76 -3.75 -3.75 -3.75 -3.74 -3.74 -3.73 -3.73 -3.73 -3.73 -3.73 -3.72 -3.72 -3.71 -3.71 -3.71 -3.71 
Dev objf:   -11.06 -11.06 -4.96 -4.63 -4.51 -4.40 -4.36 -4.30 -4.29 -4.26 -4.23 -4.23 -4.21 -4.17 -4.17 -4.16 -4.15 -4.14 -4.13 -4.13 -4.12 -4.10 -4.09 -4.09 -4.09 -4.08 -4.08 -4.06 -4.07 -4.05 -4.06 -4.05 -4.04 -4.04 -4.02 -4.04 -4.02 -4.01 -4.02 -4.01 -4.01 -4.00 -4.01 -4.00 -4.01 -3.99 -3.98 -3.98 -3.97 -3.97 -3.98 -3.98 -3.98 -3.98 -3.97 -3.97 -3.97 -3.97 -3.96 -3.96 -3.96 -3.95 -3.96 -3.96 -3.95 -3.95 -3.95 -3.94 -3.94 -3.95 -3.94 -3.94 -3.95 -3.93 -3.93 -3.94 -3.93 -3.93 -3.93 -3.93 -3.93 


Daniel Povey

Feb 23, 2018, 2:35:25 PM
to kaldi-help
Hm.  Possibly l2 regularization would help.  E.g. you could add

  tdnn_opts="l2-regularize=0.001"
  lstm_opts="l2-regularize=0.0005"
  output_opts="l2-regularize=0.0005"

and add $tdnn_opts to the 'relu-renorm-layer' lines and $lstm_opts to the 'fast-lstmp-layer' lines and $output_opts to the 'output-layer' lines.

However I have no idea whether these are suitable values; they are what we use for acoustic model training but they may be totally the wrong order of magnitude for this.
If the regularization is high, the model will change too fast: e.g., if you grep for Relative in the progress.log files, you'll see substantially higher numbers than your current run.  If it's too small it won't make much difference.
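Applied to the xconfig, that might look something like the following sketch (layer names and dimensions here are illustrative, not taken from your actual config):

```
relu-renorm-layer name=tdnn1 dim=1024 $tdnn_opts
fast-lstmp-layer name=lstm1 cell-dim=1024 recurrent-projection-dim=256 non-recurrent-projection-dim=256 decay-time=5 $lstm_opts
output-layer name=output $output_opts
```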

Dan




Tony Robinson

Feb 26, 2018, 4:34:09 AM
to kaldi...@googlegroups.com
I did quite a bit of the work on the TEDLIUM LM training data. It came from the Google billion word language model task, which in turn is "derived from the training-monolingual.tokenized/news.20??.en.shuffled.tokenized data distributed at http://statmt.org/wmt11/translation-task.html".  Assuming this chain is still what is used, the source data is sentence-shuffled, so randomising the order again shouldn't make any difference.


Tony

[ sorry if some fake anti-spoof message gets prepended to this email, I don't know why it happens ]



David

Feb 26, 2018, 5:32:58 AM
to kaldi-help



Out of academic interest, has anyone used this rnnlm (or others with letter features) to deal with capitalised training texts? Three approaches I can think of are:

1) Treating, say, 'b' and 'B' independently is the obvious approach. Thus buffalo and Buffalo would share many features, but not all, though we effectively double the number of letters, which has a cost.

2) Ignoring case altogether and relying on the underlying ngram to determine case. Neither model necessarily needs capitalisation, and it could improve training robustness and maybe improve the orthogonality between the two.

3) Mapping letter ngram features to lower case only and introducing a new capitalisation feature (or features), say num-prefix-capitals, which would be 0 for 'buffalo', 1 for 'Buffalo' or 4 for 'NATO'. I thought this might make the feature space more concise and help robustness to real-world LM training data (as opposed to corpus transcriptions), where capitalisation is inconsistently applied.
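Option 3's proposed feature is simple to state concretely; a hypothetical sketch (no such feature currently exists in the kaldi-rnnlm feature extraction, as far as I know):

```python
def num_prefix_capitals(word):
    # Count of leading capital letters: 0 for 'buffalo', 1 for 'Buffalo',
    # 4 for 'NATO'. Letter-ngram features would then be taken on word.lower().
    count = 0
    for ch in word:
        if ch.isupper():
            count += 1
        else:
            break
    return count

for w in ("buffalo", "Buffalo", "NATO"):
    print(w, num_prefix_capitals(w))
```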

Would be interested if anyone has compared these approaches or indeed has tried anything like the latter.

David Pye 

David Pye

Mar 5, 2018, 6:59:12 AM
to kaldi-help


  I've been running more experiments in the background regarding the instability issue above.

  Firstly, as over-fitting looked to be an issue, I ran with the same model schema but padded out with ten times more data. Disappointingly, this did not resolve the instability, and performance again deteriorated after a few tens of words.

  Secondly, I tried converting the text to be purely lower case. The resulting rnnlm models performed as well as I originally expected and were perfectly stable: performance improved (until convergence) as the left context increased. This was true using Fisher data alone as well as the larger padded training set. My model builds still included parameter settings from previous attempts to remedy the problem, i.e. an aggressive decay time of 5 and a chunk length of 64. Running the smaller model build without the decay-time restriction surprisingly (to me) made performance a little worse, though not unstable.

  I'm not sure why capitalisation proved problematic for me. I'd guess it's due to a doubled letter-space sparsity issue rather than something intrinsic to capitalisation in the code. Ideally I would have experimented with L2 regularization to keep the weights in check, but I don't have the time and resources at the moment to determine sensible values for the hyper-parameters.

  Thanks for previous suggestions,

  David

Daniel Povey

Mar 6, 2018, 2:51:40 PM
to kaldi-help




Regarding capitalization and sparsity: I think it would make sense to augment the feature-extraction code so that it would somehow treat capitals in a way that preserves their connection with the corresponding lower-case letters.  The options here are quite complicated though.
I wonder if you could have some kind of data mismatch, whereby your test data differs systematically in capitalization from your training data.
 



I looked at the progress logs you sent (progress.70.log, for a run that had about 80 iterations, for the lower-case and upper-case models).  I didn't see anything that different between the two setups.  The speed of learning for the various layers was about right (0.05 is a good value to see; you can ignore the lstm_nonlin's, they only contain the diagonal peephole parameters):

LOG (nnet3-show-progress[5.3]:main():nnet3-show-progress.cc:150) Relative parameter differences per layer are [ tdnn1.affine:0.0314104 lstm1.W_all:0.0568102 lstm1.lstm_nonlin:0.00843273 lstm1.W_rp:0.060017 tdnn2.affine:0.0657314 lstm2.W_all:0.0566665 lstm2.lstm_nonlin:0.00231254 lstm2.W_rp:0.052529 tdnn3.affine:0.0624975 output.affine:0.0626047 ]

However, I did see something that wasn't quite right, although it was the same for both of your models.

Notice below that the "self-repaired-proportion" is 0.9, while it shouldn't normally be more than about 0.02 or so.  And look at the percentiles of "deriv-avg": most of these are zero or 1, meaning that the bulk of the relu's at the output of tdnn1 are either "always off" or "always on".  It's therefore not really learning anything useful, and what's more, it's giving a rather tight bottleneck of dimension 500 or so (since only about half are "on"), while your embedding dim is 1024.  So you've lost half your embedding dim, for purposes of the network input.

component name=tdnn1.relu type=RectifiedLinearComponent, dim=1024, self-repair-scale=1e-05, count=2.11e+06, self-repaired-proportion=0.900339, value-avg=[percentiles(0,1,2,5 10,20,50,80,90 95,98,99,100)=(0,0,0,0 0,0,0.03,0.73,1.2 1.5,1.9,2.1,3.8), mean=0.366, stddev=0.541], deriv-avg=[percentiles(0,1,2,5 10,20,50,80,90 95,98,99,100)=(0,0,0,0 0,0,0.36,1.0,1.0 1.0,1.0,1.0,1.0), mean=0.482, stddev=0.483]

Basically the fix for this is to increase "self-repair-scale".  You can add to the xconfig line (it's fine to add it to all lines, including LSTM and output lines, I think):

self-repair-scale=1.0e-04

This will make the mechanism 10 times stronger than it is by default, and will help nudge the parameters back to "good" values.  Once you fix this you may find your network overfits, and it may be helpful to reduce the embedding dim a bit (IIRC we normally use about 600, rather than 1024).
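To illustrate what those deriv-avg percentiles measure, here is a standalone sketch with simulated pre-activations (not Kaldi code): for a ReLU, deriv-avg per unit is the fraction of frames on which the unit is active, so values pinned at 0 or 1 flag "always off" / "always on" units.

```python
import random

random.seed(0)
num_units, num_frames = 512, 400

# Simulate pre-activations: units 0-149 saturated off, units 150-299
# saturated on, the rest healthy (centred near zero).
def bias(u):
    if u < 150:
        return -10.0
    if u < 300:
        return 10.0
    return 0.0

deriv_avg = []
for u in range(num_units):
    active = sum(random.gauss(bias(u), 1.0) > 0 for _ in range(num_frames))
    deriv_avg.append(active / num_frames)

always_off = sum(d < 0.01 for d in deriv_avg)
always_on = sum(d > 0.99 for d in deriv_avg)
print(f"always off: {always_off}, always on: {always_on}, "
      f"healthy: {num_units - always_off - always_on}")
```

In the tdnn1 diagnostic above, the bulk of the units fall into the first two buckets, which is what the 0/1 percentiles are showing.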

Dan


