Reusing/merging egs from two different runs


Miguel Jette

Oct 26, 2017, 5:39:33 PM
to kaldi-help
Hi there,

I have trained a large DNN AM (chain nnet3 tdnn_lstm), and kept the 'egs' folder.
I now have new data that I want to use to train a new DNN (chain nnet3 tdnn_lstm). In fact, I want to merge the new data with my "old" data and re-train the DNN (one more iteration perhaps, or a completely new training run).

I have reproduced the lats folder and the egs folder for my new small data set.
I have been able to merge the lats, but how can I merge the egs? 

I realize I could re-create them by merging the lats, and running the DNN training script, but it takes a long time and a lot of space. I'd rather skip that if possible.

Thank you!
Miguel

Daniel Povey

Oct 26, 2017, 6:20:15 PM
to kaldi-help
Hm.  So I assume you'll be starting from the same tree and will be using the same ivector extractor, otherwise this is impossible.
I don't believe there is a mechanism that currently exists to do this, but in principle what you could do is just dump egs for the new data, and manually create the data structure of the egs directory for the combined data, by somehow interleaving the new and old egs and making soft links.  You might have to decide whether to take the training and validation data subsets from the new or the old egs dir; that choice is not critical.
And you'd set info/num_archives to the sum of the two source num_archives.
I might check in a script to do all this, if you made one.



Dan
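A minimal sketch of the info/num_archives step described above, not an existing Kaldi script. The function name and the three directory arguments are hypothetical; the only layout assumed is that each egs directory holds an `info/num_archives` file containing a single integer.

```python
from pathlib import Path


def read_int(path):
    """Read the first whitespace-separated token of a file as an int."""
    return int(Path(path).read_text().split()[0])


def write_combined_num_archives(old_egs, new_egs, combined_egs):
    """Write info/num_archives for the combined egs dir as the sum of both sources."""
    total = (read_int(Path(old_egs) / "info" / "num_archives")
             + read_int(Path(new_egs) / "info" / "num_archives"))
    info_dir = Path(combined_egs) / "info"
    info_dir.mkdir(parents=True, exist_ok=True)
    (info_dir / "num_archives").write_text(f"{total}\n")
    return total
```

Any other files under `info/` would still need to be copied from one of the two source directories by hand; this sketch only handles the one value mentioned in the reply.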


--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+unsubscribe@googlegroups.com.
To post to this group, send email to kaldi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/3787ce3a-bcfc-470b-ae5a-c2a2b8105c95%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Miguel Jette

Oct 26, 2017, 7:18:11 PM
to kaldi-help
Ok, I think I could do that.
I am definitely starting from the same tree and ivector extractor.

I'm not sure what you mean by "somehow interleaving the new and old egs and making soft links".
I will take a deeper look tomorrow.

Thanks for your hints,
Miguel



Daniel Povey

Oct 26, 2017, 9:23:09 PM
to kaldi-help
By "somehow interleaving" all I meant is that it might not be optimal to give the archives of egs (cegs.*.ark) a numbering where, for instance, archives 1 through 60 are from dataset 1, and archives 61 through 78 are from dataset 2.  If there is a significant difference between the datasets, you'd get a kind of oscillation every epoch.  So you might want to use a numbering that mixes them together (a randomized numbering would be fine).

Dan
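The randomized numbering could be sketched roughly as follows. This is an illustration under assumptions, not a script that ships with Kaldi: `link_interleaved`, its arguments, and the fixed seed are hypothetical, and it only handles the `cegs.N.ark` archives. Diagnostic and validation egs would still have to be taken from one of the source directories separately.

```python
import os
import random
import re
from pathlib import Path

# Matches archive names of the form cegs.N.ark (assumed naming, as in the thread).
ARCHIVE_RE = re.compile(r"^cegs\.(\d+)\.ark$")


def list_archives(egs_dir):
    """Return absolute paths of cegs.N.ark files in egs_dir, sorted by N."""
    found = []
    for p in Path(egs_dir).iterdir():
        m = ARCHIVE_RE.match(p.name)
        if m:
            found.append((int(m.group(1)), p.resolve()))
    return [p for _, p in sorted(found)]


def link_interleaved(old_egs, new_egs, combined_egs, seed=0):
    """Soft-link archives from both dirs into combined_egs under a shuffled numbering."""
    sources = list_archives(old_egs) + list_archives(new_egs)
    # Shuffle so that archives from the two datasets are mixed, not grouped.
    random.Random(seed).shuffle(sources)
    out = Path(combined_egs)
    out.mkdir(parents=True, exist_ok=True)
    for n, src in enumerate(sources, start=1):
        # Raises FileExistsError if the link already exists; start from an empty dir.
        os.symlink(src, out / f"cegs.{n}.ark")
    return sources
```

The fixed seed only makes the numbering reproducible; any shuffle that avoids putting all of one dataset's archives in a contiguous block serves the purpose described above.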



Miguel Jette

Oct 30, 2017, 9:38:23 AM
to kaldi-help
Hi Dan,

I finally set aside some time to write this.
Unfortunately, I just noticed that the *egs_per_archive* values are not the same in my two folders. I imagine that's a problem.

cat old_experiment/exp/chain/tdnn_lstm1a_bi/egs/info/egs_per_archive
9377
cat new_experiment/exp/chain/tdnn_lstm1a_bi/egs/info/egs_per_archive
9368

Everything else seems to be compatible. I'm not sure why they differ. Maybe I can regroup the egs from the "new experiment" somehow?

Thank you for your guidance,

Miguel


Daniel Povey

Oct 30, 2017, 11:45:01 AM
to kaldi-help
That's not a problem-- that information is only really used for diagnostics.  Just use one of them, or if you want to be anal, you can use the average or weighted average, but it's not important.
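For anyone who does want the weighted average, a small sketch of the arithmetic. The function name is hypothetical, and the archive counts used in the example (60 and 18) are illustrative only, since the actual num_archives values are not given in this thread; the two egs_per_archive values are the ones quoted above.

```python
def weighted_egs_per_archive(dirs):
    """Average egs_per_archive over several egs dirs, weighted by archive count.

    dirs is a list of (egs_per_archive, num_archives) pairs.
    """
    total_egs = sum(egs * n for egs, n in dirs)
    total_archives = sum(n for _, n in dirs)
    return round(total_egs / total_archives)


# Illustrative archive counts; substitute the real info/num_archives values.
combined = weighted_egs_per_archive([(9377, 60), (9368, 18)])
print(combined)  # 9375
```

Since the value is only used for diagnostics, simply copying either of the two numbers is equally acceptable.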

