NNET3 with Multi-view features


Gerardo Roa

Nov 23, 2019, 3:37:38 PM
to kaldi-help
Hi guys,
I've been trying to figure out how to concatenate different levels of features in a chain configuration using the xconfig.

My task is ASR from unaccompanied singing recordings.

So far I have been running tdnnf experiments using some frame-level features that can easily be concatenated together, plus ivectors (which, as I understand, are at the utterance level).
For example, if I have 40 MFCC + 3 pitch + 3 jitter features, I do

input name=ivector dim=100
input name=input dim=46
fixed-affine-layer name=lda input=Append(-1,0,1,ReplaceIndex(ivector, t, 0)) affine-transform-file=$dir/configs/lda.mat

But how can I add another level of features, like song-level ivectors,
input name=ivector dim=100
input name=ivector-songlevel dim=50
input name=input dim=46

and how can I concatenate those new features in the input to get something like
fixed-affine-layer name=lda input=Append(-1,0,1,ReplaceIndex(ivector, t, 0), ReplaceIndex(ivector-songlevel, t, 0)) affine-transform-file=$dir/configs/lda.mat


Is it possible to do this, and also to concatenate even more different levels of features?

Thank you






Daniel Povey

Nov 24, 2019, 9:40:57 PM
to kaldi-help
Yes, that would work fine.
Or you could concatenate them before you give them to the network.
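
For the second option, a minimal sketch of concatenating the two ivector streams beforehand might look like this (assuming the vesis84 kaldi_io Python package and hypothetical archive paths, and that both archives are keyed by the same utterance ids; otherwise you would need an utt-to-song map). The combined 150-dim vectors would then be passed as the single 'ivector' input:

import numpy as np
import kaldi_io  # https://github.com/vesis84/kaldi-io-for-python

# Hypothetical paths: one 100-dim utterance-level ivector and one 50-dim
# song-level ivector per utterance, keyed by the same utterance ids.
song_level = dict(kaldi_io.read_vec_flt_ark('exp/ivectors_song/ivector.ark'))

with open('exp/ivectors_combined/ivector.ark', 'wb') as out:
    for utt, ivec in kaldi_io.read_vec_flt_ark('exp/ivectors_utt/ivector.ark'):
        combined = np.concatenate([ivec, song_level[utt]])  # 100 + 50 = 150 dims
        kaldi_io.write_vec_flt(out, combined, key=utt)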


Gerardo Roa

Nov 25, 2019, 8:41:43 AM
to kaldi-help
Hi Dan,
Thank you for your response.
Sadly, just adding another ReplaceIndex() doesn't work. It was the first thing I tried, but I got the following error:
nnet3-init foo/init.config foo/init.raw 
ERROR (nnet3-init[5.5.309~3-9e9ae]:Parse():nnet-descriptor.cc:620) Expected a Descriptor, got instead offset1

[ Stack-Trace: ]
kaldi::MessageLogger::LogMessage() const
kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)
kaldi::nnet3::GeneralDescriptor::Parse(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const**)
kaldi::nnet3::GeneralDescriptor::ParseReplaceIndex(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const**)
kaldi::nnet3::GeneralDescriptor::Parse(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const**)
kaldi::nnet3::GeneralDescriptor::ParseAppendOrSumOrSwitch(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const**)
kaldi::nnet3::GeneralDescriptor::Parse(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const**)
kaldi::nnet3::Descriptor::Parse(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const**)
kaldi::nnet3::Nnet::ProcessOutputNodeConfigLine(int, kaldi::ConfigLine*)
kaldi::nnet3::Nnet::ReadConfig(std::istream&)
main
__libc_start_main
_start

ERROR (nnet3-init[5.5.309~3-9e9ae]:ProcessOutputNodeConfigLine():nnet-nnet.cc:396) Error parsing descriptor (input=...) in config line output-node name=output input=Append(Offset(input, -1), input, Offset(input, 1), ReplaceIndex(offset1, t, 0))

[ Stack-Trace: ]
kaldi::MessageLogger::LogMessage() const
kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)
kaldi::nnet3::Nnet::ProcessOutputNodeConfigLine(int, kaldi::ConfigLine*)
kaldi::nnet3::Nnet::ReadConfig(std::istream&)
main
__libc_start_main
_start

kaldi::KaldiFatalError
Traceback (most recent call last):
  File "steps/nnet3/xconfig_to_configs.py", line 333, in <module>
    main()
  File "steps/nnet3/xconfig_to_configs.py", line 327, in main
    existing_model=args.existing_model)
  File "steps/nnet3/xconfig_to_configs.py", line 278, in check_model_contexts
    config_dir, file_name))
  File "steps/libs/common.py", line 158, in execute_command
    p.returncode, command))
Exception: Command exited with status 255: nnet3-init  foo/init.config foo/init.raw


I think I will go ahead and append all the features before they are presented to the network. I was trying to avoid unnecessarily repeating those features in the input, and to do something more like the way the ivectors are appended in the fixed-affine-layer.
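
Concretely, that "append beforehand" route would be something like the following rough sketch (assuming the vesis84 kaldi_io Python package and hypothetical paths; it is roughly what paste-feats / append-vector-to-feats would produce), with the 50 song-level dimensions repeated on every frame:

import numpy as np
import kaldi_io  # https://github.com/vesis84/kaldi-io-for-python

# Hypothetical paths: 46-dim frame-level features (MFCC + pitch + jitter) and
# a 50-dim song-level vector per utterance, keyed by the same utterance ids.
song_level = dict(kaldi_io.read_vec_flt_ark('exp/ivectors_song/ivector.ark'))

with open('data/train_combined/feats.ark', 'wb') as out:
    for utt, feats in kaldi_io.read_mat_scp('data/train/feats.scp'):
        # Repeat the song-level vector on every frame and paste it onto the
        # frame-level features: (T, 46) + (T, 50) -> (T, 96).
        song = np.tile(song_level[utt], (feats.shape[0], 1))
        kaldi_io.write_mat(out, np.hstack([feats, song]), key=utt)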



Desh Raj

Nov 25, 2019, 9:15:14 AM
to kaldi...@googlegroups.com
This is probably because the "fixed-affine-layer" component can only take inputs named "input" or "ivector", so it throws an error on encountering "ivector-songlevel". You will need to concatenate your "ivector" and "ivector-songlevel" beforehand, using something like paste-feats. Another option might be to use delta-layer (see example here) instead of the LDA layer, but then you'd need to tune the scale to get the best results.

Desh


Gerardo Roa

Nov 25, 2019, 11:03:22 AM
to kaldi...@googlegroups.com
Thank you so much, Desh.
I didn't realise that the names can only be one of ["input", "ivector"]; I thought it might accept more names.
I will try paste-feats and the delta-layer and see which one makes it easier to add more features.

Gerardo



--
Gerardo Roa Dabike
Computer Science Engineer

Daniel Povey

Nov 25, 2019, 11:03:53 PM
to kaldi-help
I think the issue is not the names; it's that, by default, when it estimates that affine transform it only keeps around the input descriptors.  If any xconfig layer needs to be kept around for that, you would have to give it an option to be added to init.config as well as ref.config and final.config (the names are from memory).  It's a question of changing `for x in ['ref', 'final']` to `for x in ['ref', 'final', 'init']`, IIRC.
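
Purely as a schematic of the kind of change meant here (this is not the actual contents of steps/nnet3/xconfig_to_configs.py or of the xconfig layer code; the helper names are made up):

# Schematic only: each xconfig layer emits its config lines into some subset
# of the generated files, and a new input also has to land in 'init.config'
# (which nnet3-init uses when building the network that estimates the
# LDA-like transform), not only 'ref.config' and 'final.config'.
def write_config_lines(layer, config_files):      # hypothetical helper
    for x in ['ref', 'final', 'init']:            # 'init' added, as suggested above
        for line in layer.lines_for_config(x):    # hypothetical method
            config_files[x].write(line + '\n')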

Note: some more recent recipes get rid of the affine transform (since it is a pain to have to estimate it) and just use delta features.


yasufum...@adaptcentre.ie

Nov 27, 2020, 12:54:33 AM
to kaldi-help
Hi,

Could I ask how much code extension is required to join an utterance-level feature with a frame-level feature at the xconfig level?
I've also been trying to do this to save storage space, but I'm wondering whether it's worth the effort or better to just join those features beforehand using paste-feats and a dim-range-component to split them later.

It seems that once the input name "ivector" is changed to another name, the neural network is initialised ill-formed (missing left-context and right-context), and a Python script later returns an error.

Best regards,
Yasufumi

Daniel Povey

Nov 27, 2020, 1:17:44 AM
to kaldi-help
It might be easier to join all the utterance-level features together as 'ivector' and provide them as an archive of vectors?
Certain programs may treat 'ivector' specially, and certain programs may not.


yasufum...@adaptcentre.ie

Nov 27, 2020, 1:01:42 PM
to kaldi-help
Thanks Dan for this advice.

I had to remove some verification steps for the ivector data in nnet3-chain-get-egs.cc, since my utterance-level features are not computed every n frames, but other than that, treating the utterance-level features as the ivector seems fine.

Best,
Yasufumi

Jan Trmal

Dec 2, 2020, 5:08:57 AM
to kaldi-help
Another idea would be to just resample (linearly interpolate) the features so that they have the "proper" sampling frequency. You could use the Python kaldi_io package to write out the feature archives, if that would make it easier for you.
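
For what it's worth, a minimal sketch of that resampling idea (assuming the vesis84 kaldi_io Python package, a hypothetical low-rate archive slow.ark, and a standard utt2num_frames file giving the target frame count per utterance):

import numpy as np
import kaldi_io  # https://github.com/vesis84/kaldi-io-for-python

def resample_frames(feats, target_len):
    # Linearly interpolate a (frames x dims) matrix to target_len frames.
    src = np.linspace(0.0, 1.0, num=feats.shape[0])
    dst = np.linspace(0.0, 1.0, num=target_len)
    resampled = [np.interp(dst, src, feats[:, d]) for d in range(feats.shape[1])]
    return np.stack(resampled, axis=1).astype(np.float32)

# Hypothetical paths: slow.ark holds the features computed at the "wrong" frame
# rate; utt2num_frames gives the frame count of the regular frame-level features.
num_frames = {u: int(n) for u, n in
              (line.split() for line in open('data/train/utt2num_frames'))}
with open('resampled.ark', 'wb') as out:
    for utt, mat in kaldi_io.read_mat_ark('slow.ark'):
        kaldi_io.write_mat(out, resample_frames(mat, num_frames[utt]), key=utt)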
y.
