Hello,
I am a complete novice in the ASR field and I'm looking to develop a model using the TIMIT dataset. However, I'm finding the data preprocessing and labeling aspects quite confusing.
My ultimate goal is to use the preprocessed TIMIT dataset as input to an LSTM model to predict sentences.
I've seen examples of training the Wav2Vec model on GitHub and in various communities (https://happy-obok.tistory.com/69), but I suspect that those preprocessing steps are not suitable for an LSTM model. Through research, I found out about extracting MFCC features, but I'm unsure how to apply this approach. I would greatly appreciate any guidance on this matter.
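To make the question concrete, here is a minimal pure-Python sketch of how I currently understand MFCC extraction: frame the waveform, apply a window, take the power spectrum, pool it with mel-spaced triangular filters, take the log, and apply a DCT. The function names and the parameter values (16 kHz audio, 25 ms frames, 10 ms hop, 26 filters, 13 coefficients) are my own assumptions based on commonly quoted defaults, not something prescribed by TIMIT or taken from the linked post, and the naive DFT is only for illustration.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters spaced evenly on the mel scale from 0 Hz to Nyquist.
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    mel_points = [mel_max * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate))
            for m in mel_points]
    bank = [[0.0] * n_bins for _ in range(n_filters)]
    for j in range(n_filters):
        left, center, right = bins[j], bins[j + 1], bins[j + 2]
        for k in range(left, center):      # rising slope
            bank[j][k] = (k - left) / (center - left)
        for k in range(center, right):     # falling slope
            bank[j][k] = (right - k) / (right - center)
    return bank

def power_spectrum(frame, n_fft):
    # Naive DFT (slow, illustration only); keeps bins 0..n_fft/2.
    spec = []
    for k in range(n_fft // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n_fft)
                 for t, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * t / n_fft)
                  for t, x in enumerate(frame))
        spec.append((re * re + im * im) / n_fft)
    return spec

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_filters=26, n_coeffs=13):
    # Assumed defaults: 25 ms frames and 10 ms hop at 16 kHz.
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * t / (frame_len - 1))
              for t in range(frame_len)]   # Hamming window
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + t] * window[t] for t in range(frame_len)]
        power = power_spectrum(frame, frame_len)
        # Log of the energy in each mel band (small offset avoids log(0)).
        log_e = [math.log(sum(w * p for w, p in zip(filt, power)) + 1e-10)
                 for filt in bank]
        # Unnormalized DCT-II; keep the first n_coeffs coefficients.
        coeffs = [sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                      for j, e in enumerate(log_e))
                  for c in range(n_coeffs)]
        features.append(coeffs)
    return features  # one n_coeffs-long vector per frame
```

If this is right, each utterance becomes a sequence of shape (num_frames, 13), and I assume each frame vector would be one time step of the LSTM input. I understand that in practice a library such as librosa would compute this much faster; what I'm unsure about is whether this is the right kind of input for an LSTM and how the TIMIT labels should be aligned with these frames.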
Thank you for taking the time to help.