Wav2vec2 XLSR

Kathrine Selvage

Aug 3, 2024, 4:21:20 PM
to tipafise

I want to train a speech-to-text model with wav2vec2 XLSR (a transformer-based model) for Danish. The usual recommendation is to train on Common Voice with the help of the datasets library, but Common Voice has very little Danish data. I now want to train the model on my own custom data, but I have failed to find any clear documentation for this. Can anybody please help me with how to do it, step by step?

I suggest you extend the Common Voice (CV) Danish subset with your own dataset. Analyse the CV data first and shape your own data like the CV corpus. At this point the file extension (.wav, .mp3, ...), sample type (float32, int, ...), audio lengths, and of course the transcription format are important. Do not make your corpus sparse.
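A minimal sketch of that advice using the Hugging Face datasets library (not an official recipe): the Common Voice config name and the local file paths below are placeholders, so substitute the dataset version and paths you actually use, and note that the mozilla-foundation Common Voice datasets require accepting their terms on the Hub.

from datasets import Audio, Dataset, concatenate_datasets, load_dataset

# 1) The small Common Voice Danish subset, reduced to an audio/transcription schema
cv_da = load_dataset("mozilla-foundation/common_voice_11_0", "da", split="train")
cv_da = cv_da.remove_columns([c for c in cv_da.column_names if c not in ("audio", "sentence")])

# 2) Your own corpus: one transcription per audio file, same column names as above
custom = Dataset.from_dict({
    "audio": ["my_data/clip_0001.wav", "my_data/clip_0002.wav"],      # placeholder paths
    "sentence": ["første transskription", "anden transskription"],    # placeholder text
})

# 3) Decode both to float32 arrays at 16 kHz, the rate wav2vec2 XLSR expects,
#    then concatenate into a single training corpus
cv_da = cv_da.cast_column("audio", Audio(sampling_rate=16_000))
custom = custom.cast_column("audio", Audio(sampling_rate=16_000))
combined = concatenate_datasets([cv_da, custom])
print(combined)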

Hi @patrickvonplaten and @tiena2cva,
Thanks for the new official wav2vec2-pretraining example, this helps a lot!
I had the same problem as @tiena2cva. I tried to re-run the demo script with the same parameters on my own GPU. After a few epochs the contrastive loss decreased to zero and the model stopped changing.
Running inference showed that the quantizer maps all time steps to the same vector (this can be seen in projected_quantized_states), which explains the zero contrastive loss.
I would have thought that the diversity loss weight should be increased, but I used the parameters given in the README file, so this behavior is unexpected and may indicate a different problem.

I have the same situation here.
I tried to pretrain wav2vec on 8K and 16K samples. After a few steps the contrastive loss goes to 0, the diversity loss shoots up to 1, and the perplexity goes to 2.
I also tried changing hyperparameters like the learning rate and the Gumbel temperature, but no luck.
Any updates?
@tiena2cva @patrickvonplaten
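A rough diagnostic sketch for the collapse described in this thread, assuming the transformers Wav2Vec2ForPreTraining class; the checkpoint name and the random input below are placeholders for your own pretraining checkpoint and real audio.

import torch
from transformers import Wav2Vec2ForPreTraining

model = Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-base")  # placeholder checkpoint
model.eval()

# one second of fake 16 kHz audio, just to exercise the forward pass
input_values = torch.randn(1, 16_000)

with torch.no_grad():
    outputs = model(input_values)

quantized = outputs.projected_quantized_states[0]            # (time, dim)
unique_codes = torch.unique(quantized, dim=0).shape[0]
print(f"time steps: {quantized.shape[0]}, unique quantized vectors: {unique_codes}")
print(f"codevector perplexity: {outputs.codevector_perplexity.item():.2f}")

# A healthy quantizer uses many distinct codevectors; unique_codes == 1 (and a
# perplexity stuck near its floor) reproduces the collapse reported above.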

TLDR: a simple and efficient pruning method for discovering sparse subnetworks in self-supervised pre-trained initializations (wav2vec 2.0/XLSR-53) that can be finetuned to the same downstream low-resource ASR results as the full model. See the illustration below.

Full Abstract:
Self-supervised speech representation learning (speech SSL) has demonstrated the benefit of scale in learning rich representations for Automatic Speech Recognition (ASR) with limited paired data, such as wav2vec 2.0. We investigate the existence of sparse subnetworks in pre-trained speech SSL models that achieve even better low-resource ASR results. However, directly applying widely adopted pruning methods such as the Lottery Ticket Hypothesis (LTH) is suboptimal in the computational cost needed. Moreover, we show that the discovered subnetworks yield minimal performance gain compared to the original dense network.
We present Prune-Adjust-Re-Prune (PARP), which discovers and finetunes subnetworks for much better performance, while only requiring a single downstream ASR finetuning run. PARP is inspired by our surprising observation that subnetworks pruned for pre-training tasks need merely a slight adjustment to achieve a sizeable performance boost in downstream ASR tasks. Extensive experiments on low-resource ASR verify (1) sparse subnetworks exist in mono-lingual/multi-lingual pre-trained speech SSL, and (2) the computational advantage and performance gain of PARP over baseline pruning methods.
In particular, on the 10min Librispeech split without LM decoding, PARP discovers subnetworks from wav2vec 2.0 with an absolute 10.9%/12.6% WER decrease compared to the full model. We further demonstrate the effectiveness of PARP via: cross-lingual pruning without any phone recognition degradation, the discovery of a multi-lingual subnetwork for 10 spoken languages in 1 finetuning run, and its applicability to pre-trained BERT/XLNet for natural language tasks.

A lesson from the past:
Recent progress in self-supervised speech representations (e.g., the wav2vec series, the UniSpeech series, BigSSL, etc.) has proven the importance of scaling up the representation modules to attain SOTA ASR performance. Additionally, the SUPERB benchmark shows that regardless of the SSL objective, model scaling is critical across 10 downstream speech tasks. Basically, for a given SSL objective, the bigger the model, the better the downstream results are! This leads us to a series of research questions:
Given a self-supervised speech representation framework such as wav2vec 2.0/XLSR-53 (shown on the right), do there exist sparse subnetworks within the self-supervised pre-trained initializations with similar downstream low-resource ASR performance? What are the properties of these sparse subnetworks, and what can we learn from them? Beyond applying the pruning methods from the vision community, is there a more efficient approach to discovering these sparse subnetworks?

Our Goal:
Discover sparse subnetworks within pre-trained speech SSL models for low-resource ASR such that (1) they are found and finetuned efficiently within one ASR finetuning pass, and (2) they attain the same or even lower WER compared to the original dense speech SSL model.

A Surprising Observation:
For any downstream spoken language, the non-zero ASR pruning masks obtained from task-agnostic subnetwork discovery have high overlap with those obtained from task-aware subnetwork discovery.
For instance, on the right, applying unstructured magnitude pruning (UMP) on a wav2vec2 finetuned for Spanish and applying UMP again on a wav2vec2 finetuned for French results in a 97% overlap in their pruning patterns. This holds for any amount of downstream supervision, pre-training model scale, and sparsity.
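A tiny illustration of the overlap statistic above: given two boolean "keep" masks at the same sparsity (e.g. from UMP on a Spanish- and a French-finetuned wav2vec2), measure how much they agree. The masks below are random placeholders, and the intersection-over-union metric is an assumption rather than necessarily the paper's exact definition.

import torch

sparsity = 0.9
mask_es = torch.rand(1024, 1024) > sparsity   # placeholder "Spanish" keep-mask
mask_fr = torch.rand(1024, 1024) > sparsity   # placeholder "French" keep-mask

intersection = (mask_es & mask_fr).sum().item()
union = (mask_es | mask_fr).sum().item()
print(f"overlap of kept weights (IoU): {intersection / union:.1%}")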

The flip side of the observation is: task-agnostic pruning/subnetwork provides a good basis for task-dependent pruning/subnetwork.

The Algorithm:
Step 1: Language-Agnostic Initial Subnetwork. Directly prune the pre-trained SSL model, such as wav2vec2/XLSR, at the target sparsity to obtain an initial subnetwork and an initial pruning mask. Alternatively, prune a wav2vec2/XLSR finetuned on a non-target language.
Step 2: Language-Aware Subnetwork Adjustment. Finetune the initial subnetwork on the target downstream task/language. During finetuning, zero out the pruned weights specified by the pruning mask, but allow those weights to be updated by gradient descent during backpropagation. After a small number of model updates, re-prune the updated subnetwork at the target sparsity again.

In practice, we repeat Step 2 multiple times (within one full downstream ASR finetuning run). The figure below illustrates the two steps, and a code sketch of the loop follows.
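A minimal PyTorch sketch of Steps 1-2, under stated assumptions: model, train_loader, asr_loss, and optimizer are placeholders for a real wav2vec2/XLSR finetuning setup, magnitude pruning is applied per weight matrix here, and this is an illustration rather than the authors' reference implementation.

import torch

def magnitude_prune_masks(model, sparsity):
    """Step 1: task-agnostic unstructured magnitude pruning at the target sparsity."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:                      # skip biases / norm parameters
            continue
        k = int(sparsity * p.numel())
        if k == 0:
            masks[name] = torch.ones_like(p, dtype=torch.bool)
            continue
        threshold = p.detach().abs().flatten().kthvalue(k).values
        masks[name] = p.detach().abs() > threshold   # keep the largest-magnitude weights
    return masks

def apply_masks(model, masks):
    """Zero out the pruned weights specified by the masks."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])

def parp_finetune(model, train_loader, asr_loss, optimizer, sparsity,
                  reprune_every=50, num_steps=1000):
    masks = magnitude_prune_masks(model, sparsity)       # initial subnetwork + mask
    apply_masks(model, masks)
    for step, batch in enumerate(train_loader):
        if step >= num_steps:
            break
        loss = asr_loss(model, batch)                    # Step 2: finetune on target language
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                 # pruned weights may revive here
        if (step + 1) % reprune_every == 0:
            masks = magnitude_prune_masks(model, sparsity)   # re-prune ("adjust") at target sparsity
            apply_masks(model, masks)
    apply_masks(model, masks)                            # final subnetwork at target sparsity
    return model, masks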

Results Summary

Pruning for Low-Resource English ASR:
PARP (black line) outperforms or matches baseline pruning methods in all settings. We also found that at the 10min split w/o LM decoding, sparse subnetworks from wav2vec2 found with PARP attain an absolute 10% WER reduction over the full wav2vec2.

Pruning for Low-Resource Multi-Lingual ASR:
We pruned wav2vec2/XLSR for 10 spoken languages in low-resource conditions: Spanish (es), French (fr), Italian (it), Kyrgyz (ky), Dutch (nl), Russian (ru), Swedish (sv-SE), Turkish (tr), Tatar (tt), and Mandarin (zh-TW). Observe the similar pruning curves across languages; PARP (black and pink lines) again outperforms or matches baseline pruning methods in all settings.
Cross-Lingual Subnetwork Transfer for wav2vec 2.0:
We investigate the transferability of a sparse subnetwork discovered for a source language by finetuning it on another target language.
Upper Left: transfer with regular finetuning leads to an increase in PER over same-pair (no language mismatch) transfer.
Upper Right: transfer with PARP leads to no increase in PER (sometimes even a decrease!) over same-pair transfer. For example, a Spanish sparse subnetwork from wav2vec2/XLSR can be efficiently adapted for French without any PER increase.

Cross-Task Subnetwork Transfer for BERT/XLNet:
We investigate the transferability of a sparse subnetwork discovered for a source natural language task from GLUE by finetuning it on another natural language task.
This experiment shows the applicability of PARP to natural language domains.

Cross-Task Subnetwork Transfer for wav2vec 2.0:
We investigate the transferability of a sparse subnetwork discovered for a source speech task from SUPERB by finetuning it on another speech task.
This experiment shows the (potential) applicability of PARP to other speech tasks such as spoken language understanding or speaker recognition.
Pruning as an Alternative for Representation Probing:
We empirically found that pruning/sparsity has a nice correspondence to the quality of the representation for downstream tasks.
Top Table: pruned weight localization across layers. We consistently found that middle layers are pruned less, while initial and final layers are pruned much more (a quick way to compute this per-layer statistic is sketched below).
Bottom Plot: a plot borrowed from wav2vec-U, which shows that middle-layer representations are more valuable for downstream ASR.
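A quick sketch of that per-layer statistic: the fraction of zeroed weights in each transformer layer of a pruned model. The checkpoint below is a placeholder; point it at an actually pruned wav2vec2/XLSR model to reproduce the table's trend.

from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")   # placeholder; use your pruned checkpoint

per_layer = {}
for name, p in model.named_parameters():
    if "encoder.layers." in name and p.dim() >= 2:                 # transformer weight matrices only
        layer = int(name.split("encoder.layers.")[1].split(".")[0])
        zeros, total = per_layer.get(layer, (0, 0))
        per_layer[layer] = (zeros + (p == 0).sum().item(), total + p.numel())

for layer in sorted(per_layer):
    zeros, total = per_layer[layer]
    print(f"layer {layer:2d}: {zeros / total:.1%} of weights pruned")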
A Hypothesis (a new insight into self-supervised representation learning):
We hypothesize the existence of downstream task/language-specific weights in self-supervised pre-trained initializations. These important weights account for only a very small fraction of the overall model weights, whether in pre-trained wav2vec2, XLSR, BERT, or XLNet. Therefore, for a given task/language, most of the pruned weights can be obtained "freely" with task-agnostic pruning. The adjustment step then "recovers" the accidentally pruned important weights by reviving them with gradient updates, and since there are so few of them, the adjustment can be done efficiently.

In recent times, advances in neural models trained on extensive multilingual textual and spoken data have shown promising potential for improving the situation of languages that lack resources. This study is centered on conducting experiments with cutting-edge speech recognition models, specifically Wav2Vec2.0 and Wav2Vec2-XLSR, applied to the Kazakh language. The primary aim of this research is to assess the efficacy of these models in transcribing spoken Kazakh content. Additionally, the investigation explores the feasibility of leveraging data from other languages for initial training, and whether refining the model with target-language data can enhance its performance. As such, this study offers valuable insights into the viability of employing pre-trained multilingual models for under-resourced languages. The fine-tuned wav2vec2.0-XLSR model achieved exceptional results, with a character error rate (CER) of 1.9 and a word error rate (WER) of 8.9 when evaluated on the test set of the kazcorpus dataset. The outcomes of this analysis hold potential to advance the creation of robust and efficient Automatic Speech Recognition (ASR) systems tailored for the Kazakh language. These developments stand to benefit a range of applications, including speech-to-text translation, voice-activated assistants, and speech-driven communication tools.
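Connecting this back to the question at the top of the thread, here is a hedged sketch of the "refine with target-language data" step using the transformers library: build a small character vocabulary, attach a fresh CTC head to the multilingual XLSR checkpoint, and run one supervised step. The vocabulary, audio, and transcription below are placeholders, not the setup used in the paper.

import json
import torch
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

# placeholder character vocabulary; in practice build it from your transcriptions
vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz ")}
vocab["|"] = vocab.pop(" ")               # CTC word delimiter instead of a space
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)
with open("vocab.json", "w") as f:
    json.dump(vocab, f)

tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16_000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True,
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()            # the CNN front-end is usually kept frozen
model.train()

# one (audio, transcription) pair; replace with real batches from your corpus
speech = torch.randn(16_000).numpy()      # 1 s of placeholder 16 kHz audio
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
labels = processor.tokenizer("en pladsholder transskription", return_tensors="pt").input_ids

loss = model(inputs.input_values, labels=labels).loss
loss.backward()                           # plug into your optimizer / Trainer loop
print(f"CTC loss: {loss.item():.2f}")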
