You can convert your speech to text by using speech recognition software or APIs, such as Google Speech-to-Text or Microsoft's Azure Speech service.
Key takeaways:

- Wav2Vec2 enables speech-to-text conversion through self-supervised learning on raw audio data.
- The model predicts masked speech units, refining its understanding of spoken language.
- It achieves low word error rates (WER) on both noisy and clean audio, outperforming its predecessor.
- Wav2Vec2 learns essential speech units, focusing on the relevant audio information.
- A CNN combined with a transformer generates latent audio representations of spoken language for transcription.
- Cross-lingual pretraining helps the model perform well even with limited labeled data.
- Wav2Vec2 performs robustly on benchmarks, even with minimal labeled data.
Speech-to-text conversion is a popular task in natural language processing that involves converting spoken language into text. Speech-to-text transcription has many real-world applications, including personal virtual assistants and automated customer service. In this Answer, we'll see how to achieve this task using Facebook's open-source Wav2Vec2 model, which extracts representations from raw audio data using self-supervised learning.
With self-supervision, Wav2Vec2 learns from unannotated training data. The model is trained to predict the correct speech units for masked portions of the audio, and in doing so it simultaneously learns what each speech unit represents.
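To make the masking idea concrete, here is a toy sketch in plain PyTorch (made-up shapes and values, not the actual Wav2Vec2 pretraining loss). It only illustrates the masking step: spans of latent frames are hidden, and the transformer is then trained to identify the correct quantized speech unit at those masked positions.

```python
import torch

torch.manual_seed(0)

# Toy latent audio representations: 1 utterance, 20 frames, 8-dim features.
latents = torch.randn(1, 20, 8)

# Choose spans of frames to mask, as in masked-prediction pretraining.
mask = torch.zeros(1, 20, dtype=torch.bool)
mask[0, 5:10] = True      # mask frames 5-9
mask[0, 14:17] = True     # mask frames 14-16

# Replace masked frames with a mask embedding (here just a random vector,
# whereas the real model uses a learned one).
mask_embedding = torch.randn(8)
masked_latents = latents.clone()
masked_latents[mask] = mask_embedding

# The transformer (omitted here) would read masked_latents and be trained to
# pick the correct quantized speech unit at each masked position.
print(masked_latents.shape, mask.sum().item(), "frames masked")
```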
Wav2Vec2 achieves a word error rate (WER) of 8.6% on noisy speech and 5.2% on clean speech on the standard LibriSpeech benchmark dataset, outperforming its first version. This makes it practical to build speech recognition systems for many languages and dialects.
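WER is simply the word-level edit distance between the reference transcript and the model's output, divided by the number of words in the reference. A minimal, illustrative implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```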
Unlike conventional speech recognition systems, which are trained on large datasets of labeled speech audio, Wav2Vec2 learns the most relevant speech units, smaller than phonemes, that represent the speech audio. Traditional speech-to-text systems, by contrast, model every aspect of the audio, including background noise and speaker traits.
The model uses a multilayer convolutional neural network (CNN) to derive latent audio representations, each about 25 ms long. These are passed to a quantizer and a transformer. The quantizer chooses one unit from an inventory of learned speech units for each latent representation. Masked latent representations are passed to the transformer, which extracts information from the entire audio sequence. The transformer's output is then used to identify the correct quantized speech unit for each masked position.
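As a rough sketch of how these two stages show up in practice, the `Wav2Vec2Model` class in the Hugging Face `transformers` library exposes both the CNN encoder's output (`extract_features`) and the transformer's contextualized output (`last_hidden_state`). The shapes in the comments assume the `facebook/wav2vec2-base-960h` checkpoint and a one-second, 16 kHz input; exact frame counts may differ slightly.

```python
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

# Stand-in for one second of 16 kHz audio (batch size 1).
waveform = torch.randn(1, 16000)

with torch.no_grad():
    outputs = model(waveform)

# Output of the CNN feature encoder: one latent vector per ~20 ms frame.
print(outputs.extract_features.shape)   # approximately torch.Size([1, 49, 512])
# Output of the transformer: contextualized representations of the same frames.
print(outputs.last_hidden_state.shape)  # approximately torch.Size([1, 49, 768])
```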
The entire architecture is illustrated below:
Sometimes, we face a practical impediment: the unannotated data for a language is scarce. In this case, we prefer to pretrain the model on multiple languages rather than a single one, because the model can benefit from related languages while making predictions. This technique works well for Wav2Vec2, which learns speech units that are common across languages, whether closely related or not.
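For instance, the cross-lingually pretrained checkpoint released for wav2vec 2.0 (XLSR-53, pretrained on audio from 53 languages) can be loaded the same way as the English model and later fine-tuned on a small labeled set in the target language. The checkpoint name below is the one published on the Hugging Face Hub; this is only a loading sketch, not a full fine-tuning recipe.

```python
from transformers import Wav2Vec2Model

# Cross-lingually pretrained encoder (53 languages); it has no CTC head,
# so it still needs fine-tuning on labeled data before it can transcribe speech.
xlsr_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53")
print(xlsr_encoder.config.hidden_size)  # 1024 hidden units for the large model
```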
The table below shows the performance of the Wav2Vec2 model when it was fine-tuned on 100 hours, 1 hour, and 10 minutes of labeled data from the LibriSpeech benchmark dataset, after being pretrained on the same dataset. We observe an improvement from Noisy Student, the previous state-of-the-art model, to Wav2Vec2 when both are trained on the same 100 hours of labeled data.
The model keeps performing well even when the unannotated pretraining data grows much larger. As evidence, the model was pretrained on 53k hours of unannotated audio from the LibriVox dataset and then fine-tuned on just 10 minutes of labeled data, which resulted in a WER of 8.6%. Speech recognition is therefore still possible with very limited labeled data (a minimal fine-tuning sketch is shown after the table).
| Wav2Vec2 Model | WER (Word Error Rate %) |
| --- | --- |
| Noisy Student, 100h labeled data | 8.6 |
| Wav2Vec2, 100h labeled data | 5 |
| Wav2Vec2, 1h labeled data | 7.6 |
| Wav2Vec2, 10m labeled data | 10.8 |
| Wav2Vec2, 10m labeled data (53k hours unannotated) | 8.6 |
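The following is a minimal sketch of what one fine-tuning step with labeled data looks like, assuming the Hugging Face `transformers` library and a made-up (audio, transcript) pair; a real setup would iterate over a dataset with an optimizer, padding, and evaluation.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.train()

# Stand-in for one labeled example: a 16 kHz clip and its transcript.
audio = torch.randn(16000).numpy()
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer("HELLO WORLD", return_tensors="pt").input_ids

# The model computes the CTC loss internally when labels are provided.
outputs = model(inputs.input_values, labels=labels)
outputs.loss.backward()   # a real fine-tuning loop would now take an optimizer step
print(float(outputs.loss))
```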
The following example shows how to obtain a speech transcription using the pretrained Wav2Vec2 model. Run the widget below and make changes in the launched Jupyter Notebook file.
```python
!pip install -q transformers
!pip install librosa
!pip install torch

import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

educative_audio, educative_rate = librosa.load("/usr/local/notebooks/hardvard.wav", sr=16000)
tokeniz_educative = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model_educative = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
my_audio_input_values = tokeniz_educative(educative_audio, return_tensors="pt").input_values

educatives_logits = model_educative(my_audio_input_values).logits
educatives_prediction = torch.argmax(educatives_logits, dim=-1)
transcription = tokeniz_educative.batch_decode(educatives_prediction)[0]
print(transcription)
```
Lines 1–3: To start, we install the necessary libraries: `transformers` (for acquiring pretrained models), `librosa` (a Python package for audio analysis), and `torch`.
Lines 5–7: After installing these libraries, we import them into our notebook file. For the speech-to-text conversion task, we import `Wav2Vec2ForCTC` and `Wav2Vec2Tokenizer`.
Lines 9–12: We load the audio file, `hardvard.wav`, whose speech we want transcribed, into the variable `educative_audio`, setting the sampling rate to 16000 Hz. `tokeniz_educative` is initialized with a `Wav2Vec2Tokenizer` loaded from the pretrained `facebook/wav2vec2-base-960h` checkpoint. Line 11 initializes a `Wav2Vec2ForCTC` model from the same pretrained checkpoint. The last line of this section uses the tokenizer to convert the audio into input values that can be fed to the model.
Lines 14–17: The first line of this code section passes the input values to the `Wav2Vec2ForCTC` model and fetches the raw predictions, called logits. We then take the predicted class at each timestep with `torch.argmax` and store the result in `educatives_prediction`. Finally, the predictions are decoded into human-readable text, stored in the `transcription` variable, and printed on the last line.
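Note that in recent versions of the `transformers` library, passing raw audio through `Wav2Vec2Tokenizer` is deprecated in favor of `Wav2Vec2Processor`, which bundles the feature extractor and the tokenizer. A sketch of the equivalent pipeline (same checkpoint and audio file as above) could look like this:

```python
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Load the same audio file at the 16 kHz rate the model expects.
audio, rate = librosa.load("/usr/local/notebooks/hardvard.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```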
Wav2Vec2 uses self-supervised learning for speech-to-text conversion: convolutional layers process the raw audio, and a transformer then builds contextualized speech representations on top of them. Because Wav2Vec2 can learn these representations from large unannotated datasets, it greatly reduces the amount of labeled data required for training.