How to use the Wav2Vec2 model for speech-to-text conversion

Key takeaways:

  • Wav2Vec2 enables speech-to-text conversion through self-supervised learning on raw audio data.

  • The model predicts masked speech units, refining its understanding of spoken language.

  • It achieves low word error rates (WER) on noisy and clean audio, outperforming its predecessor.

  • Wav2Vec2 learns essential speech units, focusing on relevant audio information.

  • A stack of CNN and transformer layers generates latent audio representations of the spoken language, which are then used for transcription.

  • Cross-lingual pretraining helps the model perform well even with limited labeled data.

  • Wav2Vec2 performs robustly on benchmarks, even with minimal labeled data.

Speech-to-text conversion is a popular area of natural language processing that involves converting spoken language to text. Speech-to-text transcription has many real-world applications, including virtual personal assistants and automated customer service. In this Answer, we’ll see how to achieve this task using Facebook’s open-source Wav2Vec2 model, which extracts representations from raw audio data using self-supervised learning.

With self-supervision, Wav2Vec2 learns from unannotated training data. The model is trained to predict the correct speech units for masked parts of the audio and, at the same time, learns what each speech unit represents.
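
The masking behavior and the size of the learned speech-unit inventory are exposed as configuration values in the Hugging Face transformers implementation of Wav2Vec2. The following is a minimal sketch, assuming transformers is installed and using the public facebook/wav2vec2-base pretraining checkpoint, that simply prints these values:

from transformers import Wav2Vec2Config

# Configuration of the publicly released base pretraining checkpoint
config = Wav2Vec2Config.from_pretrained("facebook/wav2vec2-base")

print(config.mask_time_prob, config.mask_time_length)                   # how often masked spans start and how long each span is (in frames)
print(config.num_codevector_groups, config.num_codevectors_per_group)   # shape of the quantizer's codebook of learned speech units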

Wav2Vec2 achieves a word error rate (WER) of 8.6% on noisy speech and 5.2% on clean speech on the standard LibriSpeech benchmark dataset, outperforming its first version. This makes it practical to build speech recognition systems for many different languages and dialects.

The Wav2Vec2 architecture

Unlike conventional speech recognition systems, which are trained on large datasets of labeled speech audio, Wav2Vec2 learns the most relevant speech units, smaller than phonemes, that represent the speech audio. This differs from traditional speech-to-text systems, which model every aspect of the audio, including background noise and speaker traits.

The model uses a multilayer convolutional neural network (CNN) to derive latent audio representations, each about 25 ms long. These are passed to a quantizer and a transformer. The quantizer chooses one unit from an inventory of learned speech units for each latent audio representation. Masked audio representations are then passed to the transformer, which extracts contextual information from the whole audio sequence. The transformer’s output is used to identify the correct quantized speech unit for each masked position.

The entire architecture is illustrated below:

Wav2Vec2's architecture
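
As a quick sanity check, the two stages of this architecture can be inspected with the Hugging Face transformers library. The following is a minimal sketch, assuming transformers and torch are installed and using the public facebook/wav2vec2-base-960h checkpoint; the exact printed shapes may vary with the library version:

import torch
from transformers import Wav2Vec2Model

# Load the pretrained model: a CNN feature encoder followed by a transformer encoder
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

print(model.feature_extractor)   # the multilayer CNN that produces the latent audio representations
print(model.encoder.layers[0])   # the first of the transformer layers that contextualize them

# One second of 16 kHz audio yields roughly 49 latent frames (one every ~20 ms)
dummy_audio = torch.zeros(1, 16000)
with torch.no_grad():
    hidden_states = model(dummy_audio).last_hidden_state
print(hidden_states.shape)       # e.g., torch.Size([1, 49, 768]) for the base model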

Cross-lingual training

Sometimes, we face a technical impediment because the unlabeled data for a language is scarce. In this case, we prefer to pretrain the model on multiple languages rather than a single one, because the model can benefit from related languages while making predictions. This technique works well for Wav2Vec2, which learns speech units that are shared across related, and even very different, languages.
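
For example, the publicly released XLSR-53 checkpoint (facebook/wav2vec2-large-xlsr-53) was pretrained on speech from 53 languages. The sketch below shows one way to reuse it for a low-resource target language; the vocab_size value is a placeholder that would normally come from a tokenizer built for that language:

from transformers import Wav2Vec2ForCTC

# Load the cross-lingually pretrained XLSR-53 checkpoint and attach a fresh CTC head.
# vocab_size is a placeholder; it should match the target-language tokenizer.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    vocab_size=32,
    ctc_loss_reduction="mean",
)

# The multilingual speech units learned during pretraining are reused as-is;
# only the new CTC output layer must be trained on the limited labeled data.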

Wav2Vec2’s performance on public benchmarks

The table below shows the performance of the Wav2Vec2 model when it was fine-tuned on 100 hours, 1 hour, and 10 minutes of labeled data from the LibriSpeech benchmark dataset, after first being pretrained on LibriSpeech. We observe an improvement over Noisy Student, the previous state-of-the-art model, when Wav2Vec2 is fine-tuned on the same 100 hours of labeled data.

The model also performs well when the amount of unannotated pretraining data grows. As proof, the model was pretrained on 53k hours of unannotated audio from the LibriVox dataset and then fine-tuned on just 10 minutes of labeled data, which resulted in a WER of 8.6%. Thus, speech recognition can be enabled even with very limited labeled data.

| Wav2Vec2 Model                                      | WER (Word Error Rate %) |
| --------------------------------------------------- | ----------------------- |
| Noisy Student, 100h labeled data                    | 8.6                     |
| Wav2Vec2, 100h labeled data                         | 5                       |
| Wav2Vec2, 1h labeled data                           | 7.6                     |
| Wav2Vec2, 10m labeled data                          | 10.8                    |
| Wav2Vec2, 10m labeled data (53k hours unannotated)  | 8.6                     |
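
As a side note, WER counts the word-level substitutions, insertions, and deletions needed to turn the model’s transcription into the reference transcript, divided by the number of reference words. Below is a small, made-up illustration using the third-party jiwer package (installed with pip install jiwer), which is one common way to compute WER:

import jiwer

reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + insertions + deletions) / number of reference words
print(f"WER: {jiwer.wer(reference, hypothesis):.1%}")   # 2 substitutions / 9 words = ~22.2%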

Speech-to-text conversion using the pretrained Wav2Vec2 model

The following example shows how to obtain a speech transcription using the pretrained Wav2Vec2 model. Run the widget below and make changes in the launched Jupyter Notebook file.

!pip install -q transformers
!pip install librosa
!pip install torch

import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

educative_audio, educative_rate = librosa.load("/usr/local/notebooks/hardvard.wav", sr=16000)  # load the audio at 16 kHz
tokeniz_educative = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")  # pretrained tokenizer
model_educative = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")  # pretrained CTC model
my_audio_input_values = tokeniz_educative(educative_audio, return_tensors="pt").input_values  # tokenize the waveform

educatives_logits = model_educative(my_audio_input_values).logits  # raw per-frame predictions
educatives_prediction = torch.argmax(educatives_logits, dim=-1)  # most likely token at each frame
transcription = tokeniz_educative.batch_decode(educatives_prediction)[0]  # decode token IDs into text

print(transcription)
Code implementation for speech-to-text conversion

Code explanation

  • Lines 1–3: To start, we install the necessary libraries: transformers (for acquiring pretrained models), librosa (a Python package for audio analysis), and torch.

  • Lines 5–7: After installing these libraries, we import them into our notebook. For the speech-to-text conversion task, we import Wav2Vec2ForCTC and Wav2Vec2Tokenizer from transformers.

  • Lines 9–12: We load the audio file hardvard.wav, whose speech we want transcribed, into the variable educative_audio, setting the sampling rate to 16,000 Hz (the rate Wav2Vec2 expects). tokeniz_educative is initialized with a Wav2Vec2Tokenizer loaded from the pretrained facebook/wav2vec2-base-960h checkpoint. Line 11 initializes a Wav2Vec2ForCTC model from the same checkpoint. The last line of this section tokenizes the audio into input values that can be fed to the model.

  • Lines 14–18: The first line of this section passes the input values to the Wav2Vec2ForCTC model and fetches the raw predictions, called logits. We then take the most likely class at each timestep with torch.argmax and store the result in educatives_prediction. Finally, the predicted token IDs are decoded into human-readable text, stored in the transcription variable, and printed on the last line.
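
Note that the transformers library now recommends Wav2Vec2Processor over Wav2Vec2Tokenizer for preparing raw audio. The sketch below is an equivalent version of the code above using that class, with the same checkpoint and audio path assumed:

import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Same audio file and checkpoint as in the example above
audio, rate = librosa.load("/usr/local/notebooks/hardvard.wav", sr=16000)
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():                            # no gradients needed for inference
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])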

Conclusion

Wav2Vec2 uses self-supervised learning for speech-to-text conversion by processing the audio with convolutional layers and then applying a transformer to build contextualized speech representations. Because Wav2Vec2 can learn these representations from large amounts of unannotated audio, it reduces the amount of labeled data required for training.


Frequently asked questions



How can I convert my speech to text?

You can convert your speech to text by using speech recognition software or APIs, such as Google Speech-to-Text or Microsoft’s Azure Speech Service.


What algorithm is used for speech-to-text conversion?

Speech-to-text conversion commonly uses deep learning algorithms like Wav2Vec2, which combines convolutional neural networks (CNNs) and transformers.


How do I transcribe the given speech to text?

Use a pretrained model like Wav2Vec2 with a tokenizer to convert audio input directly into text.


