How to use autoencoders for audio recognition

Key takeaways:

  • Autoencoders are unsupervised neural networks that compress and reconstruct data, commonly used for tasks like anomaly detection, denoising, and unsupervised feature learning.

  • An autoencoder consists of an encoder, which compresses input data into a latent representation, and a decoder, which reconstructs the original data. Variants like variational autoencoders (VAEs) extend these principles to generative tasks.

  • To recognize audio, autoencoders can compress audio snippets and later match them to their corresponding files in the dataset. The encoder generates latent representations, which can be compared using cosine similarity.

  • The process involves defining MFCC features, normalizing them, and using the encoder model to compress and recognize audio. The test audio snippet is matched to the audio file in the dataset with the highest cosine similarity value.

An autoencoder is an unsupervised neural network that seeks to match its inputs to outputs, efficiently compressing and encoding data before reconstructing it to closely resemble the original input. Notably, autoencoders excel at anomaly detection, identifying deviations from learned representations as potential irregularities. In image processing, they demonstrate proficiency in denoising (removing noise or unwanted artifacts from a signal to enhance its clarity) by reconstructing clean images from noisy versions. Their role extends to unsupervised feature learning, extracting meaningful patterns from raw data without labeled examples. Additionally, autoencoders reduce dimensionality, facilitating efficient data storage and visualization.

Understanding autoencoders

The diagram below illustrates the architecture of an autoencoder, a type of artificial neural network used for unsupervised learning. It consists of an input layer, encoding layers that compress the input data into a latent representation, and decoding layers that reconstruct the data back into the original input format at the output layer.

Architecture of an autoencoder

Applications of autoencoders

Variational autoencoders (VAEs) extend autoencoder principles to generative tasks, alongside other generative models such as generative adversarial networks (GANs). The utility of autoencoders spans recommendation systems, speech recognition, time series prediction, medical image analysis, and natural language processing (NLP), showcasing their versatility across machine learning applications.

We can use the autoencoder design to recognize any sort of data by providing a smaller chunk of the data. For example, we can provide the autoencoder with a small chunk of an image and utilize our autoencoder to tell us which image the smaller chunk belongs to in our dataset. Similarly, we can do the same for audio data. We can use the autoencoder architecture to identify which audio file the provided audio snippet closely resembles in our dataset.

Matching audio chunks using encoder part of the autoencoder

Implementing autoencoders for audio recognition

To employ autoencoders for audio recognition, start by designing an autoencoder with an encoder-decoder architecture to compress and reconstruct the audio snippets. Train the autoencoder to minimize reconstruction error using a loss such as mean squared error. Then, separate the encoder and pass the dataset through it to obtain each file's latent features. Finally, pass the test audio snippet to the encoder and calculate the cosine similarity between its latent representation and the latent representations of the files in our dataset.

Building an encoder for latent audio representation

First, we need to create an encoder that will convert our audio files into their equivalent latent representations. To do this, we will create our autoencoder architecture and train it on our dataset. Our dataset consists of 10–15-second audio chunks from various music pieces, and our goal is to match an audio chunk to the music piece it belongs to. We will then separate the encoder and use it to get the latent representation of our dataset:

from extract_mfcc import extract_mfcc
from keras.layers import Input, Dense
from keras.models import Model
import librosa
import numpy as np
import os

audio_path = '<PATH TO TRAINING FILES HERE>'
file_list = os.listdir(audio_path)
x_values = []

for file_name in file_list:
    if file_name[0] != '.':  # Skip hidden files
        mfcc = extract_mfcc(f'{audio_path}/{file_name}')
        x_values.append(mfcc)

x_values = np.array(x_values)
num_mfcc_features = x_values.shape[1]

# Normalize the data (assuming mean and std normalization)
mean = np.mean(x_values, axis=0)
std = np.std(x_values, axis=0)
X_train_normalized = (x_values - mean) / std

# Defining an autoencoder
input_shape = (num_mfcc_features,)
encoding_dim = 32

input_layer = Input(shape=input_shape)
encoded = Dense(128, activation='relu')(input_layer)
encoded = Dense(encoding_dim, activation='relu', name='encoding_layer')(encoded)
decoded = Dense(128, activation='relu')(encoded)
decoded = Dense(num_mfcc_features, activation='linear')(decoded)

autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='mean_squared_error')
autoencoder.fit(X_train_normalized, X_train_normalized, epochs=50, batch_size=16)  # Illustrative hyperparameters

# Extract the encoder part of the autoencoder
encoder = Model(inputs=autoencoder.input, outputs=autoencoder.get_layer('encoding_layer').output)
encoded_data = encoder.predict(X_train_normalized)

np.save('LRDB.npy', encoded_data)
encoder.save('encoder_model.h5')

Explanation

Let's walk through the code above:

  • Line 8: Specifies where to insert the directory path of the training audio files.

  • Lines 9–10: Retrieves the list of files in the specified directory and initializes an empty list to store the MFCC feature vectors.

  • Lines 12–15: Loops through the files in the directory, checks if the file is not a hidden one (not starting with a dot), extracts MFCC features using the extract_mfcc function and appends them to the list.

  • Line 17: Converts the list of MFCC features into a NumPy array for efficient computation.

  • Line 18: Determines the number of features in the MFCC by examining the shape of the array.

  • Lines 21–22: Calculates the mean and standard deviation of the MFCC features across all examples.

  • Line 23: Normalizes the MFCC feature vectors by subtracting the calculated mean and dividing by the standard deviation.

  • Lines 26–33: Defines the architecture of an autoencoder, including the input layer’s shape, the size of the encoding dimension, the encoder layers with ReLU activation, and the decoder layers with linear activation to reconstruct the input.

  • Line 35: Instantiates the autoencoder model by specifying the input and output layers.

  • Line 36: Compiles the autoencoder model with the Adam optimizer and mean squared error as the loss function, making it ready for training.

  • Line 37: Trains the autoencoder to reconstruct its own normalized input; the epoch count and batch size shown are illustrative.

  • Line 40: Creates a separate model for the encoder by extracting the relevant part from the full autoencoder.

  • Line 41: Uses the encoder model to compress the normalized MFCC feature vectors into their latent representations.

  • Lines 43–44: Saves the encoded data as a NumPy file and the encoder model as an HDF5 file, which can be used later for further processing or inference.

Performing audio recognition using latent representation

Now, we can use our encoder to get the latent representation of a new input snippet and calculate the cosine similarity between it and the saved latent representations of the training data. The file with the maximum similarity is the one the test snippet most likely belongs to.
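Cosine similarity measures how closely two latent vectors point in the same direction, independent of their magnitudes:

cos_sim(a, b) = (a · b) / (||a|| × ||b||)

A value near 1 means the two representations are nearly identical in direction, indicating a strong match between the snippet and the stored file.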

Performing audio recognition (the dataset consists of audio chunk files such as '1 (1)_chunk1.wav' through '1 (5)_chunk6.wav')
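Below is a minimal sketch of this recognition step, assuming the encoder model and latent representations saved earlier ('encoder_model.h5' and 'LRDB.npy'). The extract_mfcc() helper shown here is an assumed librosa-based implementation consistent with the one used during training; the paths and the n_mfcc value are illustrative.

from keras.models import load_model
from sklearn.metrics.pairwise import cosine_similarity
import librosa
import numpy as np
import os

def extract_mfcc(file_path, n_mfcc=20):
    # Load the audio and average the MFCCs over time to get one vector per file
    y, sr = librosa.load(file_path)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.mean(mfcc, axis=1)

audio_path = '<PATH TO TRAINING FILES HERE>'
file_list = [f for f in os.listdir(audio_path) if f[0] != '.']

# Recompute the training-time normalization statistics (illustrative; in
# practice you would save the mean and std alongside the encoder)
train_mfccs = np.array([extract_mfcc(f'{audio_path}/{f}') for f in file_list])
mean, std = train_mfccs.mean(axis=0), train_mfccs.std(axis=0)

# Load the saved encoder and the latent representations of the dataset
encoder = load_model('encoder_model.h5')
dataset_latents = np.load('LRDB.npy')

# Extract and normalize the MFCC features of the test snippet
test_mfcc = extract_mfcc('<PATH TO TEST SNIPPET HERE>')
test_latent = encoder.predict(((test_mfcc - mean) / std).reshape(1, -1))

# The snippet belongs to the file with the highest cosine similarity
similarities = cosine_similarity(test_latent, dataset_latents)[0]
print('Best match:', file_list[int(np.argmax(similarities))])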

Code explanation

  • extract_mfcc(): A helper function that extracts the MFCC features from an audio file, used for both the dataset files and the test chunk.

  • Feature extraction: We extract and normalize the MFCC values of the test snippet so that it matches the format the encoder was trained on.

  • Similarity matching: We pass the extracted MFCC features to the encoder and calculate the cosine similarity between the latent representation of the test audio and each stored latent representation; the test snippet is matched to the audio file with the greatest cosine similarity value.

Conclusion

In conclusion, by training the autoencoder to minimize reconstruction error using mean squared error, we can create a model capable of compressing and reconstructing audio snippets effectively. Subsequently, extracting latent representations from the trained encoder enables us to compare audio snippets using cosine similarity, offering a robust method for audio recognition within our dataset.

Frequently asked questions



How does an autoencoder work for anomaly detection?

An autoencoder is trained to compress and reconstruct input data. During training, it learns to minimize reconstruction errors for normal data. When anomalous data is fed into the model, it fails to reconstruct the input accurately, resulting in a higher reconstruction error. This reconstruction error is used as a measure to identify anomalies.
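A minimal sketch of this idea, assuming the reconstructions come from an already-trained autoencoder and that the threshold has been chosen from the error distribution on normal data (both illustrative):

import numpy as np

def detect_anomalies(X, reconstructions, threshold):
    # Mean squared reconstruction error per sample
    errors = np.mean((X - reconstructions) ** 2, axis=1)
    # Samples the autoencoder reconstructs poorly are flagged as anomalies
    return errors > threshold

# Toy example (in practice, reconstructions = autoencoder.predict(X))
X = np.array([[0.1, 0.2], [5.0, 9.0]])
reconstructions = np.array([[0.11, 0.19], [1.0, 2.0]])
print(detect_anomalies(X, reconstructions, threshold=0.5))  # [False  True]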


What are the practical applications of autoencoders?

Autoencoders have several practical applications, including:

  • Anomaly detection in network traffic, healthcare, and manufacturing.
  • Dimensionality reduction for data visualization and preprocessing.
  • Image denoising by removing noise while preserving important details.
  • Generating synthetic data and feature extraction for machine learning models.
  • Recommender systems for capturing latent features in user preferences.

Which algorithm is best for anomaly detection?

The best algorithm for anomaly detection depends on the use case:

  • Autoencoders for high-dimensional data like images and sequences.
  • Isolation Forests for tabular data with outliers.
  • K-means clustering for simple clustering-based anomalies.
  • One-Class SVM for cases with limited labeled anomalies.
  • Time-series models (like LSTM or ARIMA) for detecting anomalies in sequential data.

