How to create a dataset for Huggingface models

Huggingface.co is a platform for deep learning models that provides extensive functionality for handling these models across their lifecycles. Part of this functionality is the Huggingface datasets library, which enables us to use shared datasets or create custom ones. Huggingface models accept data as objects of the Dataset class in the datasets library. These objects store data as key-value pairs in a dictionary-like structure. The datasets available on the Huggingface Hub can be loaded by passing their name and configuration tags to the load_dataset() function in the datasets library. For creating our own datasets, on the other hand, we have multiple options. The most straightforward of these is loading a dataset from a pandas DataFrame using the from_pandas() method of the Dataset class.

The from_pandas() method

The from_pandas() method in the Dataset class of the Huggingface datasets library enables us to load each row of a pandas DataFrame as an item that can be accessed by keys in the following way:

data[0]["audio"]["array"]

The from_pandas() method accepts the following arguments:

  • df: This is the pandas DataFrame you want to load the data from.

  • split: This is the name of the dataset split that you want to create, i.e., train, test, or validation.

We can also add files such as audio clips to the Dataset object by including the path to the file for each row in a column of the DataFrame, using from_pandas() to convert the DataFrame into a Dataset instance, and then using the cast_column() method to cast that column to a feature. The cast_column() method accepts the following arguments:

  • column: This is the column name in the Dataset object.

  • feature: This is the feature we want to convert it to.

Code example

Suppose that we’re working on a speech recognition project. We have audio files in a folder named Audio, along with a text file that contains transcriptions for the files in the following format:

This device has a cathode inside an anode wire cage.
This product is almost always produced by the industrialized method.
It is named after Edward Singleton Holden.
It is north west of the regional centre of Clare.
He was a nephew of Rear-Admiral Sir Francis Augustus Collier.
Leaving for some darn camp in Mississippi.
While employed in this role, Johnson won the prestigious Robert F. Kennedy Award.
transcriptions.txt

Run the following widget to see the folder structure of the prepared dataset.

# Click RUN to see the folder structure

In the following code widget, we’ll write code for loading the dataset that has been described above.

import pandas as pd
import os
from datasets import Audio, Dataset

# Creating a pandas DataFrame
mypath = '/ASR_dummy_en/'
audio_folder = '/Audio/'
train_dataset = pd.DataFrame({})

for fol in os.listdir(mypath):
    dataset = pd.DataFrame({})
    print("checking:\t", fol)
    if os.path.isdir(mypath + fol):
        try:
            dataset['audio'] = [mypath + fol + audio_folder + f for f in os.listdir(mypath + fol + audio_folder) if os.path.isfile(os.path.join(mypath + fol + audio_folder, f))]
            dataset['sentence'] = pd.read_csv(mypath + fol + '/transcriptions.txt', header=None)[0].apply(str)
            dataset['path'] = dataset['audio']
            print('concatenating\t', len(dataset), len(train_dataset))
            train_dataset = pd.concat([train_dataset, dataset], axis=0)
        except Exception as E:
            print("Error at:\n", str(E))

# Load the dataset and cast column to feature
data = Dataset.from_pandas(train_dataset, split="train")
data = data.cast_column("audio", Audio(sampling_rate=16_000))
print('Sample data row:\n', data[0])

Code explanation

  • Line 1–3: We import the relevant libraries.

  • Line 6–7: We set the folder names.

  • Line 8: We create the pandas DataFrame to load the dataset from. The data will be accumulated into the train_dataset DataFrame.

  • Line 10: We iterate through the folders in the dataset.

  • Line 11–21: We create an empty DataFrame named dataset. If the current entry is a valid directory, we populate dataset with the audio file paths and the transcriptions, and concatenate it into train_dataset. The audio column, which holds the paths of the audio files, is the column we'll later cast so that the Dataset instance loads the audio clips.

  • Line 24: We create a Dataset instance, and load data to it using the from_pandas() method.

  • Line 25: We cast the audio column to the Audio feature, with a sampling rate of 16 kHz, using the cast_column() method.

  • Line 26: We check that the dataset has been loaded correctly by printing data[0]. If it has, the output will look like the following:

{'audio': {'path': '/ASR_dummy_en/141231/Audio/1272-141231-0017.flac',
'array': array([-5.79833984e-04, -3.66210938e-04, -7.01904297e-04, ...,-2.44140625e-04, -6.10351562e-05, -3.05175781e-05]),
'sampling_rate': 16000},
'sentence': '1272-141231-0000 A MAN SAID TO THE UNIVERSE SIR I EXIST',
'path': '/ASR_dummy_en/141231/Audio/1272-141231-0017.flac', '__index_level_0__': 0}
Sample row in the dataset

Note: We can concatenate multiple Datasets using the concatenate_datasets() function in the Huggingface datasets library. This function accepts a list of Dataset objects and returns the concatenated Dataset.


Copyright ©2025 Educative, Inc. All rights reserved