How to create a dataset for Huggingface models

Huggingface.co is a platform for deep learning models that provides extensive functionality for handling these models across their lifecycles. Part of this functionality is the Huggingface datasets library, which enables us to use shared datasets or create custom ones. Huggingface models accept data as objects of the Dataset class in the datasets library. These objects store data as key-value pairs in a dictionary-like structure. The datasets available on the Huggingface Hub can be loaded by passing their name and configuration tags to the load_dataset() function in the datasets library. For creating our own datasets, on the other hand, we have multiple options. The most straightforward of these is loading a dataset from a pandas DataFrame using the from_pandas() method of the Dataset class.

The from_pandas() method

The from_pandas() method in the Dataset class of the Huggingface datasets library enables us to load each row of a pandas DataFrame as an item that can be accessed by keys in the following way:

data[0]["audio"]["array"]

The from_pandas() method accepts the following arguments:

  • df: This is the pandas DataFrame you want to load the data from.

  • split: This is the name of the dataset split that you want to create, i.e., train, test, or validation.

We can also add files such as audio clips to the Dataset object by including the path to the file for each row in a column of the DataFrame, using from_pandas() to convert the DataFrame into a Dataset instance, and then using the cast_column() method to cast that column to a feature. The cast_column() method accepts the following arguments:

  • column: This is the column name in the Dataset object.

  • feature: This is the feature we want to convert it to.

Code example

Suppose that we’re working on a speech recognition project. We have audio files in a folder named Audio, along with a text file that contains transcriptions for the files in the following format:

This device has a cathode inside an anode wire cage.
This product is almost always produced by the industrialized method.
It is named after Edward Singleton Holden.
It is north west of the regional centre of Clare.
He was a nephew of Rear-Admiral Sir Francis Augustus Collier.
Leaving for some darn camp in Mississippi.
While employed in this role, Johnson won the prestigious Robert F. Kennedy Award.
transcriptions.txt

Run the following widget to see the folder structure of the prepared dataset.

# Click RUN to see the folder structure

In the following code widget, we’ll write code for loading the dataset that has been described above.

import pandas as pd
import os
from datasets import Audio, Dataset

# Creating a pandas DataFrame
mypath = '/ASR_dummy_en/'
audio_folder = '/Audio/'
train_dataset = pd.DataFrame({})

for fol in os.listdir(mypath):
    dataset = pd.DataFrame({})
    print("checking:\t", fol)
    if os.path.isdir(mypath + fol):
        try:
            dataset['audio'] = [mypath + fol + audio_folder + f for f in os.listdir(mypath + fol + audio_folder) if os.path.isfile(os.path.join(mypath + fol + audio_folder, f))]
            dataset['sentence'] = pd.read_csv(mypath + fol + '/transcriptions.txt', header=None)[0].apply(str)
            dataset['path'] = dataset['audio']
            print('concatenating\t', len(dataset), len(train_dataset))
            train_dataset = pd.concat([train_dataset, dataset], axis=0)
        except Exception as E:
            print("Error at:\n", str(E))

# Load the dataset and cast column to feature
data = Dataset.from_pandas(train_dataset, split="train")
data = data.cast_column("audio", Audio(sampling_rate=16_000))
print('Sample data row:\n', data[0])

Code explanation

  • Line 1–3: We import the relevant libraries.

  • Line 6–7: We set the folder names.

  • Line 8: We create the pandas DataFrame to load the dataset from. The data will be accumulated into the train_dataset DataFrame.

  • Line 10: We iterate through the folders in the dataset.

  • Line 11–21: We create an empty DataFrame named dataset. If the current entry is a valid directory, we populate dataset with the audio file paths and the transcriptions, and concatenate it into train_dataset. The audio column, which holds the paths of the audio files, is the column we'll later cast so that the Dataset instance loads the audio clips.

  • Line 24: We create a Dataset instance, and load data to it using the from_pandas() method.

  • Line 25: We cast the audio column to the Audio feature, with a sampling rate of 16 kHz, using the cast_column() method.

  • Line 26: We check that the dataset has been loaded correctly by printing data[0]. If it has, the output will look like the following:

{'audio': {'path': '/ASR_dummy_en/141231/Audio/1272-141231-0017.flac',
'array': array([-5.79833984e-04, -3.66210938e-04, -7.01904297e-04, ...,-2.44140625e-04, -6.10351562e-05, -3.05175781e-05]),
'sampling_rate': 16000},
'sentence': '1272-141231-0000 A MAN SAID TO THE UNIVERSE SIR I EXIST',
'path': '/ASR_dummy_en/141231/Audio/1272-141231-0017.flac', '__index_level_0__': 0}
Sample row in the dataset

Note: We can concatenate multiple Datasets using the concatenate_datasets() function in the Huggingface datasets library. This function accepts a list of Dataset objects and returns the concatenated Dataset.


Copyright ©2025 Educative, Inc. All rights reserved