Hugging Face (huggingface.co) is a platform for deep learning models that provides extensive functionality for handling these models across their life cycles. Part of this functionality is the Hugging Face `datasets` library, which enables us to use shared datasets or create custom ones. Hugging Face models accept data as objects of the `Dataset` class in the `datasets` library. These objects store data as key-value pairs in a dictionary-like structure. The datasets available on the Hugging Face Hub can be loaded by passing their name and configuration tags to the `load_dataset()` function of the `datasets` library. For creating our own datasets, on the other hand, we have multiple options. The most straightforward of these is loading a dataset from a pandas `DataFrame` using the `from_pandas()` method of the `Dataset` class.
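For example, a public dataset can be pulled from the Hub by name and configuration tag (the dataset and configuration names below are illustrative examples, not part of our project):

```python
from datasets import load_dataset

# Load the MRPC configuration of the GLUE benchmark from the Hub;
# any public dataset name and configuration tag works the same way
dataset = load_dataset("glue", "mrpc", split="train")
print(dataset[0])
```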
## The `from_pandas()` method

The `from_pandas()` method in the `Dataset` class of the Hugging Face `datasets` library enables us to load each row of a pandas `DataFrame` as an item that can be accessed by keys in the following way:

```python
data[0]["audio"]["array"]
```
The `from_pandas()` method accepts the following arguments:

- `df`: The pandas `DataFrame` to load the data from.
- `split`: The name of the dataset split we want to create, e.g., `train`, `test`, or `validation`.
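As a minimal sketch (with made-up data), converting a `DataFrame` into a `Dataset` looks like this:

```python
import pandas as pd
from datasets import Dataset

# A toy DataFrame; the column names here are illustrative
df = pd.DataFrame({
    "sentence": ["A man said to the universe.", "Sir, I exist."],
    "label": [0, 1],
})

# Each DataFrame row becomes one item in the Dataset
data = Dataset.from_pandas(df, split="train")
print(data[0])  # {'sentence': 'A man said to the universe.', 'label': 0}
```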
We can also add files such as audio clips to the `Dataset` object by including the path to the file for each row in a column of the `DataFrame`, using `from_pandas()` to convert the `DataFrame` into a `Dataset` instance, and then using the `cast_column()` method to cast that column to a feature type. The `cast_column()` method accepts the following arguments:

- `column`: The name of the column in the `Dataset` object.
- `feature`: The feature type we want to cast the column to.
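For instance, a column of audio file paths can be cast to the `Audio` feature like this (a sketch; the file path is hypothetical):

```python
import pandas as pd
from datasets import Audio, Dataset

# Hypothetical path to a local audio file
df = pd.DataFrame({"audio": ["clips/sample_0001.wav"]})
data = Dataset.from_pandas(df, split="train")

# Cast the "audio" column from plain strings to the Audio feature;
# the file is decoded (and resampled to 16 kHz) lazily on access
data = data.cast_column("audio", Audio(sampling_rate=16_000))
```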
Suppose that we’re working on a speech recognition project. We have audio files in a folder named `Audio`, along with a text file that contains transcriptions for the files in the following format:

```
This device has a cathode inside an anode wire cage.
This product is almost always produced by the industrialized method.
It is named after Edward Singleton Holden.
It is north west of the regional centre of Clare.
He was a nephew of Rear-Admiral Sir Francis Augustus Collier.
Leaving for some darn camp in Mississippi.
While employed in this role, Johnson won the prestigious Robert F. Kennedy Award.
```
The prepared dataset has the following folder structure: a top-level `ASR_dummy_en` folder containing one subfolder per recording (e.g., `141231`), each of which holds an `Audio` directory of audio clips and a `transcriptions.txt` file.
In the following code, we load the dataset that has been described above.

```python
import pandas as pd
import os
from datasets import Audio, Dataset

# Creating a pandas DataFrame
mypath = '/ASR_dummy_en/'
audio_folder = '/Audio/'
train_dataset = pd.DataFrame({})

for fol in os.listdir(mypath):
    dataset = pd.DataFrame({})
    print("checking:\t", fol)
    if os.path.isdir(mypath + fol):
        try:
            dataset['audio'] = [mypath + fol + audio_folder + f for f in os.listdir(mypath + fol + audio_folder) if os.path.isfile(os.path.join(mypath + fol + audio_folder, f))]
            dataset['sentence'] = pd.read_csv(mypath + fol + '/transcriptions.txt', header=None)[0].apply(str)
            dataset['path'] = dataset['audio']
            print('concatenating\t', len(dataset), len(train_dataset))
            train_dataset = pd.concat([train_dataset, dataset], axis=0)
        except Exception as E:
            print("Error at:\n", str(E))

# Load the dataset and cast column to feature
data = Dataset.from_pandas(train_dataset, split="train")
data = data.cast_column("audio", Audio(sampling_rate=16_000))
print('Sample data row:\n', data[0])
```
- Lines 1–3: We import the relevant libraries.
- Lines 6–7: We set the folder names.
- Line 8: We create the pandas `DataFrame` to load the dataset from. The data will be accumulated into the `train_dataset` DataFrame.
- Line 10: We iterate through the folders in the dataset.
- Lines 11–21: For each folder, we create an empty DataFrame called `dataset`. If the folder is a valid directory, we populate `dataset` with the audio file paths and the transcriptions, and append it to `train_dataset`. The `audio` column, which holds the paths of the audio files, is the one we’ll later cast so that the audio clips are loaded into the `Dataset` instance.
- Line 24: We create a `Dataset` instance and load the data into it using the `from_pandas()` method.
- Line 25: We cast the `audio` column to the `Audio` feature using the `cast_column()` method.
- Line 26: We check whether the dataset has been loaded correctly using the command `data[0]`. If it has, it returns something like the following:
```
{'audio': {'path': '/ASR_dummy_en/141231/Audio/1272-141231-0017.flac',
           'array': array([-5.79833984e-04, -3.66210938e-04, -7.01904297e-04, ...,
                           -2.44140625e-04, -6.10351562e-05, -3.05175781e-05]),
           'sampling_rate': 16000},
 'sentence': '1272-141231-0000 A MAN SAID TO THE UNIVERSE SIR I EXIST',
 'path': '/ASR_dummy_en/141231/Audio/1272-141231-0017.flac',
 '__index_level_0__': 0}
```
Note: We can concatenate multiple `Dataset` objects using the `concatenate_datasets()` function in the Hugging Face `datasets` library. This function accepts a list of `Dataset` objects and returns the concatenated `Dataset`.
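As a quick sketch with toy data:

```python
from datasets import Dataset, concatenate_datasets

# Two toy datasets sharing the same column schema
ds1 = Dataset.from_dict({"sentence": ["hello", "world"]})
ds2 = Dataset.from_dict({"sentence": ["good", "morning"]})

combined = concatenate_datasets([ds1, ds2])
print(len(combined))         # 4
print(combined["sentence"])  # ['hello', 'world', 'good', 'morning']
```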