How to shuffle a dataset in TensorFlow

TensorFlow is an open-source machine learning framework developed by Google. It is used for tasks such as computer vision and natural language processing. When training a model, we usually shuffle the dataset so that the model doesn’t pick up on the order in which the samples happen to be stored. We can do this with the shuffle() method of tf.data.Dataset.

import tensorflow as tf
# Create a dataset
sample_dataset = tf.data.Dataset.range(15)  # Example dataset with numbers from 0 to 14
print("Original Dataset: ", end=" ")
for element in sample_dataset:
    print(element.numpy(), end=" ")
# Shuffle the dataset
shuffled_dataset = sample_dataset.shuffle(buffer_size=10)  # buffer_size specifies the size of the buffer used for shuffling
# Iterate over the shuffled dataset and print the elements
print("\nShuffled Dataset: ", end=" ")
for element in shuffled_dataset:
    print(element.numpy(), end=" ")

Code explanation

In the code given above:

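Line 1: We import the TensorFlow library.
Line 3: We create a sample dataset containing the numbers 0 to 14.
Lines 4–6: We print the elements of the original dataset.
Line 8: We call the dataset’s shuffle() method with a buffer size of 10. Because the buffer is smaller than the 15-element dataset, the shuffle is only approximate: the first element printed always comes from the first 10 elements. For a uniform shuffle, the buffer size must be at least the size of the dataset.
Lines 10–12: We print the shuffled dataset.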
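
To make the role of the buffer size more concrete, here is a minimal pure-Python sketch of a shuffle buffer. It is a simplified model of the behavior described in the TensorFlow documentation (fill a buffer, pick a random element from it, replace that element with the next one from the source), not TensorFlow’s actual implementation. The function name buffered_shuffle and its parameters are hypothetical and exist only for this illustration.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Simplified model of a shuffle buffer (hypothetical helper, not TensorFlow code)."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    source = iter(items)
    buffer = []
    output = []
    # Fill the buffer with the first buffer_size elements.
    for item in source:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # Emit a random buffered element, then refill its slot from the source.
    while buffer:
        index = rng.randrange(len(buffer))
        output.append(buffer[index])
        try:
            buffer[index] = next(source)
        except StopIteration:
            # The source is exhausted, so the buffer shrinks until it is empty.
            buffer.pop(index)
    return output


data = list(range(15))
no_shuffle = buffered_shuffle(data, buffer_size=1)            # order is unchanged
partial = buffered_shuffle(data, buffer_size=10, seed=0)      # approximate shuffle
full = buffered_shuffle(data, buffer_size=len(data), seed=0)  # every order is possible

print("buffer_size=1: ", no_shuffle)
print("buffer_size=10:", partial)
print("buffer_size=15:", full)
```

In this model, a buffer size of 1 leaves the order untouched, and with a buffer size of 10 the element at output position k can only come from the first 10 + k input elements. Only a buffer at least as large as the dataset lets any element appear in any position.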

Reproducing shuffling order

Sometimes, we may want to reproduce the order produced by the shuffle() method. To do this, we can pass the seed parameter in our shuffle() call. Two shuffles of the same dataset that use the same buffer size and the same seed produce the same order. Let’s see this in action in the code given below.

import tensorflow as tf
# Create a dataset
sample_dataset = tf.data.Dataset.range(15)  # Example dataset with numbers from 0 to 14
print("Original Dataset: ", end=" ")
for element in sample_dataset:
    print(element.numpy(), end=" ")
# Shuffle the dataset
shuffled_dataset_1 = sample_dataset.shuffle(buffer_size=10, seed=3)  # we define the buffer size and the seed.
shuffled_dataset_2 = sample_dataset.shuffle(buffer_size=10, seed=4) 
shuffled_dataset_3 = sample_dataset.shuffle(buffer_size=10, seed=3) 
# Iterate over the shuffled dataset and print the elements
print("\nShuffled Dataset 1: ", end=" ")
for element in shuffled_dataset_1:
    print(element.numpy(), end=" ")
print("\nShuffled Dataset 2: ", end=" ")
for element in shuffled_dataset_2:
    print(element.numpy(), end=" ")
print("\nShuffled Dataset 3: ", end=" ")
for element in shuffled_dataset_3:
    print(element.numpy(), end=" ")

Code explanation

In the code given above:

Line 1: We import the TensorFlow library.
Line 3: We create a sample dataset containing the numbers 0 to 14.
Lines 4–6: We print the elements of the original dataset.
Line 8: We call the shuffle() method with a buffer size of 10 and a seed value of 3.
Line 9: We call the shuffle() method with a buffer size of 10 and a seed value of 4.
Line 10: We call the shuffle() method with a buffer size of 10 and a seed value of 3.
Lines 12–20: We print the three shuffled datasets.

Once the code above is executed, we’ll note that the order of elements in shuffled datasets 1 and 3 is the same. This is because they were created with the same seed value and the same buffer size.
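
The seed is not the only setting that affects reproducibility. The shuffle() method also accepts a reshuffle_each_iteration argument, and by default the dataset is reshuffled on every pass, so a second loop over the same shuffled dataset will generally print a different order. The sketch below, which assumes TensorFlow 2.x with eager execution, checks both behaviors. The helper to_list is a hypothetical convenience function introduced only for this example.

```python
import tensorflow as tf


def to_list(dataset):
    """Collect a dataset's elements into a list of Python ints (hypothetical helper)."""
    return [int(value) for value in dataset.as_numpy_iterator()]


dataset = tf.data.Dataset.range(15)

# Two separate pipelines with the same seed: the first pass over each should match.
first = to_list(dataset.shuffle(buffer_size=15, seed=3))
second = to_list(dataset.shuffle(buffer_size=15, seed=3))
print("Same seed, same order:", first == second)

# reshuffle_each_iteration=False keeps one fixed order across passes (epochs).
fixed = dataset.shuffle(buffer_size=15, seed=3, reshuffle_each_iteration=False)
epoch_1 = to_list(fixed)
epoch_2 = to_list(fixed)
print("Fixed order across passes:", epoch_1 == epoch_2)
```

For training, the default reshuffling is usually what we want, since it gives the model a different order in every epoch. Turning it off is mainly useful when the exact same order is needed more than once, for example while debugging.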

Conclusion

To sum up, shuffling a dataset helps ensure that the models we train don’t depend on the order of the samples. To do this, we can use the shuffle() method available in TensorFlow, which randomly reorders the elements of the dataset using a buffer of the given size and, optionally, a seed for reproducibility.
