What is train/test/validation set splitting?

Train, test, and validation set splitting is a fundamental concept in machine learning. It involves dividing a dataset into three subsets: the training, validation, and test sets. Each subset has a specific role in the machine learning workflow: training the model effectively, tuning its hyperparameters, and evaluating its performance.

  • Training dataset: This set is used to train the machine learning model. The model learns patterns or features from this data and adjusts its parameters accordingly. It is the largest portion of the dataset and should contain diverse examples so the model is exposed to a wide range of scenarios.

  • Validation dataset: This set is carved out of the training data but is not used to fit the model. It is used during training to tune the model’s hyperparameters and to detect overfitting.

  • Testing dataset: This set is used to evaluate the trained model. Once the model has been trained on the training data, the test data measures its accuracy and performance on unseen examples.

Figure: Workflow of the train, test, and validation sets in a machine learning model

Splitting of the dataset

As discussed above, splitting our data into separate sets helps us judge the performance of a machine learning model. There is no single optimal splitting ratio; the right split depends on the amount of available data and the model’s requirements.

  • If we have too little data to train the model, it might perform well on the training examples but fail to generalize to new examples, a problem known as overfitting.

  • If the model has few or no hyperparameters to tune, we can allocate a small portion of the data to the validation set, or omit it entirely. Conversely, if the model is complex and has several hyperparameters to tune, we need a larger validation set.

  • If the dataset is imbalanced, meaning some classes have far fewer examples than others, improper splitting can degrade the model’s performance.

Techniques for data splitting

Numerous techniques are used to split datasets into training, testing, and validation sets. Some of them are mentioned below:

Random sampling

The random sampling method is the most common method used in dataset splitting. The first step is to shuffle the dataset, and then randomly split the data into train, test, and validation sets according to the chosen splitting ratio. This method works well with class-balanced datasets but has a significant drawback on class-imbalanced datasets, since a random split can distort the class proportions in each subset. Here’s an example of random sampling using scikit-learn.
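The sketch below is one way to perform a random 70–15–15 split with scikit-learn’s train_test_split, applied twice. The NumPy arrays X and y are placeholder data invented for illustration; replace them with your own features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1,000 samples with 5 features each and binary labels
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# First split: hold out 30% of the data for validation + testing
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, shuffle=True, random_state=42
)

# Second split: divide the held-out 30% equally into validation and test sets,
# giving a 70-15-15 split overall
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, shuffle=True, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```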

Stratified sampling

The stratified sampling method addresses the problem of imbalanced class distributions. For example, suppose we have a dataset of 500 images containing two classes: vehicles and persons. The dataset contains 300 images (60%) of vehicles and 200 images (40%) of persons. Stratified sampling ensures that each split preserves this ratio (60% vehicles and 40% persons).
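A minimal sketch of stratified splitting with scikit-learn is shown below, assuming a made-up label array that mirrors the 300-vehicle/200-person example; the stratify argument of train_test_split preserves the class ratio in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder labels matching the example: 300 vehicle and 200 person images
y = np.array(["vehicle"] * 300 + ["person"] * 200)
X = np.arange(len(y)).reshape(-1, 1)  # stand-in for the image data (e.g., file indices)

# stratify=y keeps the 60/40 class ratio in both the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Verify the class ratio is preserved in the test set
unique, counts = np.unique(y_test, return_counts=True)
print(dict(zip(unique, counts)))  # roughly {'person': 40, 'vehicle': 60}
```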

K-fold cross-validation

As the name suggests, K-fold cross-validation splits the dataset for cross-validation. It divides the data into K folds and trains the model K times, each time using K-1 folds for training and the remaining fold for testing. This technique reduces the bias of a single split, because every example is used for both training and evaluation across the folds.
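Below is a short sketch of 5-fold cross-validation using scikit-learn’s KFold. The data and the LogisticRegression model are placeholders chosen only to make the example self-contained.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Placeholder data: 500 samples, 4 features, binary labels
X = np.random.rand(500, 4)
y = np.random.randint(0, 2, size=500)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []

for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # K-1 folds for training, the remaining fold for testing
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    scores.append(accuracy_score(y[test_idx], preds))

print("Per-fold accuracy:", [round(s, 2) for s in scores])
print("Mean accuracy:", round(np.mean(scores), 2))
```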

Splitting the dataset into these three sets helps prevent overfitting, provides a reliable estimate of a model’s performance, and ensures that it is robust when applied to real-world scenarios. There is no single optimal splitting ratio, but common ratios for training, validation, and testing are 60–20–20, 70–15–15, and 80–10–10.

Quiz: Train, test, and validation sets

Which dataset is primarily used to detect and address overfitting?

A) Training set
B) Validation set
C) Test set
