How to save a machine learning model using Python's pickle module

Once a model has been trained on a data set, we can save it with Python's pickle module, which implements binary protocols for serializing and deserializing Python objects. "Pickling" converts an object into a byte stream, and "unpickling" converts a byte stream back into an object.
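
To make the two terms concrete, here is a minimal sketch of a pickle round trip using only the standard library. The settings dictionary and the column names in it are hypothetical stand-ins; any picklable Python object, including a trained model, goes through the same two calls.

```python
import pickle

# Hypothetical stand-in object; a trained model is handled the same way.
settings = {"n_estimators": 100, "features": ["age", "chol"], "test_size": 0.2}

# Pickling: Python object -> byte stream.
payload = pickle.dumps(settings)

# Unpickling: byte stream -> a new object with the same contents.
restored = pickle.loads(payload)

print(type(payload).__name__)  # bytes
print(restored == settings)    # True: same contents
print(restored is settings)    # False: a separate copy
```

The dumps and loads functions work on bytes in memory, while dump and load, used later in this Answer, work on an open file.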

In this Answer, we will train a simple machine learning model, save it, and load it again so that we can use it to make predictions later.

Technologies used

We will be using the following technologies:

Pandas: We will use the pandas library to load the data set into a data frame.
Sklearn: We will use the RandomForestClassifier model from sklearn (scikit-learn, a Python machine learning library) and train it on our data set.
Pickle: We will use it to save our model and load it again in our program code.

Saving the model using the `pickle` module

In this section, we will be going through a step-by-step process in which we will:

Load a dataset.
Split the dataset into x (features) and y (output).
Perform a train-test split (training data = 80%, test data = 20%).
Import the RandomForestClassifier model and train it on the training data.
Save the model as a binary file with the .pk1 file extension.
Load the saved model and perform predictions.

Loading the dataset

To apply a machine learning model, we first need a data set. In this Answer, we use the heart-disease-dataset.csv file, which contains information related to heart disease. The code to read the CSV file is given below:

import pandas as pd

heart_disease_df = pd.read_csv("heart-disease-dataset.csv")

print(heart_disease_df.head())

Code explanation

Line 1: We import the pandas library.
Line 3: We read the CSV file using the read_csv function, which returns its contents as a pandas data frame.
Line 5: We print the first five rows of the data frame using the head function.

Train-test split

Now that we have loaded the data set, we split it into features and output. The output is the "target" column, which tells whether a person has heart disease or not. Once that is done, we split the features and output further into training and testing data. The code for both splits is given below:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

heart_disease_df = pd.read_csv("heart-disease-dataset.csv")

print(heart_disease_df.head())

x = heart_disease_df.drop("target" , axis = 1)
y = heart_disease_df['target']

np.random.seed(0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)

print("x shape:", x.shape)
print("y shape:", y.shape)
print("x_train shape:", x_train.shape)
print("x_test shape:", x_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

Code explanation

Line 2: We import the NumPy library, which we use in line 12 to fix the random seed to 0 so that the split is reproducible.
Line 3: We import the train_test_split function from the sklearn.model_selection module.
Line 9: We remove the "target" column from the loaded data set using the drop function and store the result as x (features).
Line 10: We extract the "target" column from the data set and save it as y (output).
Line 13: We pass x and y to the train_test_split function, which splits them into training and test data according to test_size.
Lines 15–20: We print the shapes of the data for confirmation.
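
As a complement to the explanation above, the following sketch shows the same 80/20 split on a small synthetic data frame, so it runs without the CSV file. The column names feature_a and feature_b and the random values are hypothetical stand-ins. It also uses the random_state parameter of train_test_split, an alternative to seeding NumPy globally that pins the shuffle for that one call.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, two feature columns, and a binary "target".
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
    "target": rng.integers(0, 2, size=100),
})

x = demo_df.drop("target", axis=1)
y = demo_df["target"]

# random_state fixes the shuffle for this call without touching NumPy's global seed.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
x_train_again, x_test_again, _, _ = train_test_split(x, y, test_size=0.2, random_state=0)

print(x_train.shape, x_test.shape)               # (80, 2) (20, 2)
print(x_test.index.equals(x_test_again.index))   # True: the same rows were selected
```

Either approach gives a repeatable split; random_state keeps the choice local to the call that needs it.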

Applying the model

So far, we have split our data into training and testing data. Next, we train the RandomForestClassifier model on the training data and calculate its accuracy on the test data. The code is given below:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

heart_disease_df = pd.read_csv("heart-disease-dataset.csv")

x = heart_disease_df.drop("target" , axis = 1)
y = heart_disease_df['target']

np.random.seed(0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)

model = RandomForestClassifier()

model.fit(x_train, y_train)

model_accuracy = model.score(x_test, y_test)

print("Model Accuracy:" , model_accuracy * 100 , "%")

Code explanation

Line 4: We import the RandomForestClassifier model from the sklearn.ensemble module.
Line 14: We create an object of the RandomForestClassifier model.
Line 16: We train the model on the training data using the fit method.
Line 18: We evaluate the model by passing the test data to the score method, which returns the mean accuracy on that data.
Line 20: We print the accuracy as a percentage.
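
To show what the score method reports for a classifier, the sketch below compares it with an accuracy computed from explicit predictions. It uses synthetic data from make_classification as a stand-in for the heart disease data, and a small n_estimators value chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the heart disease data set).
x, y = make_classification(n_samples=200, n_features=8, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=25, random_state=0)
model.fit(x_train, y_train)

# score() predicts on x_test and compares the result with y_test.
score_accuracy = model.score(x_test, y_test)

# The same number, computed step by step.
manual_accuracy = accuracy_score(y_test, model.predict(x_test))

print(f"score(): {score_accuracy:.3f} | accuracy_score(): {manual_accuracy:.3f}")
```

Both values are the fraction of test rows whose predicted class matches the true class.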

Congratulations! We have created a classification model using sklearn.

Saving the model

Now that we have trained our model, we will save it using Python's pickle module. The code for it is given below:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import pickle

heart_disease_df = pd.read_csv("heart-disease-dataset.csv")

x = heart_disease_df.drop("target" , axis = 1)
y = heart_disease_df['target']

np.random.seed(0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)

model = RandomForestClassifier()

model.fit(x_train, y_train)

model_accuracy = model.score(x_test, y_test)

print("Model Accuracy:" , model_accuracy * 100 , "%")

pickle.dump(model , open('heart-disease-model.pk1' , 'wb'))

Code explanation

Line 5: We import the pickle module.
Line 23: We save the model using the dump function provided by the pickle module. The function takes two parameters:
- 1st parameter: The object (here, the model) to be saved.
- 2nd parameter: The file object to write to. We get it from the open function, which takes the file name (heart-disease-model.pk1) and the mode for opening the file (wb, that is, write in binary mode).

Note: Use the ls command in the terminal to view the saved file.

Running the code above saves the model to a binary file, which can be shared and loaded later, as we will do in the next section.
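
One possible variation on the saving step is to open the file in a with block, which closes the file handle as soon as the dump finishes. The sketch below is an illustration under stated assumptions: the model is trained on synthetic data, and the file name demo-model.pkl and the temporary directory are hypothetical choices made so that the example leaves nothing behind. It also checks for the saved file from Python instead of the terminal.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small model, used only for illustration.
x, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(x, y)

# Hypothetical output location: a temporary directory that is removed afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo-model.pkl")

    # The with block closes the file even if dump raises an error.
    with open(model_path, "wb") as model_file:
        pickle.dump(model, model_file)

    # Confirm that the file was written before the directory is cleaned up.
    file_exists = os.path.exists(model_path)
    file_size = os.path.getsize(model_path)

print("saved:", file_exists, "| bytes:", file_size)
```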

Loading the model

The code to load the saved model is given below:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import pickle

heart_disease_df = pd.read_csv("heart-disease-dataset.csv")

x = heart_disease_df.drop("target" , axis = 1)
y = heart_disease_df['target']

np.random.seed(0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)

model = RandomForestClassifier()

model.fit(x_train, y_train)

model_accuracy = model.score(x_test, y_test)

print("Model Accuracy:" , model_accuracy * 100 , "%")

pickle.dump(model , open('heart-disease-model.pk1' , 'wb'))

loaded_model = pickle.load(open('heart-disease-model.pk1' , 'rb'))

loaded_model_accuracy = loaded_model.score(x_test, y_test)

print("Loaded Model Accuracy:" , loaded_model_accuracy * 100 , "%")
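
The code above compares accuracies. To see the loaded model actually making predictions, a sketch along the following lines may help. It assumes synthetic stand-in data, and it round-trips the model through bytes in memory with dumps and loads rather than through a file; the first five rows stand in for new, unseen rows.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small model, used only for illustration.
x, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(x, y)

# Round trip through bytes; dump and load with a file behave the same way.
loaded_model = pickle.loads(pickle.dumps(model))

# Stand-in for new rows that we want predictions for.
new_rows = x[:5]
original_predictions = model.predict(new_rows)
loaded_predictions = loaded_model.predict(new_rows)

predictions_match = np.array_equal(original_predictions, loaded_predictions)
print("loaded model predictions:", loaded_predictions)
print("identical to original model:", predictions_match)
```

Two cautions apply when loading real files. Unpickling can execute arbitrary code, so we should only load pickle files from sources we trust. It is also safest to load a model with the same sklearn version that was used to save it.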