Machine learning is a subset of artificial intelligence that involves training computers on large data sets in order to make predictions or decisions. The workflow typically involves collecting data, preprocessing it so that it is suitable for training, training a model, evaluating and improving it, and finally making predictions with it.
Once a model has been trained on a data set, we can save it using Python's pickle module, which implements binary protocols for serializing and deserializing Python objects. "Pickling" converts an object into a byte stream, and "unpickling" converts a byte stream back into an object.
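The round trip described above can be sketched in memory, without touching any file. This is a minimal illustration: the settings dictionary is a made-up example object, not part of the heart disease project.

```python
import pickle

# A hypothetical settings dictionary, used only to illustrate the round trip.
settings = {"n_estimators": 100, "features": ["age", "chol"], "threshold": 0.5}

# Pickling: convert the object into a byte stream.
blob = pickle.dumps(settings)

# Unpickling: rebuild an equal (but separate) object from the byte stream.
restored = pickle.loads(blob)

print(type(blob).__name__)
print(restored == settings)
```

The dumps and loads functions work on bytes in memory, while dump and load, used later in this Answer, work on file objects.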
In this Answer, we will train a simple machine learning model, save it, and load it again so that we can make predictions with it in the future.
We will be using the following technologies:
Pandas: We will use the pandas library to load the data set into a data frame.
Sklearn: We will use the RandomForestClassifier model from sklearn (scikit-learn, a machine learning library for Python) and train it on our data set.
Pickle: We will use it to save our model and load it again in our program code.
In this section, we will go through a step-by-step process in which we will:
Load a data set.
Split the data set into x (features) and y (output) data frames.
Perform a train-test split (training data = 80%, test data = 20%).
Import the RandomForestClassifier model and train it on the training data.
Save the model as a binary file with the .pk1 file extension.
Load the saved model and perform predictions.
To apply a machine learning model, we first need a data set. In this Answer, we use the heart-disease-dataset.csv file, which contains information related to heart disease. The code to read the CSV file is given below:
import pandas as pd

heart_disease_df = pd.read_csv("heart-disease-dataset.csv")

print(heart_disease_df.head())
Line 1: We import the pandas library.
Line 3: We read the CSV file using the read_csv function from pandas, which parses the file and returns a pandas data frame.
Line 5: We print the first five rows of the data frame using the head method.
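Before moving on, it can help to check the size of the loaded data, whether any values are missing, and how the classes are distributed. The sketch below uses a tiny inline CSV as a stand-in for the real file, so the column names age and chol and all the values are assumptions made for illustration. Only the "target" column name comes from this Answer.

```python
import io

import pandas as pd

# Hypothetical stand-in for heart-disease-dataset.csv (made-up columns and values).
csv_text = """age,chol,target
63,233,1
37,250,1
56,236,0
57,,0
"""

# read_csv accepts a file path or any file-like object, such as StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)                     # (rows, columns)
print(sample_df.isna().sum())              # missing values per column
print(sample_df["target"].value_counts())  # number of rows per class
```

The same three calls can be applied to heart_disease_df once the real file has been loaded.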
Now that we have loaded the data set, we split it into features and output. The output is the "target" column, which tells whether a person has heart disease or not. We then split both parts further into training and testing data. The code for the splitting is given below:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

heart_disease_df = pd.read_csv("heart-disease-dataset.csv")

print(heart_disease_df.head())

x = heart_disease_df.drop("target", axis=1)
y = heart_disease_df["target"]

np.random.seed(0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

print("x shape:", x.shape)
print("y shape:", y.shape)
print("x_train shape:", x_train.shape)
print("x_test shape:", x_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
Line 2: We import the NumPy library, which we use in line 12 to fix the random seed to 0 so that the split is reproducible.
Line 3: We import the train_test_split function from the sklearn.model_selection module.
Line 9: We remove the "target" column from the loaded data set using the drop method and store the result as x (features).
Line 10: We extract the "target" column from the data set and save it as y (output).
Line 13: We pass x and y to the train_test_split function, which splits them into training and test data according to test_size (0.2 here, so 20% of the rows go to the test set).
Lines 15–20: We print the shapes of the data for confirmation.
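As an alternative to seeding NumPy globally, train_test_split also accepts a random_state argument, and a stratify argument that keeps the class proportions the same in both splits. The sketch below shows both on a synthetic data frame. The age and chol columns and their values are made up for illustration and are not taken from the heart disease data set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, two hypothetical features, and a balanced binary target.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "age": rng.integers(30, 80, size=100),
    "chol": rng.integers(150, 300, size=100),
    "target": np.repeat([0, 1], 50),
})

x = demo_df.drop("target", axis=1)
y = demo_df["target"]

# random_state makes this particular split reproducible;
# stratify=y preserves the 50/50 class ratio in both parts.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0, stratify=y
)

print(x_train.shape, x_test.shape)
print(y_test.value_counts().to_dict())
```

Passing random_state keeps the reproducibility local to this call, so other code that relies on NumPy's global random state is not affected.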
So far, we have split our data into training and testing data. We will now pass the training data to our RandomForestClassifier model and calculate the model's accuracy on the test data. The code is given below:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

heart_disease_df = pd.read_csv("heart-disease-dataset.csv")

x = heart_disease_df.drop("target", axis=1)
y = heart_disease_df["target"]

np.random.seed(0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

model = RandomForestClassifier()

model.fit(x_train, y_train)

model_accuracy = model.score(x_test, y_test)

print("Model Accuracy:", model_accuracy * 100, "%")
Line 4: We import the RandomForestClassifier model from the sklearn.ensemble module.
Line 14: We create an object of the RandomForestClassifier model.
Line 16: We train the model on the training data using the fit method.
Line 18: We evaluate the model by passing the test data to the score method, which returns the mean accuracy as a fraction between 0 and 1.
Line 20: We display the accuracy on the screen as a percentage.
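To see what the score method computes for a classifier, we can produce the predictions ourselves and compare them with the true labels using accuracy_score from sklearn.metrics. The sketch below uses synthetic data from make_classification instead of the heart disease file, and the sample count, feature count, and n_estimators value are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (a stand-in for the real data set).
x, y = make_classification(n_samples=300, n_features=8, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(x_train, y_train)

# Predict the class of every test row, then measure the fraction predicted correctly.
predictions = model.predict(x_test)
manual_accuracy = accuracy_score(y_test, predictions)

# For classifiers, score(x, y) is the accuracy of predict(x) against y.
print(manual_accuracy == model.score(x_test, y_test))
```

Having the predictions as an array is also useful when other metrics, such as a confusion matrix, are needed later.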
Congratulations! We have created a classification model using sklearn.
Now that we have trained our model, we will save it using Python's pickle module. The code for it is given below:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import pickle

heart_disease_df = pd.read_csv("heart-disease-dataset.csv")

x = heart_disease_df.drop("target", axis=1)
y = heart_disease_df["target"]

np.random.seed(0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

model = RandomForestClassifier()

model.fit(x_train, y_train)

model_accuracy = model.score(x_test, y_test)

print("Model Accuracy:", model_accuracy * 100, "%")

pickle.dump(model, open("heart-disease-model.pk1", "wb"))
Line 5: We import the pickle module.
Line 23: We save the model using the dump function provided by the pickle module. We pass it two arguments:
1st argument: The object/model that is to be saved.
2nd argument: The file object to write to. We create it with the open function, which takes the file name (heart-disease-model.pk1) and the mode for opening the file (wb, that is, write in binary mode).
Note: Use the ls command in the terminal to view the saved file.
Running the above code will save the model in a binary file that can be shared and used by loading it, as we will do in the next section.
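The one-line form above leaves the file object open until Python cleans it up. A common variant is to open the file in a with block so that it is closed as soon as the model has been written. The sketch below trains a small model on synthetic data and writes it to a temporary directory. The file name demo-model.pk1 and the data are placeholders for illustration.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model, standing in for the heart disease classifier.
x, y = make_classification(n_samples=200, n_features=6, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(x, y)

# Hypothetical output location inside a fresh temporary directory.
model_path = Path(tempfile.mkdtemp()) / "demo-model.pk1"

# The with block closes the file even if pickle.dump raises an error.
with open(model_path, "wb") as model_file:
    pickle.dump(model, model_file)

print(model_path.exists(), model_path.stat().st_size > 0)
```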
The code to load the saved model is given below:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import pickle

heart_disease_df = pd.read_csv("heart-disease-dataset.csv")

x = heart_disease_df.drop("target", axis=1)
y = heart_disease_df["target"]

np.random.seed(0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

model = RandomForestClassifier()

model.fit(x_train, y_train)

model_accuracy = model.score(x_test, y_test)

print("Model Accuracy:", model_accuracy * 100, "%")

pickle.dump(model, open("heart-disease-model.pk1", "wb"))

loaded_model = pickle.load(open("heart-disease-model.pk1", "rb"))

loaded_model_accuracy = loaded_model.score(x_test, y_test)

print("Loaded Model Accuracy:", loaded_model_accuracy * 100, "%")
Line 25: We use the load function from the pickle module to load our saved machine learning model. We pass it a file object created with the open function, which takes the name of the file that contains the saved model (heart-disease-model.pk1) and the mode for opening the file (rb, that is, read in binary mode).
Line 27: To confirm that the loaded model works, we pass the test data to its score method, which performs predictions and returns the accuracy score.
Line 29: We display the accuracy of the loaded model.
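Besides comparing accuracy scores, we can check that the loaded model returns exactly the same predictions as the original one. The sketch below does this on synthetic data. The file name demo-model.pk1 and the five sample rows are placeholders, not part of the heart disease project.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train and save a small synthetic model (stand-in for the real classifier).
x, y = make_classification(n_samples=200, n_features=6, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(x, y)

model_path = Path(tempfile.mkdtemp()) / "demo-model.pk1"  # hypothetical file name
with open(model_path, "wb") as model_file:
    pickle.dump(model, model_file)

# Load the model back; only unpickle files that come from a trusted source.
with open(model_path, "rb") as model_file:
    loaded_model = pickle.load(model_file)

# Use the first five rows as "new" samples and compare both models' predictions.
new_samples = x[:5]
same_predictions = np.array_equal(
    model.predict(new_samples), loaded_model.predict(new_samples)
)
print(same_predictions)
```

Two caveats apply when sharing such files. Unpickling can execute arbitrary code, so a pickle file should be loaded only if its source is trusted. A model pickled with one version of scikit-learn is also not guaranteed to load correctly with another, so it is safest to load it with the same library version that saved it.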
We have successfully loaded our saved model.
pickle is a useful module for saving a model and loading it again. Saving trained models as binary files makes it possible to share them between teams and systems without the need to train the model again.