Logistic regression is a supervised machine learning algorithm used to predict a binary or multi-class outcome. You can read more about logistic regression in this Answer.
We’ll use logistic regression to predict the survival class of passengers in the titanic dataset and showcase its implementation in Julia.
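To build intuition before we code, here is a minimal sketch of the logistic (sigmoid) function that logistic regression applies to a linear predictor to turn it into a probability; the sigmoid helper below is our own illustration, not part of any package used later.
sigmoid(z) = 1 / (1 + exp(-z))   # maps any real-valued linear predictor z into (0, 1)
sigmoid(0.0)   # 0.5, the usual decision boundary
sigmoid(2.5)   # ≈ 0.92, strongly in favor of the positive class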
import Pkg
Pkg.add("DataFrames")
Pkg.add("RDatasets")
Pkg.add("Lathe")
Pkg.add("GLM")
Pkg.add("StatsBase")
Pkg.add("MLBase")
println("Packages successfully imported")
The DataFrames library will be used to read data into a DataFrame, while RDatasets contains the titanic dataset we will use. The Lathe library will be used to split our dataset into training and testing sets. We’ll use the GLM library to create and train our model and obtain predictions on the test set for evaluation. Lastly, the StatsBase and MLBase libraries are used to obtain the model’s accuracy and the confusion matrix of our prediction classes.
This dataset is from the COUNT category in RDatasets.
using RDatasets, DataFrames
# Load the titanic dataset from the COUNT category and preview the first five rows
data = dataset("COUNT", "titanic")
first(data, 5)
We use describe to check the summary of the dataset:
describe(data)
There are no missing values, and all our data is of integer type.
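If you want to verify the missing-value claim programmatically, one optional check (our addition, using DataFrames) counts the missings in every column:
using DataFrames
# Count missing values per column; all zeros confirms the dataset is complete
mapcols(col -> count(ismissing, col), data)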
Next, we check the Survived column for class imbalance:
using StatsBase
countmap(data.Survived)
Our dataset has a class imbalance, with more observations in class 0 of the Survived column, the non-survivors.
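To quantify the imbalance rather than compare raw counts, an optional extra step is to look at class proportions with StatsBase's proportionmap:
# Share of passengers in each Survived class
proportionmap(data.Survived)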
using Lathe.preprocess: TrainTestSplit
# Split the data into 80% training and 20% testing sets
train, test = TrainTestSplit(data, .80);
The ; at the end is used to suppress the output.
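As a quick sanity check (not part of the original walkthrough), you can confirm that roughly 80% of the rows landed in the training set:
using DataFrames
# Row counts of the two splits; expect roughly an 80/20 ratio
nrow(train), nrow(test)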
using GLM
# Model Survived as a function of Age, Sex, and Class
fm = @formula(Survived ~ Age + Sex + Class)
# Fit a binomial GLM with a probit link; LogitLink() would give the standard logistic regression
logit = glm(fm, train, Binomial(), ProbitLink())
This shows the coefficients for the independent variables as well as the intercept. We now have our trained model and its equation.
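If you prefer to pull these numbers out of the model programmatically rather than read them off the printed summary, GLM re-exports coef and coeftable; this is a supplementary sketch:
coef(logit)        # intercept followed by the slope of each predictor
coeftable(logit)   # estimates with standard errors, z-values, and p-values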
Predict the target variable on the test data:
prediction = predict(logit, test);
Convert the probability scores to classes using a 0.5 threshold.
prediction_class = [x < 0.5 ? 0 : 1 for x in prediction];
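As an optional check, countmap (from StatsBase, loaded earlier) shows how the 0.5 threshold distributes the predicted classes:
# Counts of predicted 0s and 1s under the 0.5 threshold
countmap(prediction_class)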
Put everything in a data frame for ease of comparison.
prediction_df = DataFrame(y_actual = test.Survived, y_predicted = prediction_class, prob_predicted = prediction);
Check correct predictions by comparing them to actual values.
prediction_df.correctly_classified = prediction_df.y_actual .== prediction_df.y_predicted
using StatsBase
accuracy = mean(prediction_df.correctly_classified) * 100
println("Accuracy is: ", accuracy)
We obtain an accuracy score of 75.36%.
Obtaining the confusion matrix:
using MLBase
# roc returns an ROCNums object holding the counts of true/false positives and negatives
confusion_matrix = MLBase.roc(prediction_df.y_actual, prediction_df.y_predicted)
Due to class imbalance, our model predicts the majority class better.
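To see this concretely, the per-class rates that MLBase defines on the object returned by roc make the gap explicit; this is a supplementary check rather than part of the original lesson:
# With imbalanced classes, overall accuracy can hide weak minority-class performance
true_positive_rate(confusion_matrix)   # fraction of survivors (class 1) correctly identified
true_negative_rate(confusion_matrix)   # fraction of non-survivors (class 0) correctly identified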