How to perform feature engineering for customer churn prediction

Key takeaways:

  • Customer churn prediction aims to identify customers likely to discontinue using a product or service by analyzing historical data to find patterns and behaviors that precede churn.

  • Feature engineering is crucial. Transforming raw data into meaningful features improves model performance and helps identify key attributes influencing customer churn.

  • Accurate churn prediction allows businesses to implement targeted retention strategies, such as personalized offers or improved customer service.

Initially, we might assume that predicting customer churn is as simple as applying machine learning algorithms to existing data. However, the real power lies in feature engineering: the process of transforming raw data into meaningful features that improve model performance. Feature engineering plays a pivotal role in identifying key attributes that influence customer churn, and it’s crucial not only to define which type of feature engineering is being applied but also to clarify the expected output.

Customer churn prediction is the task of identifying customers who are likely to stop using a product or service in the future. First, we look at past customer data to identify patterns and behaviors that often occur before a customer leaves (churns). Next, we pinpoint the specific actions or changes that signal a customer might be at risk of leaving. Finally, we use machine learning algorithms to predict which customers are likely to churn based on these identified patterns and behaviors.

The goal of customer churn prediction is to take proactive measures to retain customers before they churn, such as targeted marketing campaigns, personalized offers, or improved customer service. By accurately predicting customer churn, businesses can reduce customer attrition, increase customer satisfaction, and ultimately improve their bottom line.

Guide to performing feature engineering

Data preparation and feature engineering for customer churn prediction involve several steps. Here’s the step-by-step process:

Import libraries

We import essential libraries for data analysis and visualization in Python, including Pandas, NumPy, Seaborn, and Matplotlib, with inline plotting enabled.

# Import Library
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Importing required libraries

Load dataset

We load a CSV file named “churn.csv”, which includes the RowNumber, CustomerId, Surname, CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and Exited attributes across 10,000 rows. We display the first few rows of the DataFrame using the head() function.

df = pd.read_csv('churn.csv')
df.head()
Loading churn dataset

Here’s the output:

First few rows of churn dataset

Data preprocessing

In this step, we examine the summary statistics of the churn dataset. The describe() function reports the count, mean, standard deviation, minimum, quartiles, and maximum of each numerical column.

df.describe()
Show statistics of churn dataset

Here’s the output:

Statistics of churn dataset

After that, we count the missing values in the churn dataset. Counting missing values is important because it helps us understand the quality of our data and identify any gaps that might need to be addressed before analysis.

df.isna().sum()
Counting missing values

Here’s the output:

Missing values of churn dataset in each column
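
If the counts above had shown gaps, a simple illustrative fix (not part of this dataset’s workflow, since it typically contains no missing values) would be to fill numerical columns with the median and categorical columns with the mode:

# Illustrative only: impute missing values if any were present
df['Balance'] = df['Balance'].fillna(df['Balance'].median())          # numerical column
df['Geography'] = df['Geography'].fillna(df['Geography'].mode()[0])   # categorical column
Filling missing values (illustrative)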

Now, we’ll display the column names of the dataset.

df.columns
Displaying column names of dataset

Here’s the output:

Column names of the churn dataset

After counting missing values in the churn dataset, we show the box plots of numerical variables (such as credit score, age, tenure, and balance) by churn status.

# Box plots of numerical variables by churn status
column_names = ['CreditScore', 'Age', 'Tenure', 'Balance']
fig, ax = plt.subplots(2, 2, figsize=(15, 15))
# Populate each subplot with a boxplot
for i, subplot in zip(column_names, ax.flatten()):
    sns.boxplot(x='Exited', y=i, data=df, ax=subplot)
# Display the plot
plt.show()
Data preprocessing of churn dataset

The output of the above code represents the distribution of four numerical variables—CreditScore, Age, Tenure, and Balance—segmented by the Exited variable, which indicates whether a customer has left the service (Exited = 1) or not (Exited = 0). Here’s a breakdown of what each plot indicates:

  1. CreditScore vs. Exited: The box plot for CreditScore shows that customers who have not exited (Exited = 0, i.e., still active) tend to have higher credit scores (the median is higher), while those who have exited (Exited = 1, i.e., no longer using the service) seem to have lower credit scores. There are some outliers (points that fall outside the whiskers), especially for the exited group, which could indicate customers with very low or high credit scores.

CreditScore vs Exited
  2. Age vs. Exited: The Age box plot reveals that the median age for customers who did not exit is lower than that of those who exited. The Exited = 1 group shows a wider range of ages, including some older customers (indicated by higher outliers), suggesting that older customers are more likely to churn.

Age vs Exited
  3. Tenure vs. Exited: The Tenure box plot indicates that customers who did not exit tend to have a higher average tenure (the median is closer to 6), whereas the Exited group has a slightly lower median tenure. The range for the Exited group is also wider, with a few extreme values.

Tenure vs Exited
  4. Balance vs. Exited: For Balance, the customers who did not exit have a higher median balance compared to those who exited, suggesting that higher account balances might be associated with lower churn. The plot shows the presence of outliers in both groups, though more extreme values are found in the Exited group.

Balance vs Exited

Outlier removal

Outliers are data points that significantly differ from the rest and can negatively impact the performance of machine learning models. Removing outliers ensures the dataset is clean and produces accurate results. To detect outliers, we use statistical methods like percentiles, which divide data into 100 equal parts to show its spread. For example, the 25th percentile (lower quartile) marks where 25% of the data lies below, and the 75th percentile (upper quartile) marks where 75% lies below. Using np.percentile, we calculate these values and determine the interquartile range (iqr = quartile75 - quartile25), representing the “normal range” of data.

Outliers are identified as values that fall outside the thresholds:

  • Minimum = 25th percentile - 1.5 × IQR

  • Maximum = 75th percentile + 1.5 × IQR

Rows with values below the minimum or above the maximum are filtered out. After removing outliers, box plots of numerical variables by churn status are updated to reflect the cleaned dataset.
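
To build intuition for these thresholds before applying them to the full dataset, here’s a small standalone example on a toy array (the values are made up for illustration):

# Worked IQR example on a toy array
values = np.array([10, 12, 14, 15, 16, 18, 40])  # 40 is an obvious outlier
quartile75, quartile25 = np.percentile(values, [75, 25])
iqr = quartile75 - quartile25          # 17 - 13 = 4
minimum = quartile25 - (iqr * 1.5)     # 13 - 6 = 7
maximum = quartile75 + (iqr * 1.5)     # 17 + 6 = 23
print(minimum, maximum)                # 40 falls above 23, so it is an outlier
Worked IQR example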

# Removing outliers
for col in column_names:
    quartile75, quartile25 = np.percentile(df[col], [75, 25])
    iqr = quartile75 - quartile25
    min_val = quartile25 - (iqr * 1.5)
    max_val = quartile75 + (iqr * 1.5)
    df = df[df[col] < max_val]
    df = df[df[col] > min_val]
# Box plots after outlier removal
fig, ax = plt.subplots(2, 2, figsize=(15, 15))
for col, subplot in zip(column_names, ax.flatten()):
    sns.boxplot(x='Exited', y=col, data=df, ax=subplot)
plt.show()
Removing outliers and visualizing box plots
  1. CreditScore: The spread of credit scores is similar for both customers who exited (Exited = 1) and those who stayed (Exited = 0). No noticeable outliers are present after the removal step.

CreditScore after removing outliers
  2. Age: The age distribution shows that customers who exited tend to have a higher median age compared to those who stayed. The outliers seen previously in the Exited = 0 group are now removed, resulting in a cleaner dataset.

Age after removing outliers
  3. Tenure: The tenure variable displays a similar range for both groups, with a relatively uniform distribution. Outliers have been removed, ensuring a more compact representation.

Tenure after removing outliers
  4. Balance: Both groups show a similar distribution of balances, with the majority of values concentrated below a specific range. No extreme values remain after outlier removal.

Balance after removing outliers
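
To confirm the effect of the filtering, we can check the dataset’s dimensions after the loop; with 10,000 original rows, the difference is the number of records removed as outliers:

# Verify how many rows remain after outlier removal
print(df.shape)
Checking dataset size after outlier removal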

Encoding categorical variables

We encode the categorical variables Geography and Gender into numerical labels using LabelEncoder from scikit-learn so that machine learning models can process and analyze them effectively. We then display the first few rows of the DataFrame with the transformed columns and print the classes that were encoded for each column.

from sklearn.preprocessing import LabelEncoder

# Use a separate encoder per column so each one's classes_ can be inspected
le_geo = LabelEncoder()
le_gender = LabelEncoder()
df['Geography'] = le_geo.fit_transform(df['Geography'])
df['Gender'] = le_gender.fit_transform(df['Gender'])
print(le_geo.classes_, le_gender.classes_)
df.head()
Encoding categorical variables

Here, Gender 0 represents Female and 1 represents Male, while Geography is numerically encoded (e.g., 0 for France, 1 for Germany, 2 for Spain). This transformation makes categorical data usable for machine learning models.

Output after encoding categorical variables
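
Note that label encoding imposes an artificial ordering on Geography (France < Germany < Spain), which some models may misinterpret. A common alternative, sketched below on the original string column (i.e., applied instead of the LabelEncoder step), is one-hot encoding with pandas:

# Illustrative alternative: one-hot encode Geography before any label encoding
df_onehot = pd.get_dummies(df, columns=['Geography'], prefix='Geo')
df_onehot.head()  # yields binary columns such as Geo_France, Geo_Germany, Geo_Spain
One-hot encoding alternative (illustrative)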

Heatmap of the dataset

We create a figure with a heatmap of the correlation matrix of the DataFrame df. The heatmap helps identify which features are strongly correlated with each other or with the target variable. This insight is crucial for feature selection and understanding the data’s structure, guiding the model-building process.

plt.figure(figsize=(8, 8))
# numeric_only=True skips non-numeric columns such as Surname (required in pandas >= 2.0)
sns.heatmap(df.corr(numeric_only=True), cmap='Blues', annot=True)
plt.show()
Visualizing heatmap of the dataset

The heatmap of the churn dataset is shown below:

Heatmap of the churn dataset

The correlation heatmap reveals key insights into the relationships between features and the target variable, Exited. Age shows a moderate positive correlation with Exited, suggesting older customers are more likely to leave, while Gender and Geography have weak correlations with the target. The balance and number of products are moderately negatively correlated, indicating customers with fewer products tend to have higher balances. Additionally, HasCrCard has a slight negative correlation with Exited, implying credit card holders may be less likely to exit. These insights are valuable for feature selection and model building, highlighting which variables to prioritize based on their strength of correlation with the target.
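
To read these relationships numerically rather than visually, we can sort each feature’s correlation with the target (a small convenience snippet using the same DataFrame):

# Rank features by their correlation with the target variable
df.corr(numeric_only=True)['Exited'].sort_values(ascending=False)
Ranking feature correlations with Exited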

Countplot of categorical features

We create subplots to visualize the count of categorical features (Gender, HasCrCard, IsActiveMember) segmented by the Exited variable. By creating these plots, we can observe patterns or trends in the categorical features, such as how many male vs. female customers exited or how the status of having a credit card affects customer churn. This step helps identify any significant relationships between categorical features and the target variable, aiding in feature selection and understanding the factors influencing customer behavior.

fig, ax = plt.subplots(1, 3, figsize=(10, 6))
categorical_features = ['Gender', 'HasCrCard', 'IsActiveMember']
for col, subplot in zip(categorical_features, ax.flatten()):
    sns.countplot(x=col, hue="Exited", data=df, ax=subplot)
plt.show()
Visualizing countplots of categorical features

In the output bar plots:

  • The blue bars represent customers who did not exit (Exited = 0), and the orange bars represent customers who exited (Exited = 1).

  • Gender: In the first plot, we can observe that most of the customers are male (represented by 1 in the “Gender” column, per the encoding above), and relatively fewer women (represented by 0) left the service. The proportion of women who left is much smaller compared to men.

  • HasCrCard: In the second plot, most customers who have a credit card (1) are still with the service, but a significant number of customers without a credit card (0) left.

  • IsActiveMember: The third plot shows that active members (1) are less likely to leave, as seen from the higher blue bars for IsActiveMember = 1.

Countplots of categorical features
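
These visual impressions can be quantified with a quick group-by that computes the churn rate (the mean of Exited) within each category:

# Churn rate within each categorical group
for col in categorical_features:
    print(df.groupby(col)['Exited'].mean(), '\n')
Churn rate by categorical feature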


Frequently asked questions



How do you predict customer churn rate?

Customer churn rate is predicted by analyzing past customer behavior, transaction history, and engagement patterns. Techniques like logistic regression, decision trees, or machine learning models assess factors such as usage frequency, complaints, and demographic data to determine the likelihood of a customer leaving.
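
As a minimal sketch of this modeling step (assuming the feature engineering above has been applied and identifier columns are dropped), a logistic regression baseline might look like this:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Drop identifiers, split features and target, then fit a baseline model
X = df.drop(columns=['RowNumber', 'CustomerId', 'Surname', 'Exited'])
y = df['Exited']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # mean accuracy on held-out data
Baseline logistic regression sketch (illustrative)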


What is the best model for customer churn prediction?

There is no universally “best” model, but common effective models include logistic regression, random forests, and gradient boosting models. The choice depends on the dataset, business requirements, and the need for interpretability versus accuracy.


What algorithm is used in customer churn prediction?

Common algorithms for customer churn prediction include logistic regression, decision trees, random forests, and neural networks. These algorithms analyze key indicators of customer behavior and identify patterns associated with churn.

