Banking on Insights: Predicting Customer Churn

nganhahv99
Sep 24, 2023
Updated: Nov 16, 2023

Business Motivation

The main objective is to help financial institutions predict customer churn before it happens, so that they can take timely action to retain customers.

Project Overview

Employed classification model (logistic regression) to predict customer churn for a commercial bank
Deployed unsupervised learning (k-means clustering) to identify and understand customer segments
Preprocessed data by encoding categorical features and reducing dimensionality (PCA)
Performed exploratory data analysis (EDA) to understand correlation between variables and gather summary statistics

Main Insights & Recommendations

Account balance and active status are major predictors of churn
Each customer segment needs targeted retention strategies
High-value customer segment makes major contribution to profit

Resources Used

Software: SAS

Dataset Source: Kaggle

Table of Contents:

I. Data Preprocessing

II. Exploratory Data Analysis

III. Model Building

a. K-means Clustering

b. Logistic Regression

IV. Model Performance

I. Data Preprocessing

The dataset contains 10,000 rows and 13 variables, listed in the table below.

First, I encoded categorical variables (Geography, Gender) so that we can perform classification on the data.

/* Create 0/1 dummy variables for the categorical features */
data project.data;
 set project.data;
 if Gender = 'Female' then Gender_dummy = 1; else Gender_dummy = 0;
 if Geography = 'France' then France = 1; else France = 0;
 if Geography = 'Spain' then Spain = 1; else Spain = 0;
 if Geography = 'Germany' then Germany = 1; else Germany = 0;
run;

Second, I performed Principal Component Analysis (PCA) to reduce the dimensionality of the data and remove multicollinearity, so that later models such as K-means Clustering and Logistic Regression run more efficiently.

/* PCA */
/* &xvar is a macro variable holding the list of predictor variables */
proc princomp data=project.data out=project.pca;
 var &xvar;
run;

From the results above, the first eight principal components already explain about 80% of the variation in the data.
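If only those eight components are needed downstream, one option is to cap the number of components with the N= option of PROC PRINCOMP. This is a minimal sketch rather than the exact step used in the project; it assumes &xvar is already defined and reuses the project.pca output name from the step above.

```sas
/* Sketch: keep only the first eight principal components.       */
/* Assumes &xvar (the predictor list) was defined earlier.       */
proc princomp data=project.data out=project.pca n=8;
 var &xvar;
run;
/* project.pca now holds the original variables plus Prin1-Prin8 */
```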

II. Exploratory Data Analysis

Next, I gathered summary statistics for the variables, including means, standard deviations, and minimum and maximum values.

proc means data=project.data;
 var &xvar exited;
run;

In addition, I plotted a correlation matrix to detect multicollinearity in the dataset.
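The post does not show the code behind the matrix, so the following is only a sketch of how the underlying Pearson correlations could be computed with PROC CORR. It assumes &xvar holds the predictor list, as in the PROC MEANS step; the NOSIMPLE option just suppresses the descriptive statistics already produced there.

```sas
/* Sketch: pairwise Pearson correlations among predictors and the target. */
/* Assumes &xvar (the predictor list) was defined earlier.                */
proc corr data=project.data nosimple;
 var &xvar exited;
run;
```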

Here are some takeaways from the matrix:

- Positive correlation between customer's bank balance (Balance) and customers from Germany (Germany).

- Positive correlation between whether or not the customer will churn (Exited) and age of the customer (Age) i.e. older customers are more likely to close their accounts.

- Negative correlation between bank balance (Balance) and number of products the customer used (NumOfProducts).

III. Model Building

a. K-means Clustering

I standardized the 12 predictor variables and performed k-means clustering with k = 5.

Deeper dive into the 5 clusters:

- Cluster 1 (Switchables): the highest churn rate, at about 90%

- Cluster 2 (Loyalists): the lowest churn rate, at about 10%

- Clusters 3, 4 & 5 (Apathetics & Generalists): moderate churn rates (15% to 30%)
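The clustering code is not shown in the post, so here is a rough sketch of how the standardization, the k = 5 clustering, and the per-cluster churn rates could be reproduced. The dataset names work.std and work.clusters and the MAXITER value are assumptions made for illustration, and &xvar is assumed to hold the 12 predictors.

```sas
/* Sketch: standardize predictors to mean 0 and standard deviation 1. */
/* work.std and work.clusters are hypothetical dataset names.         */
proc stdize data=project.data out=work.std method=std;
 var &xvar;
run;

/* k-means with k = 5; the OUT= dataset gains a CLUSTER variable */
proc fastclus data=work.std maxclusters=5 maxiter=100 out=work.clusters;
 var &xvar;
run;

/* Churn rate per cluster = mean of the 0/1 Exited flag */
proc means data=work.clusters mean;
 class cluster;
 var exited;
run;
```

Cluster numbering from PROC FASTCLUS depends on the initial seeds, so the labels 1 to 5 in a rerun may not line up with the segment names above.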

b. Logistic Regression

I chose logistic regression as the classification model because of its interpretability, which makes it easier to generate actionable insights for stakeholders.

First, to check for overfitting and to evaluate model performance later, I split the data into training and testing sets at an 8:2 ratio.

/* Randomly split the data into two datasets with a sampling rate of 0.80 */
PROC SURVEYSELECT DATA=project.data OUT=project.split METHOD=SRS
SAMPRATE=0.80
OUTALL SEED=12345 NOPRINT;
RUN;

We will train the model on the training set and calculate its out-of-sample (OOS) performance on the testing set.
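As a hedged sketch of that workflow: the OUTALL option in the PROC SURVEYSELECT step adds a Selected flag to project.split, which can be used to separate the two sets before fitting. The dataset names work.train, work.test and work.test_scored are hypothetical, and the &model_vars list below is an illustrative subset of variables mentioned in this post, not the actual model specification.

```sas
/* Illustrative subset only; NOT the author's actual variable list */
%let model_vars = Age Balance NumOfProducts Gender_dummy Germany;

/* Selected = 1 marks sampled rows (training); 0 marks the rest (testing) */
data work.train work.test;
 set project.split;
 if Selected = 1 then output work.train;
 else output work.test;
run;

/* Fit on the training set, modeling the probability that Exited = 1 */
proc logistic data=work.train;
 model Exited(event='1') = &model_vars;
 /* Score the held-out set; FITSTAT prints fit statistics, including AUC */
 score data=work.test out=work.test_scored fitstat;
run;
```

Scoring the held-out rows inside the same PROC LOGISTIC call keeps the test set out of the estimation step, so the reported AUC is an out-of-sample figure.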

Model formulation includes 11 variables:

IV. Model Performance

Results for Logistic Regression model:

+) Log-likelihood = 6840

+) Out-of-sample AUC = 0.7526

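An AUC of about 0.75 indicates that the model discriminates reasonably well: a randomly chosen churner receives a higher predicted churn probability than a randomly chosen non-churner roughly 75% of the time.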