Banking on Insights: Predicting Customer Churn
- nganhahv99
- Sep 24, 2023
Updated: Nov 16, 2023

Business Motivation
The main objective is to help financial institutions predict customer churn before it happens, so that they can take timely action to retain customers.
Project Overview
- Employed a classification model (logistic regression) to predict customer churn for a commercial bank
- Applied unsupervised learning (k-means clustering) to identify and understand customer segments
- Preprocessed data by encoding categorical features and reducing dimensionality (PCA)
- Performed exploratory data analysis (EDA) to understand correlations between variables and gather summary statistics

Main Insights & Recommendations
- Account balance and active status are major predictors of churn 
- Each customer segment needs targeted retention strategies 
- The high-value customer segment makes a major contribution to profit
Resources Used
Software: SAS
Dataset Source: Kaggle
Table of Contents:
I. Data Preprocessing
II. Exploratory Data Analysis
III. Model Building
a. K-means Clustering
b. Logistic Regression
IV. Model Performance
I. Data Preprocessing
The dataset contains 10,000 rows and 13 variables, listed in the table below.

First, I encoded categorical variables (Geography, Gender) so that we can perform classification on the data.
data project.data;
  set project.data;
  /* Gender: 1 = Female, 0 = otherwise */
  if Gender = 'Female' then Gender_dummy = 1; else Gender_dummy = 0;
  /* Geography: one 0/1 indicator per country */
  if Geography = 'France' then France = 1; else France = 0;
  if Geography = 'Spain' then Spain = 1; else Spain = 0;
  if Geography = 'Germany' then Germany = 1; else Germany = 0;
run;
Second, I performed Principal Component Analysis (PCA) to reduce the dimensionality of the data and remove multicollinearity, so that later models such as k-means clustering and logistic regression run more efficiently.
/* PCA; &xvar is a macro variable assumed to hold the list of numeric predictors */
proc princomp data=project.data out=project.pca;
  var &xvar;
run;
From the results above, the first eight principal components already explain about 80% of the variation in the data.
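If only those leading components are wanted downstream, PROC PRINCOMP can be asked to compute just the first eight. This is a minimal sketch, not a step described in this post: the output dataset name `project.pca8` is hypothetical, and `&xvar` is assumed to be defined as in the step above.

```sas
/* Sketch: compute only the first eight principal components.
   The output dataset gains the score variables Prin1-Prin8.
   project.pca8 is a hypothetical name used for illustration. */
proc princomp data=project.data out=project.pca8 n=8 noprint;
  var &xvar;
run;
```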
II. Exploratory Data Analysis
Next, I gathered summary statistics for the variables, including means, standard deviations, and minimum and maximum values.
proc means data=project.data;
 var &xvar exited;
run;
In addition, I plotted a correlation matrix to detect multicollinearity in the dataset.

Here are some takeaways from the matrix:
- Positive correlation between a customer's bank balance (Balance) and being located in Germany (Germany).
- Positive correlation between whether or not the customer will churn (Exited) and age of the customer (Age) i.e. older customers are more likely to close their accounts.
- Negative correlation between bank balance (Balance) and number of products the customer used (NumOfProducts).
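The correlations behind takeaways like these can be computed with PROC CORR. The sketch below is an assumption about how such a matrix might be produced, not the code used for the plot in this post; it reuses `&xvar` and `exited` from the summary-statistics step.

```sas
/* Sketch: Pearson correlations among the predictors and the churn flag.
   NOSIMPLE suppresses the descriptive-statistics table, since those
   statistics were already gathered with PROC MEANS. */
proc corr data=project.data pearson nosimple;
  var &xvar exited;
run;
```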
III. Model Building
a. K-means Clustering
I standardized the 12 predictor variables and performed k-means clustering with k = 5.
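One way to carry out this step in SAS is PROC STDIZE followed by PROC FASTCLUS. The sketch below rests on assumptions: `&xvar` is taken to hold the 12 predictors, the dataset names `project.std` and `project.clusters` are hypothetical, and the iteration cap is an arbitrary choice.

```sas
/* Sketch: standardize each predictor to mean 0 and standard deviation 1,
   so that variables on large scales do not dominate the distances. */
proc stdize data=project.data out=project.std method=std;
  var &xvar;
run;

/* Sketch: k-means with k = 5. The OUT= dataset adds a CLUSTER variable
   giving each customer's assigned cluster. MAXITER=100 is an assumed value. */
proc fastclus data=project.std out=project.clusters maxclusters=5 maxiter=100;
  var &xvar;
run;
```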

Deeper dive into the 5 clusters:


- Cluster 1 (Switchables): highest churn rate (~90%)
- Cluster 2 (Loyalists): lowest churn rate (~10%)
- Clusters 3, 4 & 5 (Apathetics & Generalists): moderate churn rates (15%-30%)
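Per-cluster churn rates like those above can be read off as the mean of the churn flag within each cluster. This is a minimal sketch assuming `exited` is coded 0/1 and that a dataset with a `cluster` variable exists (for example the OUT= dataset of PROC FASTCLUS); the name `project.clusters` is hypothetical.

```sas
/* Sketch: the mean of a 0/1 churn flag within a cluster is that
   cluster's churn rate; N gives the cluster size.
   project.clusters is a hypothetical dataset holding a CLUSTER variable. */
proc means data=project.clusters n mean;
  class cluster;
  var exited;
run;
```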
b. Logistic Regression
I chose logistic regression as the classification model for its interpretability, which makes it easier to generate actionable insights for stakeholders.
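That interpretability comes from the model's form. As a reminder of the standard definition, with $x_1, \dots, x_p$ standing for whichever predictors enter the model (a generic placeholder, not the specific formulation used in this project):

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p,
\qquad p = \Pr(\text{Exited} = 1 \mid x_1, \dots, x_p)
```

Each coefficient can be read directly: holding the other predictors fixed, a one-unit increase in $x_j$ multiplies the odds of churn by $e^{\beta_j}$.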
First, to guard against overfitting and to allow model performance to be evaluated later, I split the data into training and testing sets at an 8:2 ratio.
/* Randomly split the data into two sets with a sampling rate of 0.80.
   OUTALL keeps every row and adds a Selected flag (1 = sampled, 0 = not). */
PROC SURVEYSELECT DATA=project.data OUT=project.split METHOD=SRS
  SAMPRATE=0.80
  OUTALL SEED=12345 NOPRINT;
RUN;
The model is trained on the training set, and its out-of-sample (OOS) performance is calculated on the testing set.
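The train-then-score workflow can be sketched with PROC LOGISTIC and its SCORE statement. This is not the author's model: the predictor list `demo_vars` is a hypothetical, illustrative subset of variables named in this post, `project.scored` is a hypothetical output name, and `exited` is assumed to be coded 0/1. Training rows are assumed to be those with `Selected = 1` in the OUTALL output of PROC SURVEYSELECT.

```sas
/* Sketch only: illustrative predictor subset, not the post's formulation. */
%let demo_vars = Age Balance NumOfProducts Gender_dummy Germany Spain;

/* Fit on the training rows (Selected = 1), modeling the probability
   that exited = 1, then score the held-out rows (Selected = 0).
   FITSTAT prints fit statistics for the scored data, which in recent
   SAS/STAT releases include the AUC. */
proc logistic data=project.split(where=(Selected=1));
  model exited(event='1') = &demo_vars;
  score data=project.split(where=(Selected=0))
        out=project.scored fitstat;
run;
```

Scoring the held-out rows inside the same PROC LOGISTIC call keeps the fitted coefficients and the evaluation in one step, so the test rows never influence the fit.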
Model formulation includes 11 variables:

IV. Model Performance
Results for Logistic Regression model:
+) -2 Log L = 6840

+) Out-of-sample AUC = 0.7526

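An AUC of about 0.75 indicates reasonably good discrimination: the model ranks a randomly chosen churner above a randomly chosen non-churner roughly 75% of the time.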