1 Introduction
The goal is to create a model capable of estimating the creditworthiness of customers, in order to help the dedicated team decide whether or not to accept a credit card application.
1.1 Features
- ID: customer identification number
- CODE_GENDER: gender of the customer
- FLAG_OWN_CAR: indicator of car ownership
- FLAG_OWN_REALTY: indicator of house ownership
- CNT_CHILDREN: number of children
- AMT_INCOME_TOTAL: annual income
- NAME_INCOME_TYPE: type of income
- NAME_EDUCATION_TYPE: level of education
- NAME_FAMILY_STATUS: marital status
- NAME_HOUSING_TYPE: type of dwelling
- DAYS_BIRTH: number of days since birth
- DAYS_EMPLOYED: number of days since employment (if positive, indicates the number of days since being unemployed)
- FLAG_MOBIL: indicator for the presence of a mobile phone number
- FLAG_WORK_PHONE: indicator of the presence of a work phone number
- FLAG_PHONE: indicator of the presence of a telephone number
- FLAG_EMAIL: indicator for the presence of an email address
- OCCUPATION_TYPE: type of occupation
- CNT_FAM_MEMBERS: number of family members
- TARGET: variable equal to 1 if the customer has high creditworthiness (consistent payment of installments), 0 otherwise.
If a customer is denied a credit card, the team must be able to give a reason. This means that your model must provide indications that are easy to interpret.
It is a binary classification task that requires good explainability.
2 Import
2.1 R
2.2 Python
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import time
import os
from datetime import date
from dateutil.relativedelta import relativedelta
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, MinMaxScaler, LabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from xgboost import DMatrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
2.3 Config class
Code
class Config():
    def __init__(self):
        """
        Initialize configuration settings for credit scoring model pipeline.
        Sets up file paths, random state, and feature column classifications.
        """
        # Base directory and file paths
        self.path = "/Users/simonebrazzi/R/blog/posts/credit_score/"
        self.file = "/Users/simonebrazzi/R/blog/posts/credit_score/credit_scoring.csv"
        self.pkl_path = "~/R/blog/posts/credit_score/pkl/"
        # Set random state for reproducibility
        self.random_state = 42
        # Define feature columns by type
        self.col_binary = ['code_gender', 'flag_own_car', 'flag_own_realty']
        self.col_ordinal = ["name_income_type", "name_education_type", "name_family_status", "name_housing_type", "occupation_type"]
        self.col_numeric = ['cnt_children', 'amt_income_total', 'cnt_fam_members']
        self.features = self.col_binary + self.col_ordinal + self.col_numeric

    def dump_pkl(self, path, model):
        """
        Serialize model to pickle file.
        Args:
            path: Output file path
            model: Model to serialize
        """
        with open(path, "wb") as f:
            pickle.dump(model, f)

    def load_pkl(self, path):
        """
        Load model from pickle file.
        Args:
            path: Input pickle file path
        Returns:
            Deserialized model
        """
        with open(path, "rb") as f:
            model = pickle.load(f)
        return model

    def make_dtc(self):
        """
        Create Decision Tree Classifier pipeline with preprocessing and grid search.
        Returns:
            GridSearchCV object configured for DTC
        """
        # Initialize encoders and scaler
        ohe = OneHotEncoder(drop='if_binary', sparse_output=False)
        oe = OrdinalEncoder().set_output(transform='pandas')
        ss = StandardScaler()
        # Create balanced Decision Tree
        dtc = DecisionTreeClassifier(class_weight="balanced", random_state=config.random_state)
        # Set up preprocessing pipeline
        preprocessor = ColumnTransformer(
            transformers=[
                ("binary_encoder", ohe, config.col_binary),
                ("ordinal_encoder", oe, config.col_ordinal),
                ("standard_scaler", ss, config.col_numeric)
            ],
            remainder='passthrough'
        )
        # Create full pipeline
        pipe_dtc = Pipeline(
            steps=[
                ("preprocessor", preprocessor),
                ("dtc", dtc)
            ]
        )
        # Define hyperparameter search space
        param_grid_dtc = {
            "dtc__criterion": ["gini", "entropy", "log_loss"],
            "dtc__splitter": ["best", "random"],
            "dtc__max_depth": [1, 2, 5, 10, 15]
        }
        # Set up multiple scoring metrics
        scoring = {
            "accuracy": "accuracy",
            "f1": "f1",
            "roc_auc": "roc_auc"
        }
        # Configure grid search
        grid_search_dtc = GridSearchCV(
            estimator=pipe_dtc,
            param_grid=param_grid_dtc,
            cv=7,
            scoring=scoring,
            refit="f1",
            n_jobs=-1
        )
        return grid_search_dtc

    def make_rfc(self):
        """
        Create Random Forest Classifier pipeline with preprocessing and grid search.
        Returns:
            GridSearchCV object configured for RFC
        """
        # Initialize preprocessing components
        ohe = OneHotEncoder(drop='if_binary', sparse_output=False)
        oe = OrdinalEncoder().set_output(transform='pandas')
        ss = StandardScaler()
        # Create balanced Random Forest
        rfc = RandomForestClassifier(class_weight="balanced", random_state=config.random_state)
        # Set up preprocessing pipeline
        preprocessor = ColumnTransformer(
            transformers=[
                ("binary_encoder", ohe, config.col_binary),
                ("ordinal_encoder", oe, config.col_ordinal),
                ("standard_scaler", ss, config.col_numeric)
            ],
            remainder='passthrough'
        )
        # Create full pipeline
        pipe_rfc = Pipeline(
            steps=[
                ("preprocessor", preprocessor),
                ("rfc", rfc)
            ]
        )
        # Define parameter search space
        param_dist_rfc = {
            'rfc__n_estimators': [50, 75, 100],
            'rfc__max_depth': [5, 10, 15],
            'rfc__min_samples_split': [5, 10]
        }
        # Set up scoring metrics
        scoring = {
            "accuracy": "accuracy",
            "f1": "f1",
            "roc_auc": "roc_auc"
        }
        # Configure grid search
        grid_search_rfc = GridSearchCV(
            estimator=pipe_rfc,
            param_grid=param_dist_rfc,
            cv=7,
            scoring=scoring,
            refit="f1",
            n_jobs=-1
        )
        return grid_search_rfc

    def make_xgb(self):
        """
        Create XGBoost Classifier pipeline with preprocessing and randomized search.
        Returns:
            RandomizedSearchCV object configured for XGBoost
        """
        # Initialize preprocessing components
        ohe = OneHotEncoder(drop='if_binary', sparse_output=False)
        oe = OrdinalEncoder().set_output(transform='pandas')
        ss = StandardScaler()
        # Set up preprocessing pipeline
        preprocessor = ColumnTransformer(
            transformers=[
                ("binary_encoder", ohe, config.col_binary),
                ("ordinal_encoder", oe, config.col_ordinal),
                ("standard_scaler", ss, config.col_numeric)
            ],
            remainder='passthrough'
        )
        # Calculate class weight from the target vector
        scale_pos_weight = len(y[y == 0]) / len(y[y == 1])
        # Create XGBoost classifier
        xgb = XGBClassifier(
            n_jobs=-1,
            enable_categorical=True,
            scale_pos_weight=scale_pos_weight,
            random_state=config.random_state
        )
        # Create full pipeline
        pipe_xgb = Pipeline(
            steps=[
                ("preprocessor", preprocessor),
                ("xgb", xgb)
            ]
        )
        # Define comprehensive parameter search space
        param_dist_xgb = {
            "xgb__n_estimators": [100, 150, 200, 300],
            "xgb__max_depth": [3, 5, 7, 10],
            "xgb__learning_rate": [0.1, 0.01, 0.001, 0.0001],
            "xgb__subsample": [0.7, 0.8, 0.9],
            "xgb__colsample_bytree": [0.7, 0.8, 0.9],
            "xgb__gamma": [0, 0.1],
            "xgb__alpha": [0, 0.1],
            "xgb__lambda": [1, 2]
        }
        # Set up scoring metrics
        scoring = {
            "accuracy": "accuracy",
            "f1": "f1",
            "roc_auc": "roc_auc"
        }
        # Configure randomized search
        random_search_xgb = RandomizedSearchCV(
            estimator=pipe_xgb,
            param_distributions=param_dist_xgb,
            n_iter=30,
            cv=7,
            scoring=scoring,
            refit="f1",
            n_jobs=-1
        )
        return random_search_xgb

# Create global config instance
config = Config()
3 Dataset
Now we can import the dataset. Since all the column names are UPPERCASE, I will convert them to lowercase.
Code
df = pd.read_csv(config.file)
df.columns = df.columns.str.lower()
# df["birthday"] = df.days_birth.map(lambda x : (date.today() + relativedelta(days=x)).strftime("%Y-%m-%d"))
# df["employed"] = df["days_employed"].apply(lambda x: (date.today() + relativedelta(days=x)).strftime("%Y-%m-%d") if x < 0 else "0")
4 EDA
4.1 Target
It is always important to perform an EDA. In this case I expect to see fewer cases of high creditworthiness than low: in my personal knowledge and experience, high creditworthiness clients are not the majority. Also, since this is a binary classification, it is important to check whether the data are balanced, to make a better model selection.
The target variable:
- label 0 has 308,705 observations.
- label 1 has 29,722 observations.
Code
library(reticulate)
df <- py$df
df_g <- df %>%
rename(label = target) %>%
mutate(label = as.factor(label)) %>%
summarize(
freq_abs = n(),
freq_rel = n() / nrow(df),
.by = label
)
df_g %>%
gt() %>%
fmt_auto() %>%
cols_width(
label ~ pct(20),
freq_abs ~ pct(35),
freq_rel ~ pct(35)
) %>%
cols_align(
align = "center",
columns = c(freq_abs, freq_rel)
) %>%
tab_header(
title = "Label frequency",
subtitle = "Absolute and relative frequencies"
) %>%
cols_label(
label = "Label",
freq_abs = "Absolute frequency",
freq_rel = "Relative frequency"
) %>%
tab_options(
table.width = pct(100)
)
Label frequency: absolute and relative frequencies

| Label | Absolute frequency | Relative frequency |
|---|---|---|
| 0 | 308,705 | 0.912 |
| 1 | 29,722 | 0.088 |
The dataset is unbalanced as expected: most of the targets are low creditworthiness (label 0). This is important for the training: with an unbalanced dataset, naive metrics become misleading. If 9 out of 10 observations are 0 and you always predict 0, the model has an accuracy of 0.9!
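To make this concrete, here is a small illustrative check (not part of the original notebook) of the trivial baseline that always predicts 0:
Code
from sklearn.metrics import accuracy_score, f1_score
# trivial baseline: always predict the majority class (0)
y_true = df["target"]
y_trivial = np.zeros(len(y_true))
print(accuracy_score(y_true, y_trivial))  # ~0.91, driven only by the class imbalance
print(f1_score(y_true, y_trivial))        # 0.0: not a single positive case is found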
Code
id code_gender flag_own_car flag_own_realty
Min. :5008804 Length:338427 Length:338427 Length:338427
1st Qu.:5439602 Class :character Class :character Class :character
Median :5878907 Mode :character Mode :character Mode :character
Mean :5821200
3rd Qu.:6140206
Max. :6841875
cnt_children amt_income_total name_income_type name_education_type
Min. : 0.0000 Min. : 26100 Length:338427 Length:338427
1st Qu.: 0.0000 1st Qu.: 121500 Class :character Class :character
Median : 0.0000 Median : 162000 Mode :character Mode :character
Mean : 0.4289 Mean : 187654
3rd Qu.: 1.0000 3rd Qu.: 225000
Max. :19.0000 Max. :6750000
name_family_status name_housing_type days_birth days_employed
Length:338427 Length:338427 Min. :-25201 Min. :-17531
Class :character Class :character 1st Qu.:-19482 1st Qu.: -3116
Mode :character Mode :character Median :-15622 Median : -1485
Mean :-15998 Mean : 60238
3rd Qu.:-12524 3rd Qu.: -380
Max. : -7489 Max. :365243
NA's :1 NA's :1
flag_mobil flag_work_phone flag_phone flag_email
Min. :1 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:1 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :1 Median :0.0000 Median :0.0000 Median :0.0000
Mean :1 Mean :0.2114 Mean :0.2933 Mean :0.1052
3rd Qu.:1 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
Max. :1 Max. :1.0000 Max. :1.0000 Max. :1.0000
NA's :1 NA's :1 NA's :1 NA's :1
occupation_type cnt_fam_members target
Length:338427 Min. : 1.000 Min. :0.00000
Class :character 1st Qu.: 2.000 1st Qu.:0.00000
Mode :character Median : 2.000 Median :0.00000
Mean : 2.197 Mean :0.08782
3rd Qu.: 3.000 3rd Qu.:0.00000
Max. :20.000 Max. :1.00000
NA's :1
Code
- 0 stands for low creditworthiness.
- 1 stands for high creditworthiness.
- Our dataset is strongly unbalanced towards the 0 label.
4.2 Features
Let's check the NA values in each feature.
Code
df_na <- df %>%
summarize(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(
cols = everything(),
names_to = "columns",
values_to = "value"
)
df_na %>%
gt() %>%
tab_options(
table.width = pct(100)
) %>%
cols_align(align = "center") %>%
cols_label(
columns = "Columns",
value = "NA Value"
) %>%
tab_header(
title = "NA values",
subtitle = "Number of NA value for each feature"
) %>%
fmt_number(
columns = c("value"),
decimals = 2,
drop_trailing_zeros = TRUE,
#drop_trailing_dec_mark = FALSE
)
NA values: number of NA values for each feature

| Columns | NA Value |
|---|---|
| id | 0 |
| code_gender | 0 |
| flag_own_car | 0 |
| flag_own_realty | 0 |
| cnt_children | 0 |
| amt_income_total | 0 |
| name_income_type | 0 |
| name_education_type | 0 |
| name_family_status | 1 |
| name_housing_type | 1 |
| days_birth | 1 |
| days_employed | 1 |
| flag_mobil | 1 |
| flag_work_phone | 1 |
| flag_phone | 1 |
| flag_email | 1 |
| occupation_type | 103,342 |
| cnt_fam_members | 1 |
| target | 0 |
There are two interesting things to notice.
First, the column occupation_type has lots of NA values. How many are they in proportion to the total? Should we drop the NAs or keep them?
Code
df %>%
summarize(
freq_abs = sum(is.na(occupation_type)),
freq_rel = sum(is.na(occupation_type)) / nrow(df)
) %>%
gt() %>%
tab_options(
table.width = pct(100)
) %>%
cols_align(align = "center") %>%
cols_label(
freq_abs = "Absolute Frequency",
freq_rel = "Relative Frequency"
) %>%
tab_header(
title = "Absolute and Relative Frequency",
subtitle = "NA frequency for occupation_type"
) %>%
fmt_number(
columns = c("freq_abs", "freq_rel"),
decimals = 2,
drop_trailing_zeros = TRUE
)
NA frequency for occupation_type

| Absolute Frequency | Relative Frequency |
|---|---|
| 103,342 | 0.31 |
30.54% of occupation_type values are NA. This could be a big concern: how should we handle it? Dropping this many rows could mean losing lots of useful information, while keeping them could lead to imputing wrong data, which could be worse.
Let's check the frequency of the occupation_type labels.
Code
df %>%
summarise(
n = n(),
.by = occupation_type
) %>%
arrange(desc(n)) %>%
gt() %>%
tab_options(
table.width = pct(100)
) %>%
cols_align(align = "center") %>%
cols_label(
occupation_type = "Occupation Type",
n = "Frequency"
) %>%
tab_header(
title = "Occupation Type Label Frequency"
) %>%
fmt_number(
columns = c("n"),
decimals = 2,
drop_trailing_zeros = TRUE
)
Occupation Type Label Frequency

| Occupation Type | Frequency |
|---|---|
| NA | 103,342 |
| Laborers | 60,146 |
| Core staff | 33,527 |
| Sales staff | 31,652 |
| Managers | 27,384 |
| Drivers | 20,020 |
| High skill tech staff | 13,399 |
| Accountants | 12,281 |
| Medicine staff | 10,438 |
| Cooking staff | 6,248 |
| Security staff | 6,218 |
| Cleaning staff | 4,594 |
| Private service staff | 2,787 |
| Low-skill Laborers | 1,714 |
| Secretaries | 1,577 |
| Waiters/barmen staff | 1,245 |
| Realty agents | 852 |
| HR staff | 567 |
| IT staff | 436 |
So here is the situation:
- we could impute the NAs respecting the proportions of the other labels in the dataset, or
- we could drop the NA values.
I choose to drop the NAs. Even though we lose about a third of the data, we avoid making assumptions based on the mode. occupation_type could be a very important feature, and imputing it is worse than losing some data. Also, we still have 235,085 observations, which is plenty.
Code
df_occ_type <- df %>%
filter(!is.na(occupation_type)) %>%
summarise(
n = n(),
.by = occupation_type
) %>%
arrange(desc(n))
# to extend the palette to the number of labels
extended_palette <- colorRampPalette(RColorBrewer::brewer.pal(9, "Purples"))(df_occ_type %>% nrow())
g <- ggplot(df_occ_type) +
geom_col(aes(x = n, y = reorder(occupation_type, n), fill = reorder(occupation_type, n))) +
theme_minimal() +
labs(
x = "Count",
y = "Labels",
title = "Occupation Type Label Frequency"
) +
scale_fill_manual(values = extended_palette)
ggplotly(g)
Second, there is an NA value in almost every column: is it possible that it is the same observation?
First of all, we can create a list of the column names that have only one NA, using the df_na tibble.
[1] "name_family_status" "name_housing_type" "days_birth"
[4] "days_employed" "flag_mobil" "flag_work_phone"
[7] "flag_phone" "flag_email" "cnt_fam_members"
Then, to check multiple OR conditions, we can use if_any() in R. In this case, we are checking whether any of those columns is NA, restricted to the columns that have exactly one NA each.
Code
id | code_gender | flag_own_car | flag_own_realty | cnt_children | amt_income_total | name_income_type | name_education_type | name_family_status | name_housing_type | days_birth | days_employed | flag_mobil | flag_work_phone | flag_phone | flag_email | occupation_type | cnt_fam_members | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6392180 | F | N | N | 0 | 67500 | Working | Secondary / se | NA | NA | NaN | NaN | NaN | NaN | NaN | NaN | NA | NaN | 0 |
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
# Was:
data %>% select(one_cols)
# Now:
data %>% select(all_of(one_cols))
See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
4.3 Interesting features
Let's also look at some interesting variables: click on the different tabs to see the frequency table for each of them.
A future upgrade will convert this tabset to a Shiny app. I am currently encountering some difficulties with shinylive: it is not showing the plot.
Code
[1] "code_gender" "flag_own_car" "flag_own_realty"
[4] "name_income_type" "name_education_type" "name_family_status"
[7] "name_housing_type" "occupation_type" "cnt_children"
[10] "amt_income_total" "cnt_fam_members"
Code
count_results$cnt_children %>%
arrange(desc(n)) %>%
gt() %>%
tab_options(
table.width = pct(100)
) %>%
cols_align(align = "center") %>%
cols_label(
cnt_children = "# Children",
n = "Count"
) %>%
tab_header(
title = "Count of Number of Children",
) %>%
fmt_number(
columns = c("n"),
decimals = 2,
drop_trailing_zeros = TRUE,
#drop_trailing_dec_mark = FALSE
)
Count of Number of Children

| # Children | Count |
|---|---|
| 0 | 148,909 |
| 1 | 56,238 |
| 2 | 26,196 |
| 3 | 3,319 |
| 4 | 300 |
| 5 | 109 |
| 9 | 4 |
| 12 | 4 |
| 14 | 3 |
| 7 | 2 |
| 19 | 1 |
Code
count_results$name_education_type %>%
arrange(desc(n)) %>%
gt() %>%
tab_options(
table.width = pct(100)
) %>%
cols_align(align = "center") %>%
cols_label(
name_education_type = "Education Type",
n = "Count"
) %>%
tab_header(
title = "Count by Education Type",
) %>%
fmt_number(
columns = c("n"),
decimals = 2,
drop_trailing_zeros = TRUE,
#drop_trailing_dec_mark = FALSE
)
Count by Education Type

| Education Type | Count |
|---|---|
| Secondary / secondary special | 157,822 |
| Higher education | 66,693 |
| Incomplete higher | 8,799 |
| Lower secondary | 1,609 |
| Academic degree | 162 |
Code
count_results$name_family_status %>%
arrange(desc(n)) %>%
gt() %>%
tab_options(
table.width = pct(100)
) %>%
cols_align(align = "center") %>%
cols_label(
name_family_status = "Family Status",
n = "Count"
) %>%
tab_header(
title = "Count by Family Status",
) %>%
fmt_number(
columns = c("n"),
decimals = 2,
drop_trailing_zeros = TRUE,
#drop_trailing_dec_mark = FALSE
)
Count by Family Status

| Family Status | Count |
|---|---|
| Married | 164,310 |
| Single / not married | 30,551 |
| Civil marriage | 20,968 |
| Separated | 14,108 |
| Widow | 5,148 |
Code
count_results$name_housing_type %>%
arrange(desc(n)) %>%
gt() %>%
tab_options(
table.width = pct(100)
) %>%
cols_align(align = "center") %>%
cols_label(
name_housing_type = "House Type",
n = "Count"
) %>%
tab_header(
title = "Count by House Type",
) %>%
fmt_number(
columns = c("n"),
decimals = 2,
drop_trailing_zeros = TRUE,
#drop_trailing_dec_mark = FALSE
)
Count by House Type

| House Type | Count |
|---|---|
| House / apartment | 208,980 |
| With parents | 11,938 |
| Municipal apartment | 7,411 |
| Rented apartment | 3,549 |
| Office apartment | 2,270 |
| Co-op apartment | 937 |
Code
count_results$cnt_fam_members %>%
arrange(desc(n)) %>%
gt() %>%
tab_options(
table.width = pct(100)
) %>%
cols_align(align = "center") %>%
cols_label(
cnt_fam_members = "Number of Family Members",
n = "Count"
) %>%
tab_header(
title = "Count by Number of Family Members",
) %>%
fmt_number(
columns = c("n"),
decimals = 2,
drop_trailing_zeros = TRUE,
#drop_trailing_dec_mark = FALSE
)
Count by Number of Family Members

| Number of Family Members | Count |
|---|---|
| 2 | 118,852 |
| 3 | 49,232 |
| 1 | 38,902 |
| 4 | 24,556 |
| 5 | 3,138 |
| 6 | 282 |
| 7 | 109 |
| 11 | 4 |
| 14 | 4 |
| 15 | 3 |
| 9 | 2 |
| 20 | 1 |
Code
count_results$code_gender %>%
arrange(desc(n)) %>%
gt() %>%
tab_options(
table.width = pct(100)
) %>%
cols_align(align = "center") %>%
cols_label(
code_gender = "Gender",
n = "Count"
) %>%
tab_header(
title = "Count by Gender",
) %>%
fmt_number(
columns = c("n"),
decimals = 2,
drop_trailing_zeros = TRUE,
#drop_trailing_dec_mark = FALSE
)
Count by Gender

| Gender | Count |
|---|---|
| F | 147,505 |
| M | 87,580 |
5 Data preparation
From the EDA, we know we can drop:
- the rows with NA occupation_type.
- the observation with id 6392180, which is NA in almost every column (the cleaning step is sketched below).
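This chunk is not shown in the rendered post; a minimal sketch, assuming the cleaned frame is stored in a new df_model (a name not used in the original notebook):
Code
# drop rows with missing occupation_type and the single quasi-empty observation found in the EDA
df_model = df[df["occupation_type"].notna() & (df["id"] != 6392180)].reset_index(drop=True)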
6 Split
To take the class imbalance into account and preserve it in the split, we can use the stratify argument of train_test_split(), as in the sketch below.
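A minimal sketch of the stratified split, assuming df_model from the previous section and a 20% test share (consistent with the test support of 47,017 reported later):
Code
x = df_model[config.features]
y = df_model["target"]
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y,
    test_size=0.2,
    stratify=y,  # keep the 0/1 proportions identical in train and test
    random_state=config.random_state
)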
7 Models
To address model explainability, we can choose a CART model. These models have two interesting points for our task:
- Explainability, as already mentioned. This will help us explain the reason for a rejection.
- Feature selection. At this link you can find a paper which shows how embedded methods such as tree models can match the performance of traditional feature selection methods.
It could also be interesting to do a Recursive Feature Elimination. Right now some columns are removed before training based on personal knowledge; the CART model will then perform the feature selection, as the paper above explains.
We are going to implement 3 models with an increasing level of complexity:
- DecisionTreeClassifier.
- RandomForestClassifier.
- XGBoostClassifier.
7.1 DecisionTreeClassifier
7.1.1 Pipeline
Even if the results of the model are relatively easy to understand, the hyperparameter tuning has far too many possibilities to explore by hand. This is where GridSearchCV() comes in handy. Check Listing 1 for more details.
One of the many nice sklearn features is the Pipeline(). It has lots of important benefits:
- Code reuse and reproducibility: it is like a recipe you can tune and run without worrying about putting each step together in the correct order.
- Avoiding data leakage: the preprocessing steps are fit only on the training data.
- Hyperparameter tuning: the Pipeline() integrates with GridSearchCV() and RandomizedSearchCV().
- Modularity: it enables tuning the various steps, removing them or adding new ones.
First, initialize the various preprocessors and the classification model.
Using the ColumnTransformer(), we can apply a different preprocessing to each group of columns. The preprocessing is divided between:
- Encoding the binary labels to 0-1 using OneHotEncoder(drop="if_binary").
- Encoding the remaining categorical labels with OrdinalEncoder(), as defined in the Config class. A one-hot encoding could also be used here to avoid imposing a hierarchy on the categories.
- Standardizing the numerical variables using StandardScaler(). I also instantiated the MinMaxScaler(), but I am not using it.
The classifier is a DecisionTreeClassifier(), which the GridSearchCV() will tune.
All of these steps are put together with the Pipeline().
For the binary labels I had to use OneHotEncoder(drop="if_binary"), otherwise the other preprocessors would not work with the ColumnTransformer().
7.1.2 Fit
Check Listing 1 for more details.
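The fitting chunk is folded; a sketch of what it might look like, using the Config factory and the split from Section 6:
Code
# build the grid search defined in the Config class and fit it on the training split
grid_search_dtc = config.make_dtc()
grid_search_dtc.fit(xtrain, ytrain)
# refit="f1" means best_estimator_ is the best pipeline refitted on the whole training set
dtc_model = grid_search_dtc.best_estimator_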
7.1.3 Pickle
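The serialization chunk is folded as well; a sketch using the Config helpers (the file name is illustrative):
Code
# persist the fitted search so later renders can reload it instead of retraining
pkl_file = os.path.expanduser(config.pkl_path) + "grid_search_dtc.pkl"
config.dump_pkl(pkl_file, grid_search_dtc)
# grid_search_dtc = config.load_pkl(pkl_file)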
7.1.4 Classification Report
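The table below is rendered in R from a pandas data frame. A sketch of how df_cr_dtc could be built on the Python side with classification_report (imported earlier), reusing the names from the sketches above:
Code
ypred = dtc_model.predict(xtest)
df_cr_dtc = (
    pd.DataFrame(classification_report(ytest, ypred, output_dict=True))
    .transpose()
    .reset_index()
)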
Code
library(reticulate)
df_cr <- py$df_cr_dtc %>% dplyr::rename(names = index)
cols <- df_cr %>% colnames()
df_cr %>%
pivot_longer(
cols = -names,
names_to = "metrics",
values_to = "values"
) %>%
pivot_wider(
names_from = names,
values_from = values
) %>%
gt() %>%
tab_header(
title = "Confusion Matrix",
subtitle = "Threshold optimization favoring recall"
) %>%
fmt_number(
columns = c("precision", "recall", "f1-score", "support"),
decimals = 2,
drop_trailing_zeros = TRUE,
drop_trailing_dec_mark = FALSE
) %>%
cols_align(
align = "center",
columns = c("precision", "recall", "f1-score", "support")
) %>%
cols_align(
align = "left",
columns = metrics
) %>%
cols_label(
metrics = "Metrics",
precision = "Precision",
recall = "Recall",
`f1-score` = "F1-Score",
support = "Support"
) %>%
tab_options(
table.width = pct(100)
)
Classification Report: threshold optimization favoring recall

| Metrics | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.99 | 0.74 | 0.85 | 41,986 |
| 1 | 0.3 | 0.94 | 0.46 | 5,031 |
| accuracy | 0.76 | 0.76 | 0.76 | 0.76 |
| macro avg | 0.65 | 0.84 | 0.65 | 47,017 |
| weighted avg | 0.92 | 0.76 | 0.81 | 47,017 |
7.2 RandomForestClassifier
7.2.1 Pipeline
Check Listing 1 for more details.
7.2.2 Fit
7.2.3 Pickle
7.2.4 Classification
Code
library(reticulate)
df_cr <- py$df_cr_rfc %>% dplyr::rename(names = index)
cols <- df_cr %>% colnames()
df_cr %>%
pivot_longer(
cols = -names,
names_to = "metrics",
values_to = "values"
) %>%
pivot_wider(
names_from = names,
values_from = values
) %>%
gt() %>%
tab_header(
title = "Confusion Matrix",
subtitle = "Threshold optimization favoring recall"
) %>%
fmt_number(
columns = c("precision", "recall", "f1-score", "support"),
decimals = 2,
drop_trailing_zeros = TRUE,
drop_trailing_dec_mark = FALSE
) %>%
cols_align(
align = "center",
columns = c("precision", "recall", "f1-score", "support")
) %>%
cols_align(
align = "left",
columns = metrics
) %>%
cols_label(
metrics = "Metrics",
precision = "Precision",
recall = "Recall",
`f1-score` = "F1-Score",
support = "Support"
) %>%
tab_options(
table.width = pct(100)
)
Classification Report: threshold optimization favoring recall

| Metrics | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.99 | 0.77 | 0.86 | 41,986 |
| 1 | 0.33 | 0.95 | 0.49 | 5,031 |
| accuracy | 0.79 | 0.79 | 0.79 | 0.79 |
| macro avg | 0.66 | 0.86 | 0.68 | 47,017 |
| weighted avg | 0.92 | 0.79 | 0.82 | 47,017 |
7.3 XGBoostClassifier
7.3.1 Pipeline
Check Listing 1 for more details.
7.3.2 Fit
7.3.3 Pickle
7.3.4 Classification
Code
library(reticulate)
df_cr <- py$df_cr_xgb %>% dplyr::rename(names = index)
cols <- df_cr %>% colnames()
df_cr %>%
pivot_longer(
cols = -names,
names_to = "metrics",
values_to = "values"
) %>%
pivot_wider(
names_from = names,
values_from = values
) %>%
gt() %>%
tab_header(
title = "Confusion Matrix",
subtitle = "Threshold optimization favoring recall"
) %>%
fmt_number(
columns = c("precision", "recall", "f1-score", "support"),
decimals = 2,
drop_trailing_zeros = TRUE,
drop_trailing_dec_mark = FALSE
) %>%
cols_align(
align = "center",
columns = c("precision", "recall", "f1-score", "support")
) %>%
cols_align(
align = "left",
columns = metrics
) %>%
cols_label(
metrics = "Metrics",
precision = "Precision",
recall = "Recall",
`f1-score` = "F1-Score",
support = "Support"
) %>%
tab_options(
table.width = pct(100)
)
Classification Report: threshold optimization favoring recall

| Metrics | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.99 | 0.82 | 0.9 | 41,986 |
| 1 | 0.39 | 0.95 | 0.55 | 5,031 |
| accuracy | 0.83 | 0.83 | 0.83 | 0.83 |
| macro avg | 0.69 | 0.89 | 0.72 | 47,017 |
| weighted avg | 0.93 | 0.83 | 0.86 | 47,017 |
8 Models comparison
Let's check the performance of all the models with a plot. The ROC curve makes this possible: it is a plot useful for classification tasks, showing the False Positive Rate against the True Positive Rate, with the AUC summarizing it in a single number.
Code
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 15))
model_names = ["dtc", "rfc", "xgb"]
models_ = [dtc_model, rfc_model, xgb_model]
models = zip(model_names, models_)
for name, model in models:
    fpr, tpr, _ = roc_curve(ytest, model.predict_proba(xtest)[:, 1])
    auc_score = roc_auc_score(ytest, model.predict_proba(xtest)[:, 1])
    plt.plot(fpr, tpr, label=f'{name} (AUC: {auc_score:.2f})')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves Comparison')
plt.legend()
plt.show()
[<matplotlib.lines.Line2D object at 0x159a0f850>]
[<matplotlib.lines.Line2D object at 0x161d3f090>]
[<matplotlib.lines.Line2D object at 0x161d3f610>]
Text(0.5, 0, 'False Positive Rate')
Text(0, 0.5, 'True Positive Rate')
Text(0.5, 1.0, 'ROC Curves Comparison')
<matplotlib.legend.Legend object at 0x161d1d950>
Based on the ROC AUC, the best model is the XGBoost, because its Area Under the Curve is the highest.
9 Explainability
What if the model predicts a low creditworthiness and the business wants to know why? Shapley values make this possible. They come from cooperative game theory: for more details, check this link and this link.
For our purpose, I have found 3 observations we can use. It is also useful to separate the preprocessor and classifier parts of the pipeline.
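The SHAP setup is folded; a minimal sketch of the separation and of the explainer used by the plots below, assuming xgb_model is the refitted best XGBoost pipeline:
Code
import shap
# separate the fitted pipeline into its preprocessing and classification steps
preprocessor = xgb_model.named_steps["preprocessor"]
xgb_clf = xgb_model.named_steps["xgb"]
clf = xgb_clf
# a tree explainer on the booster gives one SHAP value per feature and observation
explainer = shap.TreeExplainer(xgb_clf)
shap_values = explainer.shap_values(preprocessor.transform(xtest))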
Code
# retain features name
xtrain_processed = preprocessor.transform(xtrain)
xtrain_processed = pd.DataFrame(xtrain_processed, columns=config.features)
# retain features name
xtest_processed = preprocessor.transform(xtest)
xtest_processed = pd.DataFrame(xtest_processed, columns=config.features)
# retain features name
clf.get_booster().feature_names = config.features
# convert to DMatrix
dtrain = DMatrix(xtrain_processed, label=ytrain)
dtest = DMatrix(xtest_processed, label=ytest)
# useful idxs
idx = [15454, 1284, 30305]
x_processed = preprocessor.transform(x)
[Figure: Feature Importance by Gain, showing the F score for each feature]
The importance plot tells us the information gain each feature contributes to the model prediction. The 3 most important features are also the most logical ones:
- Total income amount.
- Number of children.
- Type of housing.
Importance provides a score that indicates how useful each feature is in the construction of the boosted decision trees. The more an attribute is used to make key decisions within the trees, the higher its relative importance.
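The chunk that produced the importance plot is folded; a sketch of one way to generate it with xgboost's plot_importance, using the classifier step extracted above:
Code
from xgboost import plot_importance
# gain-based importance of the fitted booster
plot_importance(clf, importance_type="gain", title="Feature Importance by Gain")
plt.show()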
Code
# force plot
plt.figure(figsize=(10, 8))
plot_force = shap.plots.force(
base_value=explainer.expected_value,
shap_values=shap_values[idx[0], :],
# features=None,
feature_names=config.features,
# out_names=None,
# link='identity', # "logit"
plot_cmap='RdBu',
matplotlib=True,
show=True
)
plt.savefig("plot_force.jpeg", dpi=200)
# calculate specific shap explainer for waterfall plot
explainer_ = shap.Explainer(model=xgb_clf)
shap_values_ = explainer_(x_processed)
shap_values_.feature_names = config.features
# waterfall plot
shap.plots.waterfall(
shap_values_[0],
max_display=len(config.features)
)
plt.savefig("plot_waterfall.jpeg", dpi=200)
# bar plot
shap.plots.bar(
shap_values_[0],
max_display=len(config.features)
)
plt.savefig("plot_bar.jpeg", dpi=200)
The force plot explains the contribution of features to a specific model prediction.
Base value. The model's expected output, or mean prediction over the dataset. It is the prediction the model would make without any specific input features, and it is the starting point of the plot.
Feature contributions. The SHAP values represent how much each feature contributes to the difference between the base value and the final prediction for a specific observation.
Positive contributions. Features pushing the prediction higher. Shown in red.
Negative contributions. Features pulling the prediction lower. Shown in blue. The magnitude of each bar indicates the size of the feature’s contribution.
Final Prediction. Endpoint of the plot. It is the sum of the base value and all the SHAP values for that observation.
Waterfall plots display explanations for individual predictions. The bottom of a waterfall plot starts at the expected value of the model output, and each row then shows how the positive (red) or negative (blue) contribution of each feature moves the value from the expected model output over the background dataset to the model output for this prediction.
Due to venv incompatibilities between shap and other packages, the shap evaluation has been made on Kaggle. The plots are the results of the notebook.