Credit Score

ML explainability classification
code
ML
Python, R
Author

Simone Brazzi

Published

October 26, 2024

1 Introduction

The goal is to create a model capable of estimating the creditworthiness of customers, in order to help the dedicated team decide whether or not to accept a credit card application.

1.1 Features

  • ID: customer identification number
  • CODE_GENDER: gender of the customer
  • FLAG_OWN_CAR: indicator of car ownership
  • FLAG_OWN_REALTY: indicator of house ownership
  • CNT_CHILDREN: number of children
  • AMT_INCOME_TOTAL: annual income
  • NAME_INCOME_TYPE: type of income
  • NAME_EDUCATION_TYPE: level of education
  • NAME_FAMILY_STATUS: marital status
  • NAME_HOUSING_TYPE: type of dwelling
  • DAYS_BIRTH: number of days since birth (counted backwards, so negative)
  • DAYS_EMPLOYED: number of days since employment began (negative; if positive, it indicates the number of days the person has been unemployed)
  • FLAG_MOBIL: indicator for the presence of a mobile phone number
  • FLAG_WORK_PHONE: indicator for the presence of a work phone number
  • FLAG_PHONE: indicator for the presence of a telephone number
  • FLAG_EMAIL: indicator for the presence of an email address
  • OCCUPATION_TYPE: type of occupation
  • CNT_FAM_MEMBERS: number of family members
  • TARGET: variable equal to 1 if the customer has a high creditworthiness (constant payment of installments), 0 otherwise.

If a customer is denied a credit card, the team must be able to give a reason. This means that your model must provide indications that are easy to interpret.

It is a binary classification task which needs good explainability.

2 Import

2.1 R

Code
library(tidyverse, verbose = FALSE)
library(ggplot2)
library(plotly)
library(reticulate)
library(gt)
library(scales)
library(shapr)

2.2 Python

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import time
import os

from datetime import date
from dateutil.relativedelta import relativedelta
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, MinMaxScaler, LabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from xgboost import DMatrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

2.3 Config class

Listing 1: Config class

class Config():
    def __init__(self):
        """
        Initialize configuration settings for credit scoring model pipeline.
        Sets up file paths, random state, and feature column classifications.
        """
        # Base directory and file paths
        self.path="/Users/simonebrazzi/R/blog/posts/credit_score/"
        self.file="/Users/simonebrazzi/R/blog/posts/credit_score/credit_scoring.csv"
        self.pkl_path="~/R/blog/posts/credit_score/pkl/"
        
        # Set random state for reproducibility
        self.random_state=42
        
        # Define feature columns by type
        self.col_binary = ['code_gender', 'flag_own_car', 'flag_own_realty']
        self.col_ordinal = ["name_income_type", "name_education_type", "name_family_status", "name_housing_type", "occupation_type"]
        self.col_numeric = ['cnt_children', 'amt_income_total', 'cnt_fam_members']
        self.features = self.col_binary + self.col_ordinal + self.col_numeric
    
    def dump_pkl(self, path, model):
        """
        Serialize model to pickle file.
        Args:
            path: Output file path
            model: Model to serialize
        """
        with open(path, "wb") as f:
            pickle.dump(model, f)
    
    def load_pkl(self, path):
        """
        Load model from pickle file.
        Args:
            path: Input pickle file path
        Returns:
            Deserialized model
        """
        with open(path, "rb") as f:
            model = pickle.load(f)
        return model
    
    def make_dtc(self):
        """
        Create Decision Tree Classifier pipeline with preprocessing and grid search.
        Returns:
            GridSearchCV object configured for DTC
        """
        # Initialize encoders and scaler
        ohe = OneHotEncoder(drop='if_binary', sparse_output=False)
        oe = OrdinalEncoder().set_output(transform='pandas')
        ss = StandardScaler()
        
        # Create balanced Decision Tree
        dtc = DecisionTreeClassifier(class_weight="balanced", random_state=self.random_state)
        
        # Set up preprocessing pipeline
        preprocessor = ColumnTransformer(
            transformers=[
                ("binary_encoder", ohe, config.col_binary),
                ("ordinal_encoder", oe, config.col_ordinal),
                ("standard_scaler", ss, config.col_numeric)
            ],
            remainder='passthrough'
        )
        
        # Create full pipeline
        pipe_dtc = Pipeline(
            steps=[
                ("preprocessor", preprocessor),
                ("dtc", dtc)
            ]
        )
        
        # Define hyperparameter search space
        param_grid_dtc = {
            "dtc__criterion": ["gini", "entropy", "log_loss"],
            "dtc__splitter": ["best", "random"],
            "dtc__max_depth": [1, 2, 5, 10, 15]
        }
        
        # Set up multiple scoring metrics
        scoring = {
            "accuracy": "accuracy",
            "f1": "f1",
            "roc_auc": "roc_auc"
        }
        
        # Configure grid search
        grid_search_dtc = GridSearchCV(
            estimator=pipe_dtc,
            param_grid=param_grid_dtc,
            cv=7,
            scoring=scoring,
            refit="f1",
            n_jobs=-1
        )
        
        return grid_search_dtc
    
    def make_rfc(self):
        """
        Create Random Forest Classifier pipeline with preprocessing and grid search.
        Returns:
            GridSearchCV object configured for RFC
        """
        # Initialize preprocessing components
        ohe = OneHotEncoder(drop='if_binary', sparse_output=False)
        oe = OrdinalEncoder().set_output(transform='pandas')
        ss = StandardScaler()
        
        # Create balanced Random Forest
        rfc = RandomForestClassifier(class_weight="balanced", random_state=self.random_state)
        
        # Set up preprocessing pipeline
        preprocessor = ColumnTransformer(
            transformers=[
                ("binary_encoder", ohe, config.col_binary),
                ("ordinal_encoder", oe, config.col_ordinal),
                ("standard_scaler", ss, config.col_numeric)
            ],
            remainder='passthrough'
        )
        
        # Create full pipeline
        pipe_rfc = Pipeline(
            steps=[
                ("preprocessor", preprocessor),
                ("rfc", rfc)
            ]
        )
        
        # Define parameter search space
        param_dist_rfc = {
            'rfc__n_estimators': [50, 75, 100],
            'rfc__max_depth': [5, 10, 15],
            'rfc__min_samples_split': [5, 10]
        }
        
        # Set up scoring metrics
        scoring = {
            "accuracy": "accuracy",
            "f1": "f1",
            "roc_auc": "roc_auc"
        }
        
        # Configure grid search
        grid_search_rfc = GridSearchCV(
            estimator=pipe_rfc,
            param_grid=param_dist_rfc,
            cv=7,
            scoring=scoring,
            refit="f1",
            n_jobs=-1
        )
        
        return grid_search_rfc
    
    def make_xgb(self):
        """
        Create XGBoost Classifier pipeline with preprocessing and randomized search.
        Returns:
            RandomizedSearchCV object configured for XGBoost
        """
        # Initialize preprocessing components
        ohe = OneHotEncoder(drop='if_binary', sparse_output=False)
        oe = OrdinalEncoder().set_output(transform='pandas')
        ss = StandardScaler()
        
        # Set up preprocessing pipeline
        preprocessor = ColumnTransformer(
            transformers=[
                ("binary_encoder", ohe, config.col_binary),
                ("ordinal_encoder", oe, config.col_ordinal),
                ("standard_scaler", ss, config.col_numeric)
            ],
            remainder='passthrough'
        )
        
        # Class weight ratio for the unbalanced target.
        # NOTE: this relies on the global y defined in the Split section,
        # so make_xgb() must be called after the split.
        scale_pos_weight = len(y[y == 0]) / len(y[y == 1])
        
        # Create XGBoost classifier
        xgb = XGBClassifier(
            n_jobs=-1,
            enable_categorical=True,
            scale_pos_weight=scale_pos_weight,
            random_state=self.random_state
        )
        
        # Create full pipeline
        pipe_xgb = Pipeline(
            steps=[
                ("preprocessor", preprocessor),
                ("xgb", xgb)
            ]
        )
        
        # Define comprehensive parameter search space
        param_dist_xgb = {
            "xgb__n_estimators": [100, 150, 200, 300],
            "xgb__max_depth": [3, 5, 7, 10],
            "xgb__learning_rate": [0.1, 0.01, 0.001, 0.0001],
            "xgb__subsample": [0.7, 0.8, 0.9],
            "xgb__colsample_bytree": [0.7, 0.8, 0.9],
            "xgb__gamma": [0, 0.1],
            "xgb__alpha": [0, 0.1],
            "xgb__lambda": [1, 2]
        }
        
        # Set up scoring metrics
        scoring = {
            "accuracy": "accuracy",
            "f1": "f1",
            "roc_auc": "roc_auc"
        }
        
        # Configure randomized search
        random_search_xgb = RandomizedSearchCV(
            estimator=pipe_xgb,
            param_distributions=param_dist_xgb,
            n_iter=30,
            cv=7,
            scoring=scoring,
            refit="f1",
            n_jobs=-1
        )
        
        return random_search_xgb

# Create global config instance
config = Config()

3 Dataset

Now we can import the dataset. Since all column names are UPPERCASE, I will convert them to lowercase.

Code
df = pd.read_csv(config.file)
df.columns = df.columns.str.lower()

# df["birthday"] = df.days_birth.map(lambda x : (date.today() + relativedelta(days=x)).strftime("%Y-%m-%d"))
# df["employed"] = df["days_employed"].apply(lambda x: (date.today() + relativedelta(days=x)).strftime("%Y-%m-%d") if x < 0 else "0")

4 EDA

4.1 Target

It is always important to perform an EDA. In this case I am expecting fewer cases of high creditworthiness than low: in my personal knowledge and experience, high creditworthiness clients are not the majority. Also, considering it is a binary classification, it is important to check whether the data are balanced, to make a better model selection.

Code
value_count = np.unique(df.target, return_counts=True)

var0 = value_count[0][0]
var0_count = value_count[1][0]

var1 = value_count[0][1]
var1_count = value_count[1][1]

The target variable:

  • label 0 has 308,705 observations.
  • label 1 has 29,722 observations.
Code
library(reticulate)

df <- py$df

df_g <- df %>% 
  rename(label = target) %>% 
  mutate(label = as.factor(label)) %>% 
  summarize(
    freq_abs = n(),
    freq_rel = n() / nrow(df),
    .by = label
  )
df_g %>% 
  gt() %>% 
  fmt_auto() %>% 
  cols_width(
    label ~ pct(20),
    freq_abs ~ pct(35),
    freq_rel ~ pct(35)
    ) %>% 
  cols_align(
    align = "center",
    columns = c(freq_abs, freq_rel)
  ) %>% 
  tab_header(
    title = "Label frequency",
    subtitle = "Absolute and relative frequencies"
  ) %>% 
  cols_label(
    label = "Label",
    freq_abs = "Absolute frequency",
    freq_rel = "Relative frequency"
  ) %>% 
  tab_options(
    table.width = pct(100)
    )
Table 1: Absolute and relative frequencies
Label frequency
Absolute and relative frequencies
Label Absolute frequency Relative frequency
0 308,705  0.912
1  29,722  0.088

The dataset is unbalanced, as expected: most of the targets are low creditworthiness (label 0). This matters for training: on unbalanced datasets, accuracy is misleading and models risk simply learning the majority class. If 9 out of 10 observations are 0 and you always predict 0, the model has an accuracy of 0.9!
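To make the point concrete, here is a minimal sketch (with toy data, not the dataset above) showing how a majority-class predictor gets a high accuracy while its F1 score collapses:

Code
from sklearn.metrics import accuracy_score, f1_score
import numpy as np

# 9 out of 10 observations are 0, and we always predict 0
y_true = np.array([0] * 9 + [1])
y_pred = np.zeros(10, dtype=int)

print(accuracy_score(y_true, y_pred))             # 0.9, looks great
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0, label 1 is never caught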

Code
# check for unique values
check_unique_values <- tibble(
  names = names(df),
  values = map(df, ~ .x %>% str_unique() %>% length())
) %>% 
  unnest(values)

df %>% 
  summary()
       id          code_gender        flag_own_car       flag_own_realty   
 Min.   :5008804   Length:338427      Length:338427      Length:338427     
 1st Qu.:5439602   Class :character   Class :character   Class :character  
 Median :5878907   Mode  :character   Mode  :character   Mode  :character  
 Mean   :5821200                                                           
 3rd Qu.:6140206                                                           
 Max.   :6841875                                                           
                                                                           
  cnt_children     amt_income_total  name_income_type   name_education_type
 Min.   : 0.0000   Min.   :  26100   Length:338427      Length:338427      
 1st Qu.: 0.0000   1st Qu.: 121500   Class :character   Class :character   
 Median : 0.0000   Median : 162000   Mode  :character   Mode  :character   
 Mean   : 0.4289   Mean   : 187654                                         
 3rd Qu.: 1.0000   3rd Qu.: 225000                                         
 Max.   :19.0000   Max.   :6750000                                         
                                                                           
 name_family_status name_housing_type    days_birth     days_employed   
 Length:338427      Length:338427      Min.   :-25201   Min.   :-17531  
 Class :character   Class :character   1st Qu.:-19482   1st Qu.: -3116  
 Mode  :character   Mode  :character   Median :-15622   Median : -1485  
                                       Mean   :-15998   Mean   : 60238  
                                       3rd Qu.:-12524   3rd Qu.:  -380  
                                       Max.   : -7489   Max.   :365243  
                                       NA's   :1        NA's   :1       
   flag_mobil flag_work_phone    flag_phone       flag_email    
 Min.   :1    Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:1    1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :1    Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :1    Mean   :0.2114   Mean   :0.2933   Mean   :0.1052  
 3rd Qu.:1    3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.0000  
 Max.   :1    Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
 NA's   :1    NA's   :1        NA's   :1        NA's   :1       
 occupation_type    cnt_fam_members      target       
 Length:338427      Min.   : 1.000   Min.   :0.00000  
 Class :character   1st Qu.: 2.000   1st Qu.:0.00000  
 Mode  :character   Median : 2.000   Median :0.00000  
                    Mean   : 2.197   Mean   :0.08782  
                    3rd Qu.: 3.000   3rd Qu.:0.00000  
                    Max.   :20.000   Max.   :1.00000  
                    NA's   :1                         
Code
g <- ggplot(df_g) +
  geom_bar(aes(x = label, y = freq_abs, fill = label), stat = "identity") +
  xlab("Label") +
  ylab("Absolute frequency") +
  labs(fill = "Label") +
  scale_y_continuous(breaks = scales::breaks_pretty(n = 5)) +
  theme_minimal()
ggplotly(g)
Figure 1: Creditworthiness absolute frequency
  • 0 stands for low creditworthiness.
  • 1 stands for high creditworthiness.
  • Our dataset is strongly unbalanced towards the 0 label.

4.2 Features

Let's check the NA values in each feature.

Code
df_na <- df %>%
  summarize(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(
    cols = everything(),
    names_to = "columns",
    values_to = "value"
  )
df_na %>% 
  gt() %>% 
  tab_options(
    table.width = pct(100)
    ) %>% 
  cols_align(align = "center") %>% 
  cols_label(
    columns = "Columns",
    value = "NA Value"
  ) %>% 
  tab_header(
    title = "NA values",
    subtitle = "Number of NA value for each feature"
  ) %>% 
  fmt_number(
    columns = c("value"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    #drop_trailing_dec_mark = FALSE
  ) 
Table 2: Number of NA values for each feature
NA values
Number of NA value for each feature
Columns NA Value
id 0
code_gender 0
flag_own_car 0
flag_own_realty 0
cnt_children 0
amt_income_total 0
name_income_type 0
name_education_type 0
name_family_status 1
name_housing_type 1
days_birth 1
days_employed 1
flag_mobil 1
flag_work_phone 1
flag_phone 1
flag_email 1
occupation_type 103,342
cnt_fam_members 1
target 0

There are 2 interesting things to notice.

First, the column occupation_type has lots of NA values. How many are they in proportion to the total? Should we drop the NAs or keep them?

Code
df %>%
  summarize(
    freq_abs = sum(is.na(occupation_type)),
    freq_rel = sum(is.na(occupation_type)) / nrow(df)
    ) %>% 
  gt() %>% 
  tab_options(
    table.width = pct(100)
    ) %>% 
  cols_align(align = "center") %>% 
  cols_label(
    freq_abs = "Absolute Frequency",
    freq_rel = "Relative Frequency"
  ) %>% 
  tab_header(
    title = "Absolute and Relative Frequency",
    subtitle = "NA frequency for occupation_type"
  ) %>% 
  fmt_number(
    columns = c("freq_abs", "freq_rel"),
    decimals = 2,
    drop_trailing_zeros = TRUE
  ) 
Table 3: NA frequency for occupation_type
Absolute and Relative Frequency
NA frequency for occupation_type
Absolute Frequency Relative Frequency
103,342 0.31

30.54% of occupation_type values are NA. This could be a big concern: how should we handle it? Dropping this much could mean losing lots of useful information, while keeping it could lead to imputing wrong data, which could be worse.

Let's check the frequency of the occupation_type labels.

Code
df %>% 
  summarise(
    n = n(),
    .by = occupation_type
  ) %>% 
  arrange(desc(n)) %>% 
  gt() %>% 
  tab_options(
    table.width = pct(100)
    ) %>% 
  cols_align(align = "center") %>% 
  cols_label(
    occupation_type = "Occupation Type",
    n = "Frequency"
  ) %>% 
  tab_header(
    title = "Occupation Type Label Frequency"
  ) %>% 
  fmt_number(
    columns = c("n"),
    decimals = 2,
    drop_trailing_zeros = TRUE
  ) 
Table 4: Occupation type label frequency
Occupation Type Label Frequency
Occupation Type Frequency
NA 103,342
Laborers 60,146
Core staff 33,527
Sales staff 31,652
Managers 27,384
Drivers 20,020
High skill tech staff 13,399
Accountants 12,281
Medicine staff 10,438
Cooking staff 6,248
Security staff 6,218
Cleaning staff 4,594
Private service staff 2,787
Low-skill Laborers 1,714
Secretaries 1,577
Waiters/barmen staff 1,245
Realty agents 852
HR staff 567
IT staff 436

So here is the situation:

  1. we could impute the NAs respecting the proportion of the other labels in the dataset, or
  2. we could drop the NA values.

I want to drop the NAs. Even if we lose a third of the data, we are not making assumptions based on the mode: occupation_type could be a very important feature, and inferring it wrongly is worse than losing some data. Also, we still have 235,085 observations, which are plenty.
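Although I go with option 2, here is a minimal sketch of what option 1 (proportional imputation) could look like on the pandas df; it is hypothetical and not used below:

Code
# hypothetical: fill NAs by sampling labels according to their observed proportions
probs = df["occupation_type"].value_counts(normalize=True)  # NaNs excluded by default
n_missing = df["occupation_type"].isna().sum()
rng = np.random.default_rng(config.random_state)
df.loc[df["occupation_type"].isna(), "occupation_type"] = rng.choice(
    probs.index, size=n_missing, p=probs.values
)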

Code
df_occ_type <- df %>% 
  filter(!is.na(occupation_type)) %>% 
  summarise(
    n = n(),
    .by = occupation_type
  ) %>% 
  arrange(desc(n))

# to extend the palette to the number of labels
extended_palette <- colorRampPalette(RColorBrewer::brewer.pal(9, "Purples"))(df_occ_type %>% nrow())

g <- ggplot(df_occ_type) +
  geom_col(aes(x = n, y = reorder(occupation_type, n), fill = reorder(occupation_type, n))) +
  theme_minimal() +
  labs(
    x = "Count",
    y = "Labels",
    title = "Occupation Type Label Frequency"
  ) +
  scale_fill_manual(values = extended_palette)

ggplotly(g)
Figure 2: Barplot Occupation Type Label frequency

Second, there is a NA value in almost every column: is it possible that it is the same observation?

To check, we can create a list of the column names which have only 1 NA, using the df_na tibble.

Code
one_cols <- df_na %>% 
  filter(value == 1) %>% 
  pull(columns)
one_cols
[1] "name_family_status" "name_housing_type"  "days_birth"        
[4] "days_employed"      "flag_mobil"         "flag_work_phone"   
[7] "flag_phone"         "flag_email"         "cnt_fam_members"   

Then, to check multiple OR conditions at once, we can use if_any() in R. In this case, we are checking whether any of the columns with exactly one NA is missing for a given row.

Code
library(reticulate)

df %>% 
  filter(
    if_any(all_of(one_cols), ~ is.na(.x))
  ) %>% 
  gt()  %>% 
  tab_options(
    table.width = pct(100)
    )
id code_gender flag_own_car flag_own_realty cnt_children amt_income_total name_income_type name_education_type name_family_status name_housing_type days_birth days_employed flag_mobil flag_work_phone flag_phone flag_email occupation_type cnt_fam_members target
6392180 F N N 0 67500 Working Secondary / secondary special NA NA NaN NaN NaN NaN NaN NaN NA NaN 0
Code
py$id_to_remove <- df %>% 
  filter(
    if_any(all_of(one_cols), ~ is.na(.x))
  ) %>% 
  pull(id)

Interesting features

Let's also look at some interesting variables: click on the different tabs to see the frequency table for each of them.

Note

A future upgrade will convert this tabset to a Shiny app. I am currently encountering some difficulties with shinylive: it is not showing the plot.

Code
# features to check
cols <- py$config$features
# 
count_results <- map(cols, ~ df %>% filter(!is.na(occupation_type)) %>% count(.data[[.x]]))
names(count_results) <- cols
cols
 [1] "code_gender"         "flag_own_car"        "flag_own_realty"    
 [4] "name_income_type"    "name_education_type" "name_family_status" 
 [7] "name_housing_type"   "occupation_type"     "cnt_children"       
[10] "amt_income_total"    "cnt_fam_members"    
Code
count_results$cnt_children %>% 
  arrange(desc(n)) %>% 
  gt() %>% 
  tab_options(
    table.width = pct(100)
    ) %>% 
  cols_align(align = "center") %>% 
  cols_label(
    cnt_children = "# Children",
    n = "Count"
  ) %>% 
  tab_header(
    title = "Count of Number of Children",
  ) %>% 
  fmt_number(
    columns = c("n"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    #drop_trailing_dec_mark = FALSE
  ) 
Table 5: Count of Number of Children
Count of Number of Children
# Children Count
0 148,909
1 56,238
2 26,196
3 3,319
4 300
5 109
9 4
12 4
14 3
7 2
19 1
Code
count_results$name_education_type %>% 
  arrange(desc(n)) %>% 
  gt()  %>% 
  tab_options(
    table.width = pct(100)
    ) %>% 
  cols_align(align = "center") %>% 
  cols_label(
    name_education_type = "Education Type",
    n = "Count"
  ) %>% 
  tab_header(
    title = "Count by Education Type",
  ) %>% 
  fmt_number(
    columns = c("n"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    #drop_trailing_dec_mark = FALSE
  ) 
Table 6: Count by Education Type
Count by Education Type
Education Type Count
Secondary / secondary special 157,822
Higher education 66,693
Incomplete higher 8,799
Lower secondary 1,609
Academic degree 162
Code
count_results$name_family_status %>% 
  arrange(desc(n)) %>% 
  gt()  %>% 
  tab_options(
    table.width = pct(100)
    ) %>% 
  cols_align(align = "center") %>% 
  cols_label(
    name_family_status = "Family Status",
    n = "Count"
  ) %>% 
  tab_header(
    title = "Count by Family Status",
  ) %>% 
  fmt_number(
    columns = c("n"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    #drop_trailing_dec_mark = FALSE
  ) 
Table 7: Count by Family Status
Count by Family Status
Family Status Count
Married 164,310
Single / not married 30,551
Civil marriage 20,968
Separated 14,108
Widow 5,148
Code
count_results$name_housing_type %>% 
  arrange(desc(n)) %>% 
  gt()  %>% 
  tab_options(
    table.width = pct(100)
    ) %>% 
  cols_align(align = "center") %>% 
  cols_label(
    name_housing_type = "House Type",
    n = "Count"
  ) %>% 
  tab_header(
    title = "Count by House Type",
  ) %>% 
  fmt_number(
    columns = c("n"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    #drop_trailing_dec_mark = FALSE
  ) 
Table 8: Count by House Type
Count by House Type
House Type Count
House / apartment 208,980
With parents 11,938
Municipal apartment 7,411
Rented apartment 3,549
Office apartment 2,270
Co-op apartment 937
Code
count_results$cnt_fam_members %>% 
  arrange(desc(n)) %>% 
  gt()  %>% 
  tab_options(
    table.width = pct(100)
    ) %>% 
  cols_align(align = "center") %>% 
  cols_label(
    cnt_fam_members = "Number of Family Members",
    n = "Count"
  ) %>% 
  tab_header(
    title = "Count by Number of Family Members",
  ) %>% 
  fmt_number(
    columns = c("n"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    #drop_trailing_dec_mark = FALSE
  ) 
Table 9: Count by Number of Family Members
Count by Number of Family Members
Number of Family Members Count
2 118,852
3 49,232
1 38,902
4 24,556
5 3,138
6 282
7 109
11 4
14 4
15 3
9 2
20 1
Code
count_results$code_gender %>% 
  arrange(desc(n)) %>% 
  gt() %>% 
  tab_options(
    table.width = pct(100)
    ) %>% 
  cols_align(align = "center") %>% 
  cols_label(
    code_gender = "Gender",
    n = "Count"
  ) %>% 
  tab_header(
    title = "Count by Gender",
  ) %>% 
  fmt_number(
    columns = c("n"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    #drop_trailing_dec_mark = FALSE
  ) 
Table 10: Count by Gender
Count by Gender
Gender Count
F 147,505
M 87,580

5 Data preparation

From the EDA, we know we can drop:

  • the rows with NA occupation_type.
  • the id 6392180 (the single row of NAs found above).
Code
df = df[df.id != id_to_remove]
df = df[~df.occupation_type.isna()]
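
As a quick sanity check (a sketch), the remaining row count should match the EDA figure: the removed id is among the occupation_type NAs, so we expect 338,427 - 103,342 = 235,085 rows.

Code
print(len(df))  # expected: 235085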

6 Split

To preserve the class imbalance in the split as well, we can use the stratify argument.

Code
x = df[config.features]
y = df.target

xtrain, xtest, ytrain, ytest = train_test_split(
    x,
    y,
    test_size = .2,
    stratify = y,
    random_state=config.random_state
)
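
We can quickly verify (a small check, using the split above) that the label proportions are preserved in both sets:

Code
# relative label frequencies should be nearly identical across the three sets
print(y.value_counts(normalize=True))
print(ytrain.value_counts(normalize=True))
print(ytest.value_counts(normalize=True))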

7 Models

To address the model explainability, we can choose a CART model. These have 2 interesting points for our task:

  1. Explainability, as already mentioned. This will help us explain the reason for a rejection.
  2. Feature selection. At this link you can find a paper which shows how embedded methods such as tree models can match the performance of traditional feature selection methods.

It could also be interesting to do a Recursive Feature Elimination (see the sketch below). Right now the feature selection is made with personal knowledge, removing some columns before training; the CART model will perform the rest of the feature selection, as the paper above explains.
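
For reference, here is a minimal sketch of what that could look like, assuming the config instance from Listing 1 and the xtrain/ytrain split from section 6 (RFECV needs an already-encoded design matrix, so the preprocessing is applied first):

Code
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

# same preprocessing as Listing 1, applied once up front
preprocessor_rfe = ColumnTransformer(
    transformers=[
        ("binary_encoder", OneHotEncoder(drop="if_binary", sparse_output=False), config.col_binary),
        ("ordinal_encoder", OrdinalEncoder(), config.col_ordinal),
        ("standard_scaler", StandardScaler(), config.col_numeric)
    ]
)
xtrain_enc = preprocessor_rfe.fit_transform(xtrain)

# recursive feature elimination with 5-fold CV, scored on F1
rfecv = RFECV(
    estimator=DecisionTreeClassifier(class_weight="balanced", random_state=config.random_state),
    step=1,
    cv=5,
    scoring="f1",
    n_jobs=-1
)
rfecv.fit(xtrain_enc, ytrain)
print(rfecv.n_features_)  # number of features kept
print(rfecv.support_)     # mask over the encoded columns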

We are going to implement 3 models with an increasing level of complexity:

  1. DecisionTreeClassifier.
  2. RandomForestClassifier.
  3. XGBoostClassifier.

7.1 DecisionTreeClassifier

7.1.1 Pipeline

Even if the results of the model are relatively easy to understand, the hyperparameter tuning has far too many possibilities. This is where GridSearchCV() comes in handy. Check Listing 1 for more details.

Code
dtc = config.make_dtc()

One of the many nice sklearn features is the Pipeline(). It has lots of important benefits:

  • Less code redundancy and better reproducibility: it is like a recipe you can tune and run without worrying about putting each step together in the correct order.
  • Avoiding data leakage: the preprocessing steps are fit only on the training data.
  • Hyperparameter tuning: the Pipeline() integrates with GridSearchCV() and RandomizedSearchCV().
  • Modularity: it enables tuning the various steps, removing them or adding new ones.

First, initialize the various preprocessors and the classification model.

Using the ColumnTransformer(), we can apply a different preprocessing step to each group of columns. The preprocessing is divided between:

  • Encoding the binary labels to 0-1 using OneHotEncoder(drop="if_binary").
  • Encoding the remaining categorical labels using OrdinalEncoder().
  • Standardizing the numerical variables using StandardScaler(). I also instantiated a MinMaxScaler(), but I am not using it.

The classifier is a DecisionTreeClassifier(), which the GridSearchCV() will tune.

All of these steps are put together with the Pipeline().

For the binary labels I had to use OneHotEncoder(drop="if_binary"), otherwise the other preprocessors would not work with the ColumnTransformer().

7.1.2 Fit

Code
dtc_model = dtc.fit(xtrain, ytrain)

Check Listing 1 for more details.

7.1.3 Pickle

Code
config.dump_pkl(
  path=os.path.expanduser(config.pkl_path) + "dtc_model.pkl",
  model=dtc_model
  )
Code
dtc = config.load_pkl(
  path=os.path.expanduser(config.pkl_path) + "dtc_model.pkl"
  )
dtc_model = dtc.best_estimator_

7.1.4 Classification Report

Code
ypred_dtc = dtc_model.predict(xtest)
cr = classification_report(
  ytest,
  ypred_dtc,
  target_names=['0', '1'],
  digits=4,
  output_dict=True
  )
df_cr_dtc= pd.DataFrame.from_dict(cr).reset_index()
Code
library(reticulate)
df_cr <- py$df_cr_dtc %>% dplyr::rename(names = index)
cols <- df_cr %>% colnames()
df_cr %>% 
  pivot_longer(
    cols = -names,
    names_to = "metrics",
    values_to = "values"
  ) %>% 
  pivot_wider(
    names_from = names,
    values_from = values
  ) %>% 
  gt() %>%
  tab_header(
    title = "Confusion Matrix",
    subtitle = "Threshold optimization favoring recall"
  ) %>% 
  fmt_number(
    columns = c("precision", "recall", "f1-score", "support"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    drop_trailing_dec_mark = FALSE
  ) %>% 
  cols_align(
    align = "center",
    columns = c("precision", "recall", "f1-score", "support")
  ) %>% 
  cols_align(
    align = "left",
    columns = metrics
  ) %>% 
  cols_label(
    metrics = "Metrics",
    precision = "Precision",
    recall = "Recall",
    `f1-score` = "F1-Score",
    support = "Support"
  ) %>% 
  tab_options(
    table.width = pct(100)
    )
Table 11: Classification report DTC
Classification Report
Class weighting favoring recall
Metrics Precision Recall F1-Score Support
0 0.99 0.74 0.85 41,986.
1 0.3 0.94 0.46 5,031.
accuracy 0.76 0.76 0.76 0.76
macro avg 0.65 0.84 0.65 47,017.
weighted avg 0.92 0.76 0.81 47,017.

7.2 RandomForestClassifier

7.2.1 Pipeline

Code
rfc = config.make_rfc()

Check Listing 1 for more details.

7.2.2 Fit

Code
tic = time.time()
rfc_model = rfc.fit(xtrain, ytrain)
toc = time.time()
elapsed_time= toc - tic
print(f"Elapsed time: {elapsed_time} seconds")
print(f"Elapsed time: {elapsed_time / 60} minutes")

7.2.3 Pickle

Code
config.dump_pkl(
  path=os.path.expanduser(config.pkl_path) + "rfc_model.pkl",
  model=rfc_model
  )
Code
rfc = config.load_pkl(
  path=os.path.expanduser(config.pkl_path) + "rfc_model.pkl"
  )
rfc_model = rfc.best_estimator_

7.2.4 Classification

Code
ypred_rfc = rfc_model.predict(xtest)
cr_rfc = classification_report(
  ytest,
  ypred_rfc,
  # target_names=config.labels,
  digits=4,
  output_dict=True
  )
df_cr_rfc = pd.DataFrame.from_dict(cr_rfc).reset_index()
Code
library(reticulate)
df_cr <- py$df_cr_rfc %>% dplyr::rename(names = index)
cols <- df_cr %>% colnames()
df_cr %>% 
  pivot_longer(
    cols = -names,
    names_to = "metrics",
    values_to = "values"
  ) %>% 
  pivot_wider(
    names_from = names,
    values_from = values
  ) %>% 
  gt() %>%
  tab_header(
    title = "Confusion Matrix",
    subtitle = "Threshold optimization favoring recall"
  ) %>% 
  fmt_number(
    columns = c("precision", "recall", "f1-score", "support"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    drop_trailing_dec_mark = FALSE
  ) %>% 
  cols_align(
    align = "center",
    columns = c("precision", "recall", "f1-score", "support")
  ) %>% 
  cols_align(
    align = "left",
    columns = metrics
  ) %>% 
  cols_label(
    metrics = "Metrics",
    precision = "Precision",
    recall = "Recall",
    `f1-score` = "F1-Score",
    support = "Support"
  ) %>% 
  tab_options(
    table.width = pct(100)
    )
Table 12: Classification report RFC
Classification Report
Class weighting favoring recall
Metrics Precision Recall F1-Score Support
0 0.99 0.77 0.86 41,986.
1 0.33 0.95 0.49 5,031.
accuracy 0.79 0.79 0.79 0.79
macro avg 0.66 0.86 0.68 47,017.
weighted avg 0.92 0.79 0.82 47,017.

7.3 XGBoostClassifier

7.3.1 Pipeline

Code
xgb = config.make_xgb()

Check Listing 1 for more details.

7.3.2 Fit

Code
tic = time.time()
xgb_model = xgb.fit(xtrain, ytrain)
toc = time.time()
elapsed_time= toc - tic
print(f"Elapsed time: {elapsed_time} seconds")
print(f"Elapsed time: {elapsed_time / 60} minutes")

7.3.3 Pickle

Code
config.dump_pkl(
  path=os.path.expanduser(config.pkl_path) + "xgb_model.pkl",
  model=xgb_model
  )
Code
xgb = config.load_pkl(
  path=os.path.expanduser(config.pkl_path) + "xgb_model.pkl"
  )
xgb_model = xgb.best_estimator_

7.3.4 Classification

Code
ypred_xgb = xgb_model.predict(xtest)
cr_xgb = classification_report(
  ytest,
  ypred_xgb,
  # target_names=config.labels,
  digits=4,
  output_dict=True
  )
df_cr_xgb = pd.DataFrame.from_dict(cr_xgb).reset_index()
Code
library(reticulate)
df_cr <- py$df_cr_xgb %>% dplyr::rename(names = index)
cols <- df_cr %>% colnames()
df_cr %>% 
  pivot_longer(
    cols = -names,
    names_to = "metrics",
    values_to = "values"
  ) %>% 
  pivot_wider(
    names_from = names,
    values_from = values
  ) %>% 
  gt() %>%
  tab_header(
    title = "Confusion Matrix",
    subtitle = "Threshold optimization favoring recall"
  ) %>% 
  fmt_number(
    columns = c("precision", "recall", "f1-score", "support"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    drop_trailing_dec_mark = FALSE
  ) %>% 
  cols_align(
    align = "center",
    columns = c("precision", "recall", "f1-score", "support")
  ) %>% 
  cols_align(
    align = "left",
    columns = metrics
  ) %>% 
  cols_label(
    metrics = "Metrics",
    precision = "Precision",
    recall = "Recall",
    `f1-score` = "F1-Score",
    support = "Support"
  ) %>% 
  tab_options(
    table.width = pct(100)
    )
Table 13: Classification report XGB
Classification Report
Class weighting favoring recall
Metrics Precision Recall F1-Score Support
0 0.99 0.82 0.9 41,986.
1 0.39 0.95 0.55 5,031.
accuracy 0.83 0.83 0.83 0.83
macro avg 0.69 0.89 0.72 47,017.
weighted avg 0.93 0.83 0.86 47,017.

8 Models comparison

Let's check the performance of all the models with a plot. The ROC curve makes this possible: it plots the False Positive Rate (FP / (FP + TN)) against the True Positive Rate (TP / (TP + FN)) as the decision threshold varies, and the Area Under the Curve (AUC) summarizes it in a single number.

Code
from sklearn.metrics import roc_curve, roc_auc_score

plt.figure(figsize=(20, 15))

model_names = ["dtc", "rfc", "xgb"]
models_ = [dtc_model, rfc_model, xgb_model]
models = zip(model_names, models_)

# one ROC curve per model, with its AUC in the legend
for name, model in models:
    fpr, tpr, _ = roc_curve(ytest, model.predict_proba(xtest)[:, 1])
    auc_score = roc_auc_score(ytest, model.predict_proba(xtest)[:, 1])
    plt.plot(fpr, tpr, label=f'{name} (AUC: {auc_score:.2f})')

plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curves Comparison")
plt.tight_layout()
plt.legend(loc="lower right", fontsize=12)
plt.show()

Code
plt.close()

From the ROC AUC, the best model is the XGBoost one, because its Area Under the Curve is the highest.

9 Explainability

What if the model predicts a low creditworthiness and the business wants to know why? Shapley values make this possible. They come from cooperative game theory: for more details, check this link and this link.

For our purpose, I have found 3 observations we can use. It is also useful to separate the preprocessor and classifier parts of the pipeline.

Code
model = xgb_model
preprocessor = model.named_steps["preprocessor"]
clf = model.named_steps["xgb"]
idx = [15454, 1284, 30305]
Code
# retain feature names after preprocessing
xtrain_processed = preprocessor.transform(xtrain)
xtrain_processed = pd.DataFrame(xtrain_processed, columns=config.features)
xtest_processed = preprocessor.transform(xtest)
xtest_processed = pd.DataFrame(xtest_processed, columns=config.features)
clf.get_booster().feature_names = config.features
# convert to DMatrix
dtrain = DMatrix(xtrain_processed, label=ytrain)
dtest = DMatrix(xtest_processed, label=ytest)

x_processed = preprocessor.transform(x)
Code
from xgboost import plot_importance

# importance plot
plot_importance(
    booster=clf.get_booster(),
    grid=False,
    importance_type="gain",
    title="Feature Importance by Gain",
    values_format="{v:.2f}"
)
plt.tight_layout()
plt.show()

The importance plot tells us the information gain each feature gives to the model prediction. The 3 most important features are also the most logical ones:

  • Total income amount.
  • Number of children.
  • Type of housing.

Importance provides a score that indicates how useful each feature is in the construction of the boosted decision trees. The more an attribute is used to make key decisions within the trees, the higher its relative importance.
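
The same gain numbers can also be read directly from the booster, without the plot; a minimal sketch, assuming the fitted clf from above:

Code
# gain-based importance as plain numbers, sorted descending
gain = clf.get_booster().get_score(importance_type="gain")
for feat, score in sorted(gain.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feat}: {score:.2f}")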

Code
import shap

# NOTE: the explainer is not defined elsewhere in this notebook. Assuming a
# TreeExplainer over the fitted classifier, with SHAP values computed on the
# whole processed design matrix (this part ran on Kaggle, see the note at the
# end of the section).
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(x_processed)

# force plot
plt.figure(figsize=(10, 8))
plot_force = shap.plots.force(
    base_value=explainer.expected_value,
    shap_values=shap_values[idx[0], :],
    feature_names=config.features,
    plot_cmap='RdBu',
    matplotlib=True,
    show=True
)
plt.savefig("plot_force.jpeg", dpi=200)

# calculate a specific shap explainer for the waterfall plot
explainer_ = shap.Explainer(model=clf)
shap_values_ = explainer_(x_processed)
shap_values_.feature_names = config.features
# waterfall plot
shap.plots.waterfall(
    shap_values_[0],
    max_display=len(config.features)
)
plt.savefig("plot_waterfall.jpeg", dpi=200)
# bar plot
shap.plots.bar(
    shap_values_[0],
    max_display=len(config.features)
)
plt.savefig("plot_bar.jpeg", dpi=200)

Force plot

The force plot explains the contribution of features to a specific model prediction.

  • Base value is the model expected output or mean prediction over the dataset. It is the prediction the model would make without any specific input features. It is the starting point of the plot.

  • Feature Contributions. The SHAP values represent how much each feature contributes to the difference between the base value and the final prediction for a specific observation.

  • Positive contributions. Features pushing the prediction higher. Shown in red.

  • Negative contributions. Features pulling the prediction lower. Shown in blue. The magnitude of each bar indicates the size of the feature’s contribution.

  • Final Prediction. Endpoint of the plot. It is the sum of the base value and all the SHAP values for that observation (verified numerically in the sketch below).
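
This additivity can be checked numerically. A minimal sketch, assuming the shap_values_ Explanation object computed above; note that for XGBoost the SHAP values live in log-odds (margin) space, so the comparison uses the raw margin output:

Code
# base value + sum of contributions should reproduce the raw margin prediction
reconstructed = shap_values_.base_values[0] + shap_values_[0].values.sum()
margin = clf.predict(x_processed[:1], output_margin=True)
print(np.isclose(reconstructed, margin[0]))  # expect True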

Waterfall plot

Waterfall plots display explanations for individual predictions. The bottom of a waterfall plot starts at the expected value of the model output, and each row shows how the positive (red) or negative (blue) contribution of each feature moves the value from the expected model output over the background dataset to the model output for this prediction.

Important

Due to venv incompatibilities between shap and other packages, the shap evaluation was made on Kaggle. The plots are the output of that notebook.