1 Introduction
Tasks:
- Train a model to classify spam.
- Find the main topic in the spam emails.
- Calculate semantic distance of the spam topics.
- Extract the ORG entities from ham emails.
2 Import
2.1 R libraries
2.2 Python packages
Code
import pandas as pd
import numpy as np
import spacy
import nltk
import string
import gensim
import gensim.corpora as corpora
import gensim.downloader
from nltk.corpus import stopwords
from gensim.models import Word2Vec
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from collections import Counter
# glove vector
glove_vector = gensim.downloader.load("glove-wiki-gigaword-300")
# nltk stopwords
nltk.download('stopwords')
stop_words = stopwords.words('english')
2.3 Config class
Code
class Config():
    def __init__(self):
        """
        Class initialization function.
        """
        self.path="/Users/simonebrazzi/R/blog/posts/spam_detection/spam_dataset.csv"
        self.random_state=42

    def get_lemmas(self, doc):
        """
        List comprehension to get lemmas. It performs:\n
        1. Lowercase.\n
        2. Punctuation removal.\n
        3. Stop words removal.\n
        4. Whitespaces removal.\n
        5. Remove the word 'subject'.\n
        6. Digits removal.\n
        7. Get only words with length >= 5.\n
        """
        lemmas = [
            [
                t.lemma_ for t in d
                # string.punctuation comes from the `string` module imported above
                if (text := t.text.lower()) not in string.punctuation
                and text not in stop_words
                and not t.is_space
                and text != "subject"
                and not t.is_digit
                and len(text) >= 5
            ]
            for d in doc
        ]
        return lemmas

    def get_entities(self, doc):
        """
        List comprehension to get the lemmas which have entity type 'ORG'.
        """
        entities = [
            [
                t.lemma_ for t in d
                if (text := t.text.lower()) not in string.punctuation
                and text not in stop_words
                and not t.is_space
                and text != "subject"
                and not t.is_digit
                and len(text) >= 5
                and t.ent_type_ == "ORG"
            ]
            for d in doc
        ]
        return entities

    def get_sklearn_mlp(self, activation, solver, max_iter, hidden_layer_sizes, tol):
        """
        It initializes the sklearn MLPClassifier.
        """
        mlp = MLPClassifier(
            activation=activation,
            solver=solver,
            max_iter=max_iter,
            hidden_layer_sizes=hidden_layer_sizes,
            tol=tol,
            verbose=True
        )
        return mlp

    def get_lda(self, corpus, id2word, num_topics, passes, workers, random_state):
        """
        Initialize LDA.
        """
        lda = gensim.models.LdaMulticore(
            corpus=corpus,
            id2word=id2word,
            num_topics=num_topics,
            passes=passes,
            workers=workers,
            # use the seed passed by the caller instead of silently ignoring it
            random_state=random_state
        )
        return lda

config = Config()
3 Dataset
To continue the analysis in R as well, here we assign the df to a variable using the reticulate library.
3.1 EDA
First things first: Exploratory Data Analysis. Since we are going to perform a classification, it is interesting to check whether our dataset is unbalanced.
Code
df_g <- df %>%
  summarise(
    freq_abs = n(),
    freq_rel = n() / nrow(df),
    .by = label
  )
df_g %>%
  gt() %>%
  fmt_auto() %>%
  cols_width(
    label ~ pct(20),
    freq_abs ~ pct(35),
    freq_rel ~ pct(35)
  ) %>%
  cols_align(
    align = "center",
    columns = c(freq_abs, freq_rel)
  ) %>%
  tab_header(
    title = "Label frequency",
    subtitle = "Absolute and relative frequencies"
  ) %>%
  cols_label(
    label = "Label",
    freq_abs = "Absolute frequency",
    freq_rel = "Relative frequency"
  ) %>%
  tab_options(
    table.width = pct(100)
  )
A visual cue is also useful.
The dataset is not balanced, which could be relevant when training the classification model. Depending on the model performance, we know what we could investigate first.
4 Preprocessing
In the case of text, preprocessing is fundamental: the computer does not understand the semantic or grammatical meaning of words. Text preprocessing usually follows these steps:
- Lowercasing.
- Punctuation removal.
- Lemmatization.
- Tokenization.
- Stopwords removal.
Using spaCy, we can apply these steps. The method nlp.pipe improves performance and returns a generator: it yields Doc objects rather than a list, so it has to be wrapped in list() to be used as one. To speed up the process, it is possible to enable multiprocessing in nlp.pipe (the n_process argument). But what does the variable nlp stand for? It holds a loaded spaCy model: we are going to use en_core_web_lg.
- Language: EN; English.
- Type: CORE; vocabulary, syntax, entities, vectors.
- Genre: WEB; written text (blogs, news, comments).
- Size: LG; large (560 MB).
Check this link for the documentation about this model.
I chose this model, even though it is the biggest, to get its full potential.
The specific preprocessing in this case should cover these steps:
- Remove punctuation.
- Remove stop words.
- Remove spaces.
- Remove “subject” token.
- Lemmatization.
To improve code performance, these are the most noticeable points:
- When testing membership in a collection of unique elements, set() performs better than list(). The underlying hash table structure allows for fast lookups. This is particularly noticeable when the df dimension increases.
- List comprehensions, which perform better than for loops and are much more readable in some contexts.
- The walrus operator :=. It is a syntax which lets you assign variables in the middle of expressions. It avoids redundant calculations and improves readability.
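The three points above can be seen together in a small, dependency-free sketch. It is only an illustration: plain strings stand in for spaCy tokens, the stop-word set is a tiny hypothetical one, and it returns lowercased text instead of lemmas.

```python
import string

# Hypothetical stand-ins for the spaCy tokens and NLTK stop words used in the post.
PUNCTUATION = set(string.punctuation)          # set: fast membership test
STOP_WORDS = {"the", "a", "to", "is", "your"}  # tiny illustrative stop-word set


def clean_tokens(tokens):
    """Lowercase each token once (walrus operator) and keep the informative ones."""
    return [
        text
        for tok in tokens
        if (text := tok.lower()) not in PUNCTUATION
        and text not in STOP_WORDS
        and not text.isspace()
        and text != "subject"
        and not text.isdigit()
        and len(text) >= 5
    ]


docs = [
    ["Subject", ":", "Claim", "your", "FREE", "prize", "today", "!"],
    ["The", "meeting", "is", "moved", "to", "2024"],
]
cleaned = [clean_tokens(doc) for doc in docs]
print(cleaned)
```

The walrus operator binds text once per token, so the lowercase conversion is not repeated in every condition of the filter.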
5 Tasks
5.1 Classification
The text is already preprocessed as a list of lemmas. For the classification task, each list has to be converted to a string.
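A minimal sketch of that conversion, assuming each document is a list of lemmas; the sample data is made up for illustration.

```python
# Hypothetical preprocessed documents: one list of lemmas per email.
lemmas = [
    ["claim", "prize", "today"],
    ["meeting", "moved"],
]

# Join each list into a single whitespace-separated string, one per document.
texts = [" ".join(doc) for doc in lemmas]
print(texts)
```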
5.1.1 Features
As said, the machine does not understand human-readable text, so it has to be transformed. A well-established approach is to vectorize it with TfidfVectorizer(), a tool for converting text into a matrix of TF-IDF features. Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure of the importance of a word in a document that is part of a corpus, adjusted for the frequency of the word in the corpus. The score of a word is obtained by multiplying its Term Frequency
\[ TF = \frac{\text{word frequency in document}}{\text{total words in document}} \]
with its Inverse Document Frequency
\[ IDF = \log\left(\frac{\text{total number of documents}}{\text{documents containing the word}}\right) \]
The final result is
\[ \text{TF-IDF} = TF \times IDF \]
The resulting score represents the importance of a word. It depends on the word frequency both in a specific document and in the corpus.
An example can be useful. If a word t appears 20 times in a document of 100 words, we have
\[ TF = \frac{20}{100} = 0.2 \]
If there are 10,000 documents in the corpus and 100 of them contain the term t, then, taking the logarithm in base 10,
\[ IDF = \log_{10}\left(\frac{10000}{100}\right) = 2 \]
This means the score is
\[ \text{TF-IDF} = 0.2 \times 2 = 0.4 \]
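The arithmetic of the worked example can be checked with a few lines of plain Python. This is a sketch of the textbook formula with a base-10 logarithm, not of what TfidfVectorizer() computes internally; the function names are illustrative.

```python
import math


def tf(term_count, doc_length):
    """Term frequency: share of the document taken up by the term."""
    return term_count / doc_length


def idf(n_documents, n_documents_with_term):
    """Inverse document frequency with a base-10 logarithm, as in the example."""
    return math.log10(n_documents / n_documents_with_term)


tf_value = tf(20, 100)         # word appears 20 times in a 100-word document
idf_value = idf(10_000, 100)   # 100 of 10,000 documents contain the word
tfidf_value = tf_value * idf_value
print(tf_value, idf_value, tfidf_value)
```

Note that scikit-learn's TfidfVectorizer, with default settings, uses raw counts for TF, a smoothed natural-log IDF, and L2 normalization of each row, so its scores will not match these hand-computed numbers.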
5.2 Split
Not much to say about this: splitting the data into training and test sets is a best practice which lets us evaluate the performance of our model on new data.
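The post imports train_test_split from scikit-learn for this step. To show the idea without any dependency, here is a hedged sketch of a shuffled hold-out split. The 80/20 ratio and the per-class (stratified) sampling are assumptions made for illustration; only the seed 42 comes from the Config class.

```python
import random


def holdout_split(texts, labels, test_size=0.2, seed=42):
    """Shuffle indices within each class and hold out `test_size` of each class."""
    rng = random.Random(seed)
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    def pick(indices):
        return [texts[i] for i in indices], [labels[i] for i in indices]

    return pick(train_idx), pick(test_idx)


# Made-up, unbalanced data: 7 ham and 3 spam emails.
texts = [f"email {i}" for i in range(10)]
labels = ["ham"] * 7 + ["spam"] * 3
(x_train, y_train), (x_test, y_test) = holdout_split(texts, labels)
print(len(x_train), len(x_test), y_test)
```

Sampling within each class keeps both labels represented in the test set, which matters when the dataset is unbalanced.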
5.2.1 Model
The model is the MLPClassifier(), a Multi-Layer Perceptron classifier.
It is an Artificial Neural Network used for classification. It consists of multiple layers of nodes, called perceptrons. For further reading, see the documentation.
5.2.2 Fit
5.2.3 Predict
5.2.4 Classification report
Since we are doing a classification, one way to evaluate the performance is the classification report. It summarizes the performance of the model by comparing true and predicted labels, showing not only the metrics (precision, recall and F1-score) but also the support, i.e. the number of true instances of each class.
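To make the metrics concrete, here is a small dependency-free sketch that computes them per class from two label lists. It is not scikit-learn's implementation; the function name and the labels are made up for illustration.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} from two label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed an instance of the true class

    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        support = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1, support)
    return report


# Made-up labels, only to exercise the function.
y_true = ["ham", "ham", "ham", "spam", "spam", "ham"]
y_pred = ["ham", "ham", "spam", "spam", "ham", "ham"]
report = per_class_report(y_true, y_pred)
for label, (p, r, f1, n) in report.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={n}")
```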
Code
library(reticulate)
df_cr <- py$df_cr %>% dplyr::rename(names = index)
cols <- df_cr %>% colnames()
df_cr %>%
  pivot_longer(
    cols = -names,
    names_to = "metrics",
    values_to = "values"
  ) %>%
  pivot_wider(
    names_from = names,
    values_from = values
  ) %>%
  gt() %>%
  tab_header(
    title = "Classification Report",
    subtitle = "Sklearn MLPClassifier"
  ) %>%
  fmt_number(
    columns = c("precision", "recall", "f1-score", "support"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    drop_trailing_dec_mark = FALSE
  ) %>%
  cols_align(
    align = "center",
    columns = c("precision", "recall", "f1-score", "support")
  ) %>%
  cols_align(
    align = "left",
    columns = metrics
  ) %>%
  cols_label(
    metrics = "Metrics",
    precision = "Precision",
    recall = "Recall",
    `f1-score` = "F1-Score",
    support = "Support"
  )
Even though the model is not tuned for an unbalanced dataset, the imbalance does not seem to hurt performance. Precision and recall are high, so much so that the model could seem to be overfitted.
5.3 Topic Modeling for spam content
Topic modeling in NLP can rely on Latent Dirichlet Allocation (LDA). It is a generative model used to discover the topics which occur in a set of documents.
The LDA model has:
- Input: a corpus of text documents, preprocessed as tokenized and cleaned words. We have this in the lemmas column.
- Output: a distribution of topics for each document and one of words for each topic.
For further reading, you can find the paper in the Table of Contents or at this link.
5.3.1 Dataset
Filter data to have a spam dataframe and create a variable with the lemmas column to work with.
5.3.2 Create corpus
The LDA algorithm needs the corpus as a bag of words.
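In gensim this is typically done with corpora.Dictionary and its doc2bow method, which turns each document into a list of (token_id, count) pairs. The sketch below reproduces the idea in plain Python so the structure is visible; the helper names and sample lemmas are hypothetical, and the ids will not match the ones gensim assigns.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in documents:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(doc, token2id):
    """Represent one document as sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


# Made-up spam lemmas, for illustration only.
spam_lemmas = [
    ["prize", "claim", "prize"],
    ["claim", "money", "today"],
]
token2id = build_vocabulary(spam_lemmas)
corpus = [doc_to_bow(doc, token2id) for doc in spam_lemmas]
print(token2id)
print(corpus)
```

Word order is discarded: each document keeps only which words occur and how often, which is the input LDA expects.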
5.3.3 Model
The model will return a user-defined number of topics. For each of them, it will return a user-defined number of words and the probability of each.
Code
lda = config.get_lda(
    corpus=corpus,
    id2word=id2word,
    num_topics=10,
    passes=10, # number of times the algorithm sees the corpus
    workers=4, # parallelize across worker processes
    random_state=42
)
topic_words = lda.show_topics(num_topics=10, num_words=5, formatted=False)
# Flatten topic_words into (topic, word, probability) rows
data = [
    (topic, w, p)
    for topic, words in topic_words
    for w, p in words
]
topics_df = pd.DataFrame(data, columns=['topic', 'word', 'proba'])
Code
py$topics_df %>%
  gt() %>%
  tab_header(
    title = "Words and probabilities by topics"
  ) %>%
  fmt_auto() %>%
  cols_width(
    topic ~ pct(33),
    word ~ pct(33),
    proba ~ pct(33)
  ) %>%
  cols_align(
    align = "center",
    columns = c(topic, word, proba)
  ) %>%
  cols_label(
    topic = "Topic",
    word = "Word",
    proba = "Probability"
  ) %>%
  tab_options(
    table.width = pct(100)
  )
5.4 Semantic distance between topics
The semantic distance requires documents made of strings. Using a dict, we can extract the topics and the words: the dict has the topics as keys and, as values, the list of words of each topic. The documents can then be created using .join().
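The steps described here can be sketched as follows. The topics are made up, and raw word counts stand in as the vector representation. That is an assumption made to keep the example dependency-free: count vectors only measure word overlap, while the imports at the top of the post (GloVe vectors, TfidfVectorizer, cosine_similarity) suggest a richer representation was used.

```python
import math
from collections import Counter

# Hypothetical topics: topic id -> top words (the post gets these from LDA).
topics = {
    0: ["money", "offer", "click"],
    1: ["money", "account", "click"],
    2: ["pills", "health", "order"],
}

# One "document" per topic, built with str.join.
documents = {topic: " ".join(words) for topic, words in topics.items()}


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two strings using raw word counts."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Pairwise similarity matrix as a dict keyed by (topic_i, topic_j).
similarity = {
    (i, j): cosine_similarity(documents[i], documents[j])
    for i in documents
    for j in documents
}
print(round(similarity[(0, 1)], 3), similarity[(0, 2)])
```

The cosine distance is one minus the similarity, so topics sharing no words end up at distance 1.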
Code
py$cosine_sim_df %>%
  gt() %>%
  fmt_auto() %>%
  cols_align(
    align = "center",
    columns = c(`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`)
  ) %>%
  cols_label(
    `0` = "Topic 0",
    `1` = "Topic 1",
    `2` = "Topic 2",
    `3` = "Topic 3",
    `4` = "Topic 4",
    `5` = "Topic 5",
    `6` = "Topic 6",
    `7` = "Topic 7",
    `8` = "Topic 8",
    `9` = "Topic 9"
  ) %>%
  tab_options(
    table.width = pct(100)
  )
5.5 Organization of “HAM” mails
5.5.1 Create “HAM” df
5.5.2 Get ham lemmas which have ORG entity
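A minimal sketch of how the word frequencies could be counted once the ORG lemmas are extracted, for example with config.get_entities. The nested list below is hypothetical sample data, and the organization names are invented.

```python
from collections import Counter

# Hypothetical output of the ORG extraction step: one list per ham email.
ham_entities = [
    ["examplecorp", "examplecorp"],
    [],
    ["samplebank", "examplecorp"],
]

# Flatten the nested lists and count how often each lemma appears.
word_freqs = Counter(lemma for doc in ham_entities for lemma in doc)

# (word, freq) pairs, most frequent first, ready to become a two-column table.
top_words = word_freqs.most_common(10)
print(top_words)
```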
Code
word_freqs_df %>%
  arrange(desc(freq)) %>%
  head(10) %>%
  gt() %>%
  tab_header(
    title = "Top 10 ham words",
    subtitle = "by frequency"
  ) %>%
  fmt_auto() %>%
  cols_width(
    word ~ pct(50),
    freq ~ pct(50)
  ) %>%
  cols_align(
    align = "center",
    columns = c(word, freq)
  ) %>%
  cols_label(
    word = "Word",
    freq = "Frequency"
  ) %>%
  tab_options(
    table.width = pct(100)
  )