Spam Detection

NLP
code
Natural Language Processing
Python, R
Author

Simone Brazzi

Published

September 5, 2024

1 Introduction

Tasks:

  • Train a model to classify spam.
  • Find the main topic in the spam emails.
  • Calculate semantic distance of the spam topics.
  • Extract the ORG entities from ham emails.

2 Import

2.1 R libraries

Code
library(tidyverse, verbose = FALSE)
library(ggplot2)
library(gt)
library(reticulate)
library(plotly)

# renv::use_python("/Users/simonebrazzi/venv/blog/bin/python3")

2.2 Python packages

Code
import pandas as pd
import numpy as np
import spacy
import nltk
import string
import gensim
import gensim.corpora as corpora
import gensim.downloader

from nltk.corpus import stopwords
from gensim.models import Word2Vec
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from collections import Counter
# glove vector
glove_vector = gensim.downloader.load("glove-wiki-gigaword-300")
# nltk stopwords
nltk.download('stopwords')
stop_words = stopwords.words('english')

2.3 Config class

Code
class Config():
  def __init__(self):
    
    """
    Class initializer.
    """
    
    self.path="/Users/simonebrazzi/R/blog/posts/spam_detection/spam_dataset.csv"
    self.random_state=42
  
  def get_lemmas(self, doc):
    
    """
    List comprehension to get lemmas. It performs:\n
      1. Lowercase.\n
      2. Stop words removal\n.
      3. Whitespaces removal.\n
      4. Remove the word 'subject'.\n
      5. Digits removal.\n
      6. Get only words with length >= 5.\n
    """
    
    lemmas = [
      [
        t.lemma_ for t in d 
        if (text := t.text.lower()) not in punctuation 
        and text not in stop_words 
        and not t.is_space 
        and text != "subject" 
        and not t.is_digit
        and len(text) >=5
        ]
        for d in doc
        ]
    return lemmas
  
  def get_entities(self, doc):
    
    """
    List comprehension to get the lemmas which have entity type 'ORG'.
    """
    
    entities = [
    [
      t.lemma_ for t in d 
      if (text := t.text.lower()) not in punctuation 
      and text not in stop_words 
      and not t.is_space 
      and text != "subject" 
      and not t.is_digit
      and len(text) >=5
      and t.ent_type_ == "ORG"
      ]
      for d in doc
      ]
    return entities
  
  def get_sklearn_mlp(self, activation, solver, max_iter, hidden_layer_sizes, tol):
    
    """
    Initialize the sklearn MLPClassifier.
    """
    
    mlp = MLPClassifier(
      activation=activation,
      solver=solver,
      max_iter=max_iter,
      hidden_layer_sizes=hidden_layer_sizes,
      tol=tol,
      verbose=True
      )
    return mlp
  
  def get_lda(self, corpus, id2word, num_topics, passes, workers, random_state):
    
    """
    Initialize LDA.
    """
    
    lda = gensim.models.LdaMulticore(
      corpus=corpus,
      id2word=id2word,
      num_topics=num_topics,
      passes=passes,
      workers=workers,
      random_state=random_state  # use the random_state passed by the caller
      )
    return lda
  
config = Config()

3 Dataset

Code
df = pd.read_csv(config.path, index_col=0)

To perform further analysis in R as well, we assign the Python df to an R variable using the reticulate library.

Code
library(reticulate)
df <- py$df

3.1 EDA

First things first, Exploratory Data Analysis. Considering we are going to perform a classification, it is interesting to check whether our dataset is unbalanced.

Table 1: Absolute and relative frequencies
Code
df_g <- df %>% 
  summarise(
    freq_abs = n(),
    freq_rel = n() / nrow(df),
    .by = label
    )

df_g %>% 
  gt() %>% 
  fmt_auto() %>% 
  cols_width(
    label ~ pct(20),
    freq_abs ~ pct(35),
    freq_rel ~ pct(35)
    ) %>% 
  cols_align(
    align = "center",
    columns = c(freq_abs, freq_rel)
  ) %>% 
  tab_header(
    title = "Label frequency",
    subtitle = "Absolute and relative frequencies"
  ) %>% 
  cols_label(
    label = "Label",
    freq_abs = "Absolute frequency",
    freq_rel = "Relative frequency"
  ) %>% 
  tab_options(
    table.width = pct(100)
    )

A visual cue is also useful.

Code
g <- ggplot(df_g) +
  geom_col(aes(x = freq_abs, y = label, fill = label)) +
  scale_fill_brewer(palette = "Set1", direction = -1) +
  theme_minimal() +
  ggtitle("Absolute frequency barchart") +
  xlab("Absolute frequency") +
  ylab("Label") +
  labs(fill = "Label")
ggplotly(g)

The dataset is unbalanced, which could be relevant when training the classification model. Depending on the model's performance, we know what to investigate first.

4 Preprocessing

With text, preprocessing is fundamental: the computer does not understand the semantic or grammatical meaning of words. We follow these steps:

  1. Lowercasing.
  2. Punctuation removal.
  3. Lemmatization.
  4. Tokenization.
  5. Stopwords removal.

Using spaCy, we can apply these steps. The nlp.pipe method improves performance and returns a generator: it yields Doc objects, not a list, so we wrap the result in list() to use it as one. To speed up the process, it is possible to enable multiprocessing in nlp.pipe. But what does the variable nlp stand for? It loads a spaCy model: we are going to use en_core_web_lg.

  • Language: EN; english.
  • Type: CORE; vocabulary, syntax, entities, vectors
  • Genre: WEB; written text (blogs, news, comments).
  • Size: LG; large (560 MB).

Check this link for the documentation about this model.

I chose this model, even though it is the largest, to get its full potential.

Code
# load en_core_web_lg
nlp = spacy.load("en_core_web_lg")
doc = list(nlp.pipe(df.text, n_process=4))

The specific preprocessing in this case covers these steps:

  1. Remove punctuation.
  2. Remove stop words.
  3. Remove spaces.
  4. Remove “subject” token.
  5. Lemmatization.

To improve code performance, these are the most noticeable points:

  • When iterating over a collection of unique elements, set() performs better than list(). The underlying hash table allows for swift membership checks. This is particularly noticeable as the df dimension increases.
  • List comprehensions, which perform better than for loops and are more readable in some contexts.
  • The walrus operator :=. It is a syntax which lets you assign variables in the middle of expressions. It avoids redundant calculations and improves readability, as shown in the sketch below.
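
As a minimal, self-contained sketch of the pattern used below (a hypothetical token list, not taken from the dataset), the walrus operator binds the lowercased text once and reuses it across the following filter conditions:

Code
import string

punctuation = set(string.punctuation)
tokens = ["Subject", "!", "Offer", "  ", "winner"]

# (text := t.lower()) assigns and tests in a single expression,
# so t.lower() is computed only once per token
kept = [
    t for t in tokens
    if (text := t.lower()) not in punctuation
    and text != "subject"
    and not t.isspace()
    and len(text) >= 5
]
# kept -> ['Offer', 'winner']
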
Code
punctuation = set(string.punctuation)
stop_words = set(stop_words)

lemmas = config.get_lemmas(doc)
entities = config.get_entities(doc)

# assignment to a df column
df["lemmas"] = lemmas
df["entities"] = entities

5 Tasks

5.1 Classification

The text is already preprocessed as a list of lemmas. For the classification task, it is necessary to convert it back into a single string.

Code
df["feature"] = df.lemmas.apply(lambda x : " ".join(x))

5.1.1 Features

As said, the machine does not understand human-readable text: it has to be transformed. A standard approach is to vectorize it with TfidfVectorizer(), a tool for converting text into a matrix of TF-IDF features. Term Frequency-Inverse Document Frequency is a statistical measure of the importance of a word in a document, part of a corpus, adjusted for how frequent the word is across the corpus. The vectorizer scores a word by multiplying its Term Frequency

\[ TF = \frac{word\ frequency\ in\ document}{total\ words\ in\ document} \] with the Inverse Document Frequency

\[ IDF = log(\frac{total\ number\ documents}{documents\ containing\ the\ word}) \] The final result is

\[ TF\text{-}IDF = TF \times IDF \]

The resulting score represents the importance of a word: it depends on the word frequency both in a specific document and in the whole corpus.

Note

An example can be useful. If a word t appears 20 times in a document of 100 words, we have

\[ TF = \frac{20}{100}=0.2 \]

If there are 10,000 documents in the corpus and 100 documents contain the term t

\[ IDF = log(\frac{10000}{100})=2 \]

This means the score is

\[ TF\text{-}IDF = 0.2 \times 2 = 0.4 \]
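
A quick check of the arithmetic above, in plain Python with a base-10 logarithm as in the example (note that sklearn's TfidfVectorizer uses a smoothed natural-log variant plus L2 normalization, so its scores differ slightly):

Code
import math

tf = 20 / 100                     # word frequency in document / total words in document
idf = math.log10(10_000 / 100)    # log(total documents / documents containing the word)
tf_idf = tf * idf
print(tf, idf, tf_idf)            # 0.2 2.0 0.4
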

Code
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['feature'])
y = df.label

5.2 Split

Not much to say about this: a best practice which lets us evaluate the performance of our model on unseen data.

Code
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=.2, random_state=42)

5.2.1 Model

The model is the MLPClassifier(), a Multi-Layer Perceptron classifier.

MLPClassifier

It is an Artificial Neural Network used for classification. It consists of multiple layers of nodes, called perceptrons. For further reading, see the documentation.
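
As a rough sketch of what a single hidden layer computes, here is a toy NumPy forward pass with made-up shapes and random weights (an illustration of the logistic activation we configure below, not sklearn's internals):

Code
import numpy as np

def logistic(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(42)
x = rng.random((1, 4))                        # one sample with 4 toy features
W1, b1 = rng.random((4, 3)), rng.random(3)    # hidden layer with 3 units
W2, b2 = rng.random((3, 1)), rng.random(1)    # output layer with 1 unit

h = logistic(x @ W1 + b1)                     # hidden activations (logistic activation)
p = logistic(h @ W2 + b2)                     # probability-like output for binary classification
print(p.shape, p.item())
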

Code
mlp = config.get_sklearn_mlp(
  activation="logistic",
  solver="adam",
  max_iter=100,
  hidden_layer_sizes=(100,),
  tol=.005
  )

5.2.2 Fit

Code
mlp.fit(xtrain, ytrain)

5.2.3 Predict

Code
ypred = mlp.predict(xtest)

5.2.4 Classification report

Considering we are doing a classification, one method to evaluate the performance is the classification report. It summarizes the performance of the model by comparing true and predicted labels, showing not only the metrics (precision, recall and F1-score) but also the support.
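
As a reminder of what these metrics mean, a quick computation on toy counts (not results from this model):

Code
# toy confusion counts for one class
tp, fp, fn = 90, 10, 5

precision = tp / (tp + fp)                           # share of predicted positives that are correct
recall = tp / (tp + fn)                              # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.9 0.9474 0.9231
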

Code
cr = classification_report(
  ytest,
  ypred,
  target_names=["ham", "spam"],  # display names must follow the sorted label order
  digits=4,
  output_dict=True
  )
df_cr = pd.DataFrame.from_dict(cr).reset_index()
Code
library(reticulate)

df_cr <- py$df_cr %>% dplyr::rename(names = index)
cols <- df_cr %>% colnames()
df_cr %>% 
  pivot_longer(
    cols = -names,
    names_to = "metrics",
    values_to = "values"
  ) %>% 
  pivot_wider(
    names_from = names,
    values_from = values
  ) %>% 
  gt() %>%
  tab_header(
    title = "Confusion Matrix",
    subtitle = "Sklearn MLPClassifier"
  ) %>% 
  fmt_number(
    columns = c("precision", "recall", "f1-score", "support"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    drop_trailing_dec_mark = FALSE
  ) %>% 
  cols_align(
    align = "center",
    columns = c("precision", "recall", "f1-score", "support")
  ) %>% 
  cols_align(
    align = "left",
    columns = metrics
  ) %>% 
  cols_label(
    metrics = "Metrics",
    precision = "Precision",
    recall = "Recall",
    `f1-score` = "F1-Score",
    support = "Support"
  )

Even though the model is not tuned for an unbalanced dataset, this does not affect the performance. Precision and recall are so high that the model could even seem overfitted.

5.3 Topic Modeling for spam content

Topic modeling in NLP can rely on Latent Dirichlet Allocation (LDA). It is a generative model used to discover the topics that occur in a set of documents.

The LDA model has:

  • Input: a corpus of text documents, preprocessed as tokenized and cleaned words. We have this in the lemmas column.
  • Output: a distribution of topics for each document and one of words for each topic.

For further reading, you can find the paper in the Table of Contents or at this link.

5.3.1 Dataset

Filter the data to get a spam-only dataframe and create a variable with the lemmas column to work with.

Code
spam = df[df.label == "spam"]
x = spam.lemmas

5.3.2 Create corpus

The LDA algorithm needs the corpus as a bag of words.

Code
id2word = corpora.Dictionary(x)
corpus = [id2word.doc2bow(t) for t in x]
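
To see what the bag-of-words representation looks like, each document becomes a list of (token id, count) pairs, and the ids can be mapped back to words through id2word. A quick inspection, assuming the spam corpus built above is non-empty:

Code
first_pairs = corpus[0][:5]                               # first 5 (token id, count) pairs of the first document
print(first_pairs)
print([(id2word[i], count) for i, count in first_pairs])  # same pairs with the ids resolved to words
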

5.3.3 Model

The model returns a user-defined number of topics. For each of them, it returns a user-defined number of words and the probability of each.

Code
lda = config.get_lda(
  corpus=corpus,
  id2word=id2word,
  num_topics=10,
  passes=10, # number of times the algorithm sees the corpus
  workers=4, # parallelize across cores
  random_state=42
  )
topic_words = lda.show_topics(num_topics=10, num_words=5, formatted=False)

# iterate over topic_words to extract (topic, word, probability) tuples
data = [
    (topic, w, p)
    for topic, words in topic_words
    for w, p in words
    ]
topics_df = pd.DataFrame(data, columns=['topic', 'word', 'proba'])
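
The per-document side of the LDA output mentioned earlier, the topic distribution of a single document, can be inspected with get_document_topics. A quick check on the first spam email, assuming lda and corpus as defined above:

Code
# topic mixture of the first spam document: (topic id, probability) pairs
doc_topics = lda.get_document_topics(corpus[0], minimum_probability=0.05)
print(sorted(doc_topics, key=lambda tp: -tp[1]))
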
Table 2: Topics id, words and probabilities
Code
py$topics_df %>% 
  gt() %>% 
  tab_header(
    title = "Words and probabilities by topics"
  ) %>% 
  fmt_auto() %>% 
  cols_width(
    topic ~ pct(33),
    word ~ pct(33),
    proba ~ pct(33)
    ) %>% 
  cols_align(
    align = "center",
    columns = c(topic, word, proba)
  ) %>% 
  cols_label(
    topic = "Topic",
    word = "Word",
    proba = "Probability"
  ) %>% 
  tab_options(
    table.width = pct(100)
  )

5.4 Semantic distance between topics

The semantic distance requires documents made of strings. Using a dict, we can extract the topics and their words: the dict has the topic ids as keys and the words as a list. The documents can then be created with .join().

Code
topics_dict = {}
for topics, words in topic_words:
  topics_dict[topics] = [w[0] for w in words]

documents = [" ".join(words) for words in topics_dict.values()]
Code
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
cosine_sim_matrix = cosine_similarity(tfidf_matrix, dense_output=True)
topics = list(topics_dict.keys())
cosine_sim_df = pd.DataFrame(cosine_sim_matrix, index=topics, columns=topics)
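
As a sanity check on the matrix above: cosine similarity is the dot product of two L2-normalized vectors, and it equals 1 minus scipy's cosine distance. Comparing the first two topic documents, assuming tfidf_matrix and cosine_sim_matrix as defined above:

Code
from scipy.spatial.distance import cosine

u = tfidf_matrix[0].toarray().ravel()   # densify the first two TF-IDF rows
v = tfidf_matrix[1].toarray().ravel()
sim = 1 - cosine(u, v)                  # scipy's cosine() returns a distance, not a similarity
print(round(sim, 4), round(cosine_sim_matrix[0, 1], 4))  # the two values should match
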
Table 3: Cosine similarity matrix
Code
py$cosine_sim_df %>% 
  gt() %>% 
  fmt_auto() %>% 
  cols_align(
    align = "center",
    columns = c(`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`)
  ) %>% 
  cols_label(
    `0` = "Topic 0",
    `1` = "Topic 1",
    `2` = "Topic 2",
    `3` = "Topic 3",
    `4` = "Topic 4",
    `5` = "Topic 5",
    `6` = "Topic 6",
    `7` = "Topic 7",
    `8` = "Topic 8",
    `9` = "Topic 9",
  ) %>% 
  tab_options(
    table.width = pct(100)
    )

5.5 Organizations in “HAM” emails

5.5.1 Create “HAM” df

5.5.2 Get ham lemmas which have the ORG entity type

Code
ham = df[df.label == "ham"]
x = ham.entities
Code
from collections import Counter

# Flatten the list of lists and create a Counter object
d = Counter([i for e in x for i in e])

freq_df = pd.DataFrame(d.items(), columns=["word", "freq"])
Code
library("wordcloud")
library("RColorBrewer")

set.seed(42)
word_freqs_df <- py$freq_df

wordcloud(
  words = word_freqs_df$word,
  freq = word_freqs_df$freq,
  min.freq = 1,
  max.words = 100, # nrow(word_freqs_df),
  random.order = FALSE,
  rot.per = .3,
  colors = brewer.pal(n = 8, name = "Dark2")
  )
Table 4: Top 10 words by frequency
Code
word_freqs_df %>% 
  arrange(desc(freq)) %>% 
  head(10) %>% 
  gt() %>%                                        
  tab_header(
    title = "Top 10 ham words ",
    subtitle = "by frequency"
  ) %>% 
  fmt_auto() %>% 
  cols_width(
    word ~ pct(50),
    freq ~ pct(50)
    ) %>% 
  cols_align(
    align = "center",
    columns = c(word, freq)
  ) %>% 
  cols_label(
    word = "Word",
    freq = "Frequency"
  ) %>% 
  tab_options(
    table.width = pct(100)
  )