1 Introduction
Build a model that can filter user comments based on the degree of language maliciousness:
- Preprocess the text by removing tokens that make no significant contribution at the semantic level.
- Transform the text corpus into sequences.
- Build a Deep Learning model including recurrent layers for a multilabel classification task.
- At prediction time, the model should return a vector containing a 1 or a 0 for each label in the dataset (toxic, severe_toxic, obscene, threat, insult, identity_hate). A non-harmful comment is thus classified by a vector of only 0s [0,0,0,0,0,0], while a dangerous comment exhibits at least one 1 among the 6 labels.
2 Setup
Leveraging Quarto and RStudio, I will set up a combined R and Python environment.
2.1 Import R libraries
Import the R libraries. These will be used both for rendering the document and for data analysis, since I prefer ggplot2 over matplotlib. I will also use colorblind-safe palettes.
2.2 Import Python packages
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import keras
import keras_nlp
from keras.backend import clear_session
from keras.models import Model, load_model
from keras.layers import TextVectorization, Input, Dense, Embedding, Dropout, GlobalAveragePooling1D, LSTM, Bidirectional, GlobalMaxPool1D, Flatten, Attention, LayerNormalization
from keras.metrics import Precision, Recall, AUC, SensitivityAtSpecificity, SpecificityAtSensitivity, F1Score
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import multilabel_confusion_matrix, classification_report, ConfusionMatrixDisplay, precision_recall_curve, f1_score, recall_score, roc_auc_score
Num GPUs Available: 1
Create a Config class to store all the useful parameters for the model and for the project.
2.3 Class Config
I created a class with all the basic configuration of the model, to improve readability.
Code
class Config():
    def __init__(self):
        self.url = "https://s3.eu-west-3.amazonaws.com/profession.ai/datasets/Filter_Toxic_Comments_dataset.csv"
        self.max_tokens = 20000
        self.output_sequence_length = 911  # check the analysis done to establish this value
        self.embedding_dim = 128
        self.batch_size = 32
        self.epochs = 100
        self.temp_split = 0.3
        self.test_split = 0.5
        self.random_state = 42
        self.total_samples = 159571  # total samples in the dataset
        self.train_samples = 111699
        self.val_samples = 23936
        self.features = 'comment_text'
        self.labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
        self.new_labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate', "clean"]
        self.label_mapping = {label: i for i, label in enumerate(self.labels)}
        self.new_label_mapping = {label: i for i, label in enumerate(self.new_labels)}
        self.path = "/Users/simonebrazzi/R/blog/posts/toxic_comment_filter/history/f1score/"
        self.model = self.path + "model_f1.keras"
        self.checkpoint = self.path + "checkpoint.lstm_model_f1.keras"
        self.history = self.path + "lstm_model_f1.xlsx"
        self.metrics = [
            Precision(name='precision'),
            Recall(name='recall'),
            AUC(name='auc', multi_label=True, num_labels=len(self.labels)),
            F1Score(name="f1", average="macro")
        ]

    def get_early_stopping(self):
        early_stopping = keras.callbacks.EarlyStopping(
            monitor="val_f1",  # "val_recall"
            min_delta=0.2,
            patience=10,
            verbose=0,
            mode="max",
            restore_best_weights=True,
            start_from_epoch=3
        )
        return early_stopping

    def get_model_checkpoint(self, filepath):
        model_checkpoint = keras.callbacks.ModelCheckpoint(
            filepath=filepath,
            monitor="val_f1",  # "val_recall"
            verbose=0,
            save_best_only=True,
            save_weights_only=False,
            mode="max",
            save_freq="epoch"
        )
        return model_checkpoint

    def find_optimal_threshold_cv(self, ytrue, yproba, metric, thresholds=np.arange(.05, .35, .05), n_splits=7):
        # instantiate KFold
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=self.random_state)
        threshold_scores = []
        for threshold in thresholds:
            cv_scores = []
            for train_index, val_index in kf.split(ytrue):
                ytrue_val = ytrue[val_index]
                yproba_val = yproba[val_index]
                ypred_val = (yproba_val >= threshold).astype(int)
                score = metric(ytrue_val, ypred_val, average="macro")
                cv_scores.append(score)
            mean_score = np.mean(cv_scores)
            threshold_scores.append((threshold, mean_score))
        # find the threshold with the highest mean score
        best_threshold, best_score = max(threshold_scores, key=lambda x: x[1])
        return best_threshold, best_score

config = Config()
3 Data
The dataset is retrieved with tf.keras.utils.get_file, which downloads the file from the URL. N.B. For reproducibility purposes, I also downloaded the dataset locally, since the link was unavailable for a while.
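A minimal sketch of the loading step, assuming the CSV is read straight from config.url:

Code
# download (and cache) the dataset, then load it into a DataFrame
file = tf.keras.utils.get_file("Filter_Toxic_Comments_dataset.csv", config.url)
df = pd.read_csv(file)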
Code
library(reticulate)
py$df %>%
tibble() %>%
head(5) %>%
gt() %>%
tab_header(
title = "First five observations"
) %>%
cols_align(
align = "center",
columns = c("toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate", "sum_injurious")
) %>%
cols_align(
align = "left",
columns = comment_text
) %>%
cols_label(
comment_text = "Comments",
toxic = "Toxic",
severe_toxic = "Severe Toxic",
obscene = "Obscene",
threat = "Threat",
insult = "Insult",
identity_hate = "Identity Hate",
sum_injurious = "Sum Injurious"
)
First five observations

| Comments | Toxic | Severe Toxic | Obscene | Threat | Insult | Identity Hate | Sum Injurious |
|---|---|---|---|---|---|---|---|
| Explanation Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC) | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info. | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| " More I can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of ""types of accidents"" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know. There appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport " | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| You, sir, are my hero. Any chance you remember what page that's on? | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Let's create a clean variable for EDA purposes: I want to see visually how many observations are clean compared to the other labels.
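A sketch of how these helper columns can be built (the column names sum_injurious and clean are the ones that appear in the tables of this post):

Code
# sum of the six label columns per comment; 0 active labels means the comment is clean
df["sum_injurious"] = df[config.labels].sum(axis=1)
df["clean"] = (df["sum_injurious"] == 0).astype(int)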
3.1 EDA
First, a check on the dataset for possible missing values and class imbalance.
3.1.1 Frequency
Code
library(reticulate)
df_r <- py$df
new_labels_r <- py$config$new_labels
df_r_grouped <- df_r %>%
select(all_of(new_labels_r)) %>%
pivot_longer(
cols = all_of(new_labels_r),
names_to = "label",
values_to = "value"
) %>%
group_by(label) %>%
summarise(count = sum(value)) %>%
mutate(freq = round(count / sum(count), 4))
df_r_grouped %>%
gt() %>%
tab_header(
title = "Labels frequency",
subtitle = "Absolute and relative frequency"
) %>%
fmt_number(
columns = "count",
drop_trailing_zeros = TRUE,
drop_trailing_dec_mark = TRUE,
use_seps = TRUE
) %>%
fmt_percent(
columns = "freq",
decimals = 2,
drop_trailing_zeros = TRUE,
drop_trailing_dec_mark = FALSE
) %>%
cols_align(
align = "center",
columns = c("count", "freq")
) %>%
cols_align(
align = "left",
columns = label
) %>%
cols_label(
label = "Label",
count = "Absolute Frequency",
freq = "Relative frequency"
)
Labels frequency
Absolute and relative frequency

| Label | Absolute Frequency | Relative frequency |
|---|---|---|
| clean | 143,346 | 80.33% |
| identity_hate | 1,405 | 0.79% |
| insult | 7,877 | 4.41% |
| obscene | 8,449 | 4.73% |
| severe_toxic | 1,595 | 0.89% |
| threat | 478 | 0.27% |
| toxic | 15,294 | 8.57% |
3.1.2 Barchart
Code
library(reticulate)
barchart <- df_r_grouped %>%
ggplot(aes(x = reorder(label, count), y = count, fill = label)) +
geom_col() +
labs(
x = "Labels",
y = "Count"
) +
# sort bars in descending order
scale_x_discrete(limits = df_r_grouped$label[order(df_r_grouped$count, decreasing = TRUE)]) +
scale_fill_brewer(type = "seq", palette = "RdYlBu") +
theme_minimal()
ggplotly(barchart)
The chart makes clear how imbalanced the dataset is. This suggests computing the class weights and passing them as an argument during training.
Most of our texts are clean: 80.33% of the label counts belong to the clean class, while only 19.67% fall into one of the toxic labels.
3.2 Sequence length definition
To convert text into an input a NN can use, it is necessary to apply a TextVectorization layer; see Section 4. One of its parameters is output_sequence_length: to choose it properly, it is useful to analyze the length of our texts. To simulate what the model will do, we remove punctuation and newlines from the comments.
3.2.1 Summary
Code
library(reticulate)
df_r %>%
mutate(
comment_text_clean = comment_text %>%
tolower() %>%
str_remove_all("[[:punct:]]") %>%
str_replace_all("\n", " "),
text_length = comment_text_clean %>% str_count()
) %>%
pull(text_length) %>%
summary() %>%
as.list() %>%
as_tibble() %>%
gt() %>%
tab_header(
title = "Summary Statistics",
subtitle = "of text length"
) %>%
fmt_number(
drop_trailing_zeros = TRUE,
drop_trailing_dec_mark = TRUE,
use_seps = TRUE
) %>%
cols_align(
align = "center",
) %>%
cols_label(
Min. = "Min",
`1st Qu.` = "Q1",
Median = "Median",
`3rd Qu.` = "Q3",
Max. = "Max"
)
Summary Statistics of text length

| Min | Q1 | Median | Mean | Q3 | Max |
|---|---|---|---|---|---|
| 4 | 91 | 196 | 378.4 | 419 | 5,000 |
3.2.2 Boxplot
Code
library(reticulate)
boxplot <- df_r %>%
mutate(
comment_text_clean = comment_text %>%
tolower() %>%
str_remove_all("[[:punct:]]") %>%
str_replace_all("\n", " "),
text_length = comment_text_clean %>% str_count()
) %>%
# pull(text_length) %>%
ggplot(aes(y = text_length)) +
geom_boxplot() +
theme_minimal()
ggplotly(boxplot)
3.2.3 Histogram
Code
df_ <- df_r %>%
mutate(
comment_text_clean = comment_text %>%
tolower() %>%
str_remove_all("[[:punct:]]") %>%
str_replace_all("\n", " "),
text_length = comment_text_clean %>% str_count()
)
Q1 <- quantile(df_$text_length, 0.25)
Q3 <- quantile(df_$text_length, 0.75)
IQR <- Q3 - Q1
upper_fence <- as.integer(Q3 + 1.5 * IQR)
histogram <- df_ %>%
ggplot(aes(x = text_length)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = upper_fence), color = "red", linetype = "dashed", linewidth = 1) +
theme_minimal() +
xlab("Text Length") +
ylab("Frequency") +
xlim(0, max(df_$text_length, upper_fence))
ggplotly(histogram)
Considering all the above analysis, a good starting value for output_sequence_length is 911, the upper fence of the boxplot: Q3 + 1.5 * IQR = 419 + 1.5 * (419 - 91) = 911. In the histogram it is the dashed red vertical line. This cutoff removes the outliers, which are a small part of our dataset.
3.3 Dataset
Now we can split the dataset into three sets: train, test, and validation. Since sklearn has no function that splits into three sets directly, we do the following:
- split into a train set and a temporary set with a 0.3 split;
- split the temporary set into two equally sized test and validation sets.
Code
x = df[config.features].values
y = df[config.labels].values
xtrain, xtemp, ytrain, ytemp = train_test_split(
x,
y,
test_size=config.temp_split, # .3
random_state=config.random_state
)
xtest, xval, ytest, yval = train_test_split(
xtemp,
ytemp,
test_size=config.test_split, # .5
random_state=config.random_state
)
xtrain shape: (111699,)
ytrain shape: (111699, 6)
xtest shape: (23936,)
ytest shape: (23936, 6)
xval shape: (23936,)
yval shape: (23936, 6)
The datasets are created with the tf.data.Dataset API, which builds a data input pipeline. The tf.data API makes it possible to handle large amounts of data, read from different data formats, and perform complex transformations. A tf.data.Dataset is an abstraction representing a sequence of elements, where each element consists of one or more components. Here each dataset is created with from_tensor_slices, which builds a tf.data.Dataset from a (features, labels) tuple. .batch lets us work in batches to improve performance, while .prefetch overlaps the preprocessing and model execution of a training step: while the model executes training step s, the input pipeline is reading the data for step s+1. Check the documentation for further information.
Code
train_ds = (
tf.data.Dataset
.from_tensor_slices((xtrain, ytrain))
.shuffle(xtrain.shape[0])
.batch(config.batch_size)
.prefetch(tf.data.experimental.AUTOTUNE)
)
test_ds = (
tf.data.Dataset
.from_tensor_slices((xtest, ytest))
.batch(config.batch_size)
.prefetch(tf.data.experimental.AUTOTUNE)
)
val_ds = (
tf.data.Dataset
.from_tensor_slices((xval, yval))
.batch(config.batch_size)
.prefetch(tf.data.experimental.AUTOTUNE)
)
Code
train_ds cardinality: 3491
val_ds cardinality: 748
test_ds cardinality: 748
Check the first element of the dataset to be sure the pipeline is built correctly.
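The output below can be reproduced by pulling a single batch from the pipeline, for example:

Code
# take one (texts, labels) batch from the training pipeline as numpy arrays
texts, labels = next(iter(train_ds.as_numpy_iterator()))
(texts, labels)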
(array([b'Because i killed them!',
b'It is best to have these things checked if only so they can be dismissed (rather than having vague accusations floating around). ( )',
b"Who died and left you queen of SPUI's page??? SPUI said he likes vandalism on his page and doesn't want it reverted. I'm not sure how many different ways saying it will will be required to get it through your thick skull.",
b'You have not read the article properly. As you should know, I do not at all believe in the transmission of moveable type printing to the West, but the Chinese invention of woodblock printing cannot be taken away from them, as the previous (& current version) implies (by omitting all mention of it. Hind remains a standard work on the subject, but this could be referenced from many other sources. I will rephrase the section to remove any possible ambiguity, although I think the previous version was perfectly clear to any reader without a POV.',
b'This is mentioned in the articles Space Jockey (Alien) and Predator (alien)#Culture and history.',
b'Stop reverting my edits and there will be no issues or be reported your choice (:',
b"The Vanity of Human Wishes \n\nMany thanks for your assessment of the article on The Vanity of Human Wishes. B-class isn't bad for an article that didn't even exist 22 days ago. Nice to feel appreciated. I am not sure that there is enough notable information out there to warrant the article ever getting a higher rating, but maybe I am just unimaginative.",
b"without any information about the forest elephant specifically, it does not need it's own article.",
b'"\n\n 78th Academy Awards \n\nPlease stop using edit summaries that give away the winners. [[WP:EA|E]] (T+C) at 03:48 UTC (2006-03-06)\nEveryone is watching, it\'s not a spoiler. Oh well, if it\'s annoying, fine. , Monday March 6 2006 at 03:51"',
b'"\n\nThank you! Haha I am a fan of Rescue Heroes, and I\'m 20. It\'s not like Barney & Friends since it doesn\'t have stuff on going to the doctor or anything like that, and it\'s not like Dora The Explorer since it doesn\'t say things like ""Do You See The Red Ball?"" I know from having to babysit for some kids sometimes, and besides, they showed how to put out a Grease Fire in one episode, what Four-Year-Old Would be allowed to do that? Anyway, thanks for telling me why you put the unreferenced tag there, you telling me was a big help. "',
b"Just improved the Taxpayer March on Washington article! \n\nI took out all those unofficial signs and gave you credit. Thanks for the mentoring! I took the lesson you gave me and then applied it to another article. But now people are upset, saying that those signs, even if unofficial, are important. Let's not let them win! I anticipate your immediate support..",
b"LINK#2 of this article (Foradejogo.net) usually contains, for Portuguese footballers (and foreigners of many seasons, as is Pa\xc3\xadto's case), stats for all national teams (and categories) represented. In his page, however, we only have full caps for Mozambique, so i assume he did not play for Portugal U21 (i am sorry i can't remember, even though i am Portuguese, we can't remember about all the players ) ) \n\nCheers - 217.129.65.5",
b'i just got on here a little while ago i am looking for the Tennis player Giovanni Lapentti no luck yet but i will keep looking',
b"57, 15 March 2008 (UTC)\nAre there any statistics you can provided that can prove this claim? These sites don't seem very diverse in thier coverage of the topic. 15",
b'"I understand that you are keen to contribute to our article on fields. Now that your block for edit-warring has expired, you must decide how you want to progress this. I notice that while blocked you continued to edit the talk page with an IP. This is not permitted and I would be within policy and custom to reblock you for this, see WP:SOCK. I decline this time to do so, but I warn you not to repeat this or further measures will be taken against you. I hope this will not become necessary as you will not require to be blocked again. Do not edit war, see WP:EW, but discuss with your sources (see WP:IRS and WP:V) on the talk page, under your proper account. \n\nI know that editing here is a steep learning curve and I am willing to help you do it right. You must accept that we are a mature community with very well-established policies and customs. If you come here you have to work within that, just like if you get a job in a university you have to listen to the people who are already working there. I can tell you have a lot to offer to the project. Please make your arguments in talk, be patient and kind to the folks who wrote the article (I am not one but was asked to look at this by someone who is. This does not mean I endorse their opinion on how the article should read.) even (especially) when you disagree with their writing, and accept that you will seldom get all of what you want, just as in most areas of life. I am a volunteer just like you, but I have been around for a good while and I am entrusted with the janitorial role of making sure that things don\'t get damaged. In support of that role I am allowed to block users, protect articles from editing and certain other powers. I would very much rather not use them any further here. I am also a science graduate, and have some passing understanding of the subject you are here to edit, for whatever that may be worth.\n\nPlease resume your discussions in talk, but please do not edit-war or sock again. Let me know if you need any further help and I will be glad to offer it to you. \n\nMy voice is important to me, if you take it upon yourself to snuff it out, or otherwise misrepresent it, in my book you are not out to help. \n\nThe good cop bad cop, all-in-one schizo stuff is kinda weird; & your paternalism is waaaay off the mark. Ignoring you was not ""unwise"", & I am hardly intimidated by your inadequate mastery of the rules you wish to nail to the wall. Perhaps you should review these dictates more closely yourself & check your double standards at the door. I am speaking of editing my criticisms of your actions amongst other things.\n\nI did not start an ""edit war"" as you imply. Your issue was personaldisobedience to be exact, with a sprinkling of that self empowering sparkle which lording over others endowsthe joy of duct tape & racquetball, which only a cop-at-heart can understand. I did not practice deception as you imply with the ""sock puppet"", I signed my name to every puppets comment. I am not a perp, as implied by your actions in sum. Therefore you are barking up the wrong tree. I don\'t buy it that you are a robot bound to regs just doing your job as implied; you are a person making choices & you chose bad faith & laziness.\n\nThe generalisms you sincerely & patiently list, I\'m sure are relevant in the wider context & larger community that you refer to & I\'m sure what you do is of value there. I don\'t envy your job. 
In this case, however, we\'re talking two people debating in a forum, & an article that needs to be kept safe & sterile from any potential change lest something terrible happen! Is the cop/citizen ratio so mature in this community that there are resources to police such minor quibbles? Where am I Disneyland? Tracking IP\'s attached to names in order to reign in such terrorism is downright dumb for a supposedly progressive community. See WP WTF\n\nIf you place an obedient noob filter on any & every minor impulse-response test that passes though looking to improve upon the low grade blasphemy passing itself off for knowledge, you will reap what you sow. The page will never make it to the C grade it claims to be.\n\nI do care about the subject, & *your* site is misrepresenting it for years now. I will continue to do what I can to fix it, with or without your rules if it\'s important to you that you stand in my way. There has been zero effort to address the fundamental criticism brought up a year ago now. How many people are going to bother explaining themselves after stroking a potentially infinite series of well meaning, but primarily self-serving policy enforcers hovering over a page, in order to change a few sentences? \n\nSince you have a science degree, you might know that this is a highly contentious topic. This is not an a',
b'"Let me get this straight, because Mall comes here to defend you, me asking you to be civil is trolling? I am glad you don\'t have any special buttons to go along with your interpretation of policy, because it does not seem to follow its wording or spirit. Chillum \n\n"',
b'600 police officers,',
b'Thankyou Sitush , your concern for me is touching , am sorry if my indents have irritated you ! .',
b'I DETEST you \n\nI absolutely hate you. You are scum. Nothing more than scum. I can tell by your wikipedia editing.',
b'"\nI thought the better of leaving the comment. The delay after the fact was not the issue for me: my off-the-cuff comment didn\'t make clear the issue and would have required more effort to clarify for no gain for either of us. It\'s a little hard for me when someone who doesn\'t know the material makes edits that only confuse the discussion. The source of the word in question was a Hebrew word \xd7\xa0\xd7\x96\xd7\x99\xd7\xa8, ""dedicated"", which has the third letter, YOD, the source of the Greek iota (in \xce\x9d\xce\xb1\xce\xb6\xce\xb9\xcf\x81\xce\xb1\xce\xb9\xce\xbf\xcf\x82) gives the English ""i"". Samson was a \xd7\xa0\xd7\x96\xd7\x99\xd7\xa8, Nazirite... and so on. Too much effort and its point would have been lost. I couldn\'t improve on the comment, so I cut it. I couldn\'t do anything about the history. What should I have done, once the comment was there? I usually do a lot of editing: my first comments often get changed. control "',
b"LGBT film? \nUh, no? This is definitely a film about pedophilia. I'm removing those categories, because it's not only inaccurate, but can be considered offensive to many. 71.59.189.46",
b'"\nBoth IPs are you, per my comments above. There\'s no room for ""doesn\'t seem likely."" It\'s not a coincidence that both IPs start with 99 and made edits that supported you (removing your comment/continuing your debate), just as it is not a coincidence when such IP ranges edit the To Catch a Predator article or any of the Wikipedia age of consent articles, or when you made a comment as this IP range on my talk page. I don\'t even have to provide a diff for that either, since you\'ve obviously looked over that time period in my talk page edit history. Even if Id did, you would likely only deny it, just as you have denied two of the most obvious WP:DUCK edits in the history of the Wikipedia. You talk about bad ideas. What is a bad idea is that you are still debating this and demanding that I show you proof for a comment you made. You\'d deny it anyway. Like I stated, stop wasting my time. I don\'t want to read another thing from you. You can keep on denying and demanding; it won\'t make a bit of difference with regard to you being believed/getting your way. I don\'t care for your apology. Just leave me alone unless there has to be interaction between us. And about the obsessed bit, I stated ""like a lot of other males."" Not ""male users."" It was in reference to what I\'ve faced outside of Wikipedia. "',
b'"\n\nI understand why they show this on your source, because they include also the East Germany. So you have to split the East Germany on a separate record.\n\nMy sources - here is a bit different, because they count Win or Lose also for a penalty game.\n\nA penalty game is considered drawn by FIFA. You should now this, I will send you a reference if you do not know. So you have to modify in this article in case of this situations. \n\nOther source 12 - 5 - 12.\n\nEngland national football team all-time record - if you have a look here, also you will see the difference.\nI did modify this articles and add colors, if you want to do it for this article also, and to split the East Germany from the main table.\nYou can add also ""Combined predecessor and successor Records"" a separate table for this cases. Read the England national football team all-time record article to know what I am talking about. I hope you did understand me. Thanks! "',
b"Anonymous is a hacktivist group, they have hackers who are politically motivated, and they comprise a decent portion of the group. My issue with this article is that it relies all to heavily on one book by Olson, which, though OR isn't allowed, I know, is very wrong in some parts. Such as claiming that Anonymous has rules at all.",
b'It fails as a third party or neutral source. 99.141.246.39',
b'Primary topic \n\nAccording to the stats site Million Dollar Band (country music group) got 521 hits in November, while Million Dollar Band (marching band) got 1141 hits. In my opinion that is not significantly more to suggest the marching band is the primary topic per WP:PRIMARYTOPIC.',
b'"\n\nThe PVV has nothing to do with any definition of libertarianism. i have no idea where that would come from. futhermore, we usually label people by their ideologies, rather than defining ideologies by the people we choose to associate with it. maybe Vladimir Lenin is right-wring and/or conservative.\xc2\xb7 Lygophile has spoken "',
b'MYSPACE \n\ncould someone please verify which one of these is the OFFICIAL one? thanks!\n\nhttp://www.myspace.com/dnegreanu\n\nhttp://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid;=63618592',
b'Egypt and Mesopotamia, and even the Levant, are full of older cities. They are just all gone now. And can I please see the evidence that puts a settlement (not city) there into the 4th millennium BCE? I would be surprised if there is any real evidence for it predating 2500 or so.',
b'"\n\nHave made a start, to show willing! There are many more references but mostly in the academic literature, inaccessible to general users of the www. I\'ll use these where needed, but will try and find some references that can be accessed freely as well. I have to declare a potential COI here; I am a practitioner of CAT so I have a vested interest in the article. However I guess it also means I have ready access to the sources needed, so maybe I\'m not in a bad position to do this as long as others can check the neutrality of what I contribute! (Talk) "',
b'When does my block expire? June 3rd right?Rocky',
b'Yep, just beware of some sites that may have not gotten proper permission. I know of one that offers scans and transcribes of Spectrum, Zap, etc but they have only obtained permission from the authors of those articles, not the publishers. That means that they did not really get permission from the copyright holders (publishers), hence, their hosting of those articles are still copyviolations.'],
dtype=object), array([[1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[1, 0, 1, 0, 1, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0]]))
We also check the shapes: we expect features of shape (batch,) and targets of shape (batch, number of labels).
Code
text train shape: (32,)
text train type: object
label train shape: (32, 6)
label train type: int64
4 Preprocessing
Of course, preprocessing! Raw text is not the type of input a NN can handle. The TextVectorization layer is meant to process natural language inputs. Each example goes through the following steps:
1. Standardize each example (usually lowercasing + punctuation stripping).
2. Split each example into substrings (usually words).
3. Recombine substrings into tokens (usually ngrams).
4. Index tokens (associate a unique int value with each token).
5. Transform each example using this index, either into a vector of ints or a dense float vector.
For more reference, see the documentation at the following link.
Code
text_vectorization = TextVectorization(
max_tokens=config.max_tokens,
standardize="lower_and_strip_punctuation",
split="whitespace",
output_mode="int",
output_sequence_length=config.output_sequence_length,
pad_to_max_tokens=True
)
# prepare a dataset that only yields raw text inputs (no labels)
text_train_ds = train_ds.map(lambda x, y: x)
# adapt the text vectorization layer to the text data to index the dataset vocabulary
text_vectorization.adapt(text_train_ds)
This layer is set to:
- max_tokens: 20000, a common choice for text classification. It is the maximum size of the vocabulary for this layer.
- output_sequence_length: 911. See Figure 3 for the reason why. Only valid in "int" mode.
- output_mode: "int", outputting one integer index per split string token. When output_mode == "int", 0 is reserved for masked locations; this reduces the vocab size to max_tokens - 2 instead of max_tokens - 1.
- standardize: "lower_and_strip_punctuation".
- split: on whitespace.
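After adapt, the learned vocabulary can be inspected as a sanity check, e.g.:

Code
vocab = text_vectorization.get_vocabulary()
print(len(vocab))  # at most 20000 entries; index 0 is the padding token "" and index 1 is "[UNK]"
print(vocab[:10])  # most frequent tokens first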
To preserve the original comments as text while also having datasets whose text is preprocessed by the TextVectorization layer, we map the layer over the features of each dataset.
Code
processed_train_ds = train_ds.map(
lambda x, y: (text_vectorization(x), y),
num_parallel_calls=tf.data.experimental.AUTOTUNE
)
processed_val_ds = val_ds.map(
lambda x, y: (text_vectorization(x), y),
num_parallel_calls=tf.data.experimental.AUTOTUNE
)
processed_test_ds = test_ds.map(
lambda x, y: (text_vectorization(x), y),
num_parallel_calls=tf.data.experimental.AUTOTUNE
)
5 Model
5.1 Definition
Define the model using the Functional API.
Code
def get_deeper_lstm_model():
    clear_session()
    inputs = Input(shape=(None,), dtype=tf.int64, name="inputs")
    embedding = Embedding(
        input_dim=config.max_tokens,
        output_dim=config.embedding_dim,
        mask_zero=True,
        name="embedding"
    )(inputs)
    x = Bidirectional(LSTM(256, return_sequences=True, name="bilstm_1"))(embedding)
    x = Bidirectional(LSTM(128, return_sequences=True, name="bilstm_2"))(x)
    # global average pooling over the sequence dimension
    x = GlobalAveragePooling1D()(x)
    # add regularization
    x = Dropout(0.3)(x)
    x = Dense(64, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.01))(x)
    x = LayerNormalization()(x)
    outputs = Dense(len(config.labels), activation='sigmoid', name="outputs")(x)
    model = Model(inputs, outputs)
    model.compile(
        optimizer='adam',
        loss="binary_crossentropy",
        metrics=config.metrics,
        steps_per_execution=32
    )
    return model

lstm_model = get_deeper_lstm_model()
lstm_model.summary()
5.2 Callbacks
Finally, the model has been trained using 2 callbacks:
- Early Stopping, to avoid consuming the Kaggle GPU quota;
- Model Checkpoint, to retain the best model found during training.
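Both come from the helper methods defined in the Config class, so assembling them is straightforward:

Code
early_stopping = config.get_early_stopping()
model_checkpoint = config.get_model_checkpoint(config.checkpoint)
callbacks = [early_stopping, model_checkpoint]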
5.3 Final preparation before fit
Since the dataset is imbalanced, we compute the class weights to improve performance. These will be passed to the model during training.
Code
| Label | class_weight |
|---|---|
| toxic | 9.59% |
| severe_toxic | 0.99% |
| obscene | 5.28% |
| threat | 0.31% |
| insult | 4.91% |
| identity_hate | 0.87% |
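The exact formula behind this table is not shown in the rendered output; the values line up with the per-label relative frequencies (in config.labels order) on the training split, so here is a sketch under that assumption:

Code
# assumption: weight of label i = relative frequency of label i among the training samples
class_weight = {i: ytrain[:, i].sum() / ytrain.shape[0] for i in range(len(config.labels))}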
It is also useful to define the steps per epoch for the train and validation datasets. This is required to avoid exhausting the dataset during the fit, which happened to me.
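With the sample counts stored in Config, the step counts follow directly:

Code
steps_per_epoch = config.train_samples // config.batch_size  # 111699 // 32 = 3490
validation_steps = config.val_samples // config.batch_size   # 23936 // 32 = 748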
5.4 Fit
The fit has been done on Kaggle to leverage the GPU. Some considerations about the fit:
- .repeat() ensures the model can keep consuming the whole dataset across epochs.
- epochs is set to 100.
- validation_data is repeated in the same way.
- callbacks are the ones defined before.
- class_weight makes training account for the frequency of each class, because our dataset is imbalanced.
- steps_per_epoch and validation_steps are required because of repeat(); see the sketch after this list.
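Putting these pieces together, the fit call run on Kaggle plausibly looked like the following (a sketch, not the verbatim training script):

Code
history = lstm_model.fit(
    processed_train_ds.repeat(),
    validation_data=processed_val_ds.repeat(),
    epochs=config.epochs,             # 100
    steps_per_epoch=steps_per_epoch,  # required because of .repeat()
    validation_steps=validation_steps,
    class_weight=class_weight,
    callbacks=callbacks
)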
Now we can import the model and the history trained on Kaggle.
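Both artifacts live at the paths stored in Config; a minimal sketch (reading the xlsx history requires an Excel engine such as openpyxl):

Code
lstm_model = load_model(config.model)
history_df = pd.read_excel(config.history)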
5.5 Evaluate
Code
val_metrics <- tibble(
metric = c("loss", "precision", "recall", "auc", "f1_score"),
value = py$validation
)
val_metrics %>%
gt() %>%
fmt_number(
columns = c("value"),
decimals = 4,
drop_trailing_zeros = TRUE,
drop_trailing_dec_mark = TRUE
) %>%
cols_align(
align = "left",
columns = metric
) %>%
cols_align(
align = "center",
columns = value
) %>%
cols_label(
metric = "Metric",
value = "Value"
)
| Metric | Value |
|---|---|
| loss | 0.0635 |
| precision | 0.6866 |
| recall | 0.7063 |
| auc | 0.9507 |
| f1_score | 0.0297 |
5.6 Predict
For the prediction, the dataset does not need to be repeated: the model has already been trained on all the training data, and now it just consumes the new data to produce predictions.
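A sketch of the prediction step, collecting the ground-truth labels from the (unshuffled) test pipeline so they stay aligned with the predicted probabilities:

Code
predictions = lstm_model.predict(processed_test_ds)      # per-class probabilities
ytrue = np.concatenate([y for _, y in test_ds], axis=0)  # aligned ground truth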
5.7 Confusion Matrix
A good way to assess the performance of a multilabel classifier is the confusion matrix. Sklearn provides a specific function, multilabel_confusion_matrix, which handles the fact that a single prediction can carry multiple labels.
5.7.1 Grid Search Cross Validation for best threshold
Grid Search CV is a technique for fine-tuning the hyperparameters of a ML model: it systematically searches through a set of hyperparameter values to find the combination that yields the best performance. In this case, I combine it with KFold Cross Validation, a resampling technique that splits the data into k consecutive folds; each fold is used once as validation while the remaining k - 1 folds form the training set. See the documentation for more information.
The threshold is chosen to optimize recall. The decision was made because the cost of a False Negative is greater than that of a False Positive: missing an injurious observation is worse than classifying a clean one as harmful.
5.7.2 Confidence threshold and Precision-Recall trade off
While the KFold GSCV technique is useful to test multiple hyperparameters, it is important to understand the problem we are facing. A multilabel deep learning classifier outputs a vector of per-class probabilities, which must be converted into a binary vector using a confidence threshold.
- The higher the threshold, the fewer classes the model predicts, increasing model confidence [higher Precision] but missing more classes [lower Recall].
- The lower the threshold, the more classes the model predicts, decreasing model confidence [lower Precision] but missing fewer classes [higher Recall].
Threshold selection means deciding which metric to prioritize, based on the problem at hand and the relative cost of misjudging. We can consider toxic comment filtering similar to cancer diagnostics: it is better to predict cancer in people who do not have it [False Positive] and perform further analysis than to miss the disease in a patient who has it [False Negative].
I decided to train the model on the F1 score, to keep it balanced between precision and recall, and to leave it to the threshold selection to boost the recall performance.
Moreover, the model has been trained on the macro average F1 score, a single performance indicator obtained as the mean of the per-class F1 scores (each of which is the harmonic mean of that class's Precision and Recall).
\[ F1_{\text{macro avg}} = \frac{1}{n} \sum_{i=1}^{n} F1_i \]
It is useful with imbalanced classes because it weights each class equally: it is not influenced by the number of samples in each class. This is set both in config.metrics and in find_optimal_threshold_cv.
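The three searches below all use the helper defined in Config, swapping in the sklearn metric each time; e.g.:

Code
optimal_threshold_f1, best_f1 = config.find_optimal_threshold_cv(ytrue, predictions, metric=f1_score)
optimal_threshold_recall, best_recall = config.find_optimal_threshold_cv(ytrue, predictions, metric=recall_score)
optimal_threshold_roc, best_roc = config.find_optimal_threshold_cv(ytrue, predictions, metric=roc_auc_score)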
f1_score
Code
Optimal threshold f1 score: 0.3. Best score: 0.5088499.
recall_score
Code
Optimal threshold recall: 0.05. Best score: 0.7891362.
roc_auc_score
Code
Optimal threshold roc: 0.05. Best score: 0.8729892.
5.7.3 Confusion Matrix Plot
Code
# convert probability predictions to binary predictions
ypred = predictions >= optimal_threshold_recall  # .05
ypred = ypred.astype(int)
# create a figure with 3x2 subplots, one per label
fig, axes = plt.subplots(3, 2, figsize=(15, 15))
axes = axes.flatten()
mcm = multilabel_confusion_matrix(ytrue, ypred)
# plot the confusion matrix for each label
for i, (cm, label) in enumerate(zip(mcm, config.labels)):
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot(ax=axes[i], colorbar=False)
    axes[i].set_title(f"Confusion matrix for label: {label}")
plt.tight_layout()
plt.show()
5.8 Classification Report
Code
library(reticulate)
df_cr <- py$df_cr %>% dplyr::rename(names = index)
cols <- df_cr %>% colnames()
df_cr %>%
pivot_longer(
cols = -names,
names_to = "metrics",
values_to = "values"
) %>%
pivot_wider(
names_from = names,
values_from = values
) %>%
gt() %>%
tab_header(
title = "Classification Report",
subtitle = "Threshold optimization favoring recall"
) %>%
fmt_number(
columns = c("precision", "recall", "f1-score", "support"),
decimals = 2,
drop_trailing_zeros = TRUE,
drop_trailing_dec_mark = TRUE
) %>%
cols_align(
align = "center",
columns = c("precision", "recall", "f1-score", "support")
) %>%
cols_align(
align = "left",
columns = metrics
) %>%
cols_label(
metrics = "Metrics",
precision = "Precision",
recall = "Recall",
`f1-score` = "F1-Score",
support = "Support"
)
Classification Report
Threshold optimization favoring recall

| Metrics | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| toxic | 0.57 | 0.88 | 0.69 | 2,262 |
| severe_toxic | 0.25 | 0.9 | 0.39 | 240 |
| obscene | 0.56 | 0.94 | 0.7 | 1,263 |
| threat | 0.04 | 0.46 | 0.07 | 69 |
| insult | 0.5 | 0.9 | 0.64 | 1,170 |
| identity_hate | 0.13 | 0.66 | 0.21 | 207 |
| micro avg | 0.44 | 0.89 | 0.59 | 5,211 |
| macro avg | 0.34 | 0.79 | 0.45 | 5,211 |
| weighted avg | 0.51 | 0.89 | 0.64 | 5,211 |
| samples avg | 0.05 | 0.08 | 0.06 | 5,211 |
6 Conclusions
The BiLSTM model, optimized for high recall, performs well enough to make predictions for each label. Considering the low support of the threat label, its performance is not bad: see Table 2 and Figure 1, where the threat label covers only 0.27% of the observations. The model has been optimized for recall because the cost of not identifying an injurious comment as such is higher than the cost of treating a clean comment as injurious.
Possible improvements could be to increase the number of observations, especially for the threat class. In general there are too many clean comments; this could be mitigated by undersampling the clean comments, which I deliberately avoided in order to check the BiLSTM's performance on an imbalanced dataset, leveraging the class weight method.