1 Introduction

The freedom of speech and expression offered by social media platforms is misused by some people to fill these platforms with abusive content. Though adults can manage this menace to some extent, children and teens are susceptible to serious mental health issues, as reported by Temper et al. (2013), Sonone et al. (2021), and Dellerman (2022). There has been a 70% increase in the amount of bullying/hate speech among teens and children since the Covid-19 lockdown.Footnote 1 This has led to rising interest in the artificial intelligence and natural language processing communities in the related social and ethical challenges, fueled by the worldwide commitment to combat toxic content. Here, toxic content covers hate speech, offensive and abusive language, cyberbullying, violence, and other forms of online harassment. The growing interest of the computer science community in addressing this menace is evident from workshops such as TRAC 2020,Footnote 2 STOC 2020Footnote 3 and WOAH 6Footnote 4 to be held in 2022. Most social media platforms rely on content moderation to restrict toxic content. Due to the massive scale of online content, we need unbiased and scalable systems to detect toxic content in real time. Such systems will gain people's confidence if they identify the span of text that is responsible for classifying the content as toxic. Toxic-free social media platforms are needed to promote healthy discussions among people.

Earlier research in this domain focused on classifying whether an entire piece of content is toxic or not. These models range from machine learning models such as logistic regression and support vector machines (Waseem and Hovy, 2016; Davidson et al., 2017) to deep learning models such as CNNs, RNNs, and attention networks (Park & Fung, 2017; Badjatiya et al., 2017; Founta et al., 2019; Chakrabarty et al., 2019; Chia et al., 2021; Kiran Babu & HimaBindu, 2022). Convolutional recurrent neural networks (Ashok Kumar et al., 2021; Elnaggar et al., 2018; Zhang et al., 2018) are also employed to capture long-term dependencies in social media text. To improve classification performance, transformer models are employed in Mozafari et al. (2019), Caselli et al. (2021), and Fortuna et al. (2021). The inability of these models to explain their judgments and actions to human users limits their efficacy. Thus, there is a need to develop self-contained and explainable systems. For toxic comment classification, the toxic span serves as the rationale. Recent research (Mathew et al., 2021; Xiang et al., 2021) focused on explainable toxic comment classification by predicting the toxic span of the comment. According to Adadi and Berrada (2018), explanations can be used to justify the decision and improve the accuracy and transparency of the model.

Toxic comment classification (TCC) and toxic span prediction (TSP) are related tasks. Multi-task models have shown better performance when related tasks are trained together (Gong et al., 2019; Ed-drissiya et al., 2021). Multi-task learning (MTL) jointly learns from multiple related tasks. MTL can be viewed as a type of inductive transfer and can improve a model's generalization on individual tasks by exchanging representations (e.g., shared parameters) between similar tasks. According to Baxter (2000), inductive transfer can help improve a model by providing an inductive bias that causes the model to favor one hypothesis over others.

In this paper, we propose a multi-task neural network model for toxic comment classification and rationale extraction. The model jointly learns the sequence classification and span prediction tasks and improves the performance of both. Section 6 shows that the proposed MTL model achieves better accuracy and F1 scores for both classification and span prediction. We curated a dataset from the JigsawFootnote 5 and TSD (Toxic Span Detection) (Pavlopoulos et al., 2021) datasets, as shown in Fig. 1, to enable multi-task learning. The Jigsaw dataset contains social media text with annotated class labels. The TSD dataset contains only toxic comments, annotated with toxic spans. The model is trained on the curated dataset containing the triple <text, class_label, toxic_span>. The trained model's goal is to predict the class label (toxic or non-toxic) and the toxic span when a social media text is given as input.

Fig. 1 Toxic comment classification and toxic span prediction system

Our experimental results on the curated dataset and the TSD dataset (Pavlopoulos et al., 2021) show that our single MTL model improves the performance of both classification and toxic span prediction. In SemEval-2021 Task 5 (Pavlopoulos et al., 2021), ensemble models are the winners; we claim that the proposed single MTL model is competitive with these ensemble models. In general, a model trained on a large-coverage dataset in one domain and tested on a smaller-coverage dataset in another domain will give better performance. When tested on unseen data, the proposed model is competitive with in-domain models. This shows that the proposed multi-task model not only improves prediction performance but also improves domain adaptation. We tested the transferability of the model trained on the curated dataset on two unseen datasets, viz. HASOCFootnote 6 and OLID.Footnote 7 This paper uses the terms rationale, explanation, and toxic span interchangeably.

1.1 Research questions

This paper addresses the following research questions:

  1. RQ1

    What neural network models are beneficial for toxic comment classification and rationale extraction? We examine Single Task Learning (STL) methodologies in deep learning and present an MTL architecture based on past research. We also use two STL baseline models to offer a more detailed examination.

  2. RQ2

    What datasets are available to train and evaluate the MTL model? To the best of our knowledge, no publicly available dataset is compatible with MTL for these tasks. We therefore curated a dataset from two publicly available datasets.

  3. RQ3

    Does MTL improve the performance of toxic comment classification and rationale extraction? To answer this question, we compare the proposed MTL with STL baselines and assess the influence of MTL on the two related tasks by measuring their accuracy and F1 scores.

  4. RQ4

    What pretrained language models are suitable for toxic comment classification and rationale extraction? The word representations are crucial for every NLP task. Hence, we experimented with three publicly available pretrained transformer models and their fine-tuned variants.

  5. RQ5

    Does MTL improve domain adaptation and generalizability of the model? We collected two publicly available datasets which include hate or offensive texts to evaluate the proposed MTL and STL baselines to answer this question.

1.2 Contributions

This paper’s contributions are as follows:

  1. 1.

    We present a multi-task learning model by leveraging the joint learning of toxic comment classification and toxic span prediction. The model uses a transformer-based Bi-LSTM CRF layer for these tasks to answer RQ1.

  2. 2.

    We curate a dataset of 29,623 comments annotated with class labels and toxic spans to enable multi-task learning. This dataset is curated from the Jigsaw and Toxic Span Detection (TSD) datasets (RQ2).

  3. 3.

    We provide an empirical analysis of the classification and span prediction performance of the multi-task model to answer RQ3.

  4. 4.

    We provide results and analysis of six transformer-based models, domain adaptation experiments on unseen datasets, and an error analysis to understand the limitations of the proposed model, thereby answering RQ4 and RQ5.

The rest of the paper is organized as follows. Section 2 provides a brief summary of different explainable methods: post-hoc methods (gradient-based, perturbation-based, attention-based) and constitutive methods, where human rationales are used to learn explanations. Section 3 briefly describes the usage of MTL. Section 4 presents the multi-task model used to jointly learn the toxic comment classification and toxic span prediction tasks. The experiments are described in Section 5, followed by the result analysis (Section 6). Section 7 contains a discussion of the results and the error analysis, and Section 8 concludes the paper with a summary and future work.

2 Related works

Machine learning explainability has lately received a lot of attention owing to the necessity for transparency. Existing explainable methodologies are classified into two types: post-hoc explainability and constitutive explainability. The goal of post-hoc explainability is to provide explanations for existing models. In this category, LIME (Ribeiro et al., 2016) is a representative approach that uses an explainable model to approximate model decisions in the local area of the feature space. Gradient-based techniques (Karen et al., 2014; Sundararajan et al., 2017) find relevant features by calculating the gradient of an output with respect to an input feature and estimate the contribution of various input features. Attention mechanisms (Vaswani et al., 2017; Xu et al., 2015) can also be used as explanations, as they identify sections of the input that the model attends to for specific predictions. Attention mechanisms are a more prevalent way of explaining individual predictions. The attention mechanism has played a key role in NLP, not just for explainability, but also for improving model performance (Devlin et al., 2019). However, Wiegreffe and Pinter (2019) recently called the effectiveness of attention as an explanation into question by pointing out that attention distributions are inconsistent with the importance of input units as measured by gradient-based methods.

In the second category, the goal of constitutive explainability is to build self-explanatory models. This is accomplished by incorporating explainability constraints into the model's learning process through the loss function, using human rationales (Xiang et al., 2021; Zaidan et al., 2007). Zaidan et al. (2007) first proposed the use of rationales, in which human annotators highlight a section of text that justifies their labelling judgement. The authors used these rationale annotations on a smaller amount of training data to improve sentiment classification.

The majority of the existing works on toxicity detection have focused on enhancing model performance, with little emphasis on explainability. For explainable toxicity classification, Mathew et al. (2021) provided a benchmark dataset annotated with rationales and built an explainable hate speech detection model. Similarly, Pavlopoulos et al. (2021) provide an annotated dataset containing toxic comments with span labels. Due to the nature of the task, most of the submissions (Bansal et al., 2021; Sharma et al., 2021; Zou and Li, 2021; Khan et al., 2021; Nguyen et al., 2021; Zhu et al., 2021) for SemEval-2021 Task 5 aim only at identifying the toxic spans and do not provide a classification label for the entire comment.

MTL methodologies, recently utilized for sentiment analysis (Akhtar et al., 2020; Wang et al., 2021) and adverse drug event extraction (Ed-drissiya et al., 2021), have shown state-of-the-art performance. Xiang et al. (2021) first utilized the MTL approach for detecting explainable toxicity in social media posts. Their model uses BERT and jointly learns from sequence labels and span labels. The authors treat span prediction as a token classification problem, where each token is identified as part of the span or not based on its toxicity score, and they use the Mean Squared Error (MSE) loss to train the model. This model makes a local decision at every point of the sequence: each token's classification is independent of the other tokens' classifications. Moreover, the class label of the entire sequence is determined from the toxic span scores. In cases of indirect toxicity, where the toxicity scores of individual tokens are low, this model is unable to predict the toxic class even though the text is toxic. As the tokens of a toxic span depend on each other, instead of identifying individual tokens as part of the toxic span, we modelled this problem as a sequence tagging problem, inspired by Chen et al. (2019) and Ed-drissiya et al. (2021). Conditional Random Fields (CRFs) are used to predict token labels by taking the full sequence into account. This is especially beneficial in NLP applications where word sequences depend on each other and certain word sequences are implausible. Hence, we devised a transformer-based Bi-LSTM CRF network to predict the toxic span and the class label of the social media post. The toxic span is the reason for labelling the post as toxic, and it is an empty string for non-toxic posts.

The introduction of transfer learning has undoubtedly aided in the acceleration of NLP research. We can leverage a pretrained model generated on a massive dataset and adjust it for different tasks on a task-specific dataset. Transformer-based pretrained language models have proven to be effective for various NLP tasks. A prime example is BERT (Devlin et al., 2019), which employs a bidirectional transformer architecture to learn word associations during pre-training, utilizing Masked Language Modelling and Next Sentence Prediction tasks for self-supervised learning. With an embedding size of 768, the transformer-based embeddings provide semantic and syntactic information about the input tokens. RoBERTa (Liu et al., 2019b) is a hyperparameter fine-tuned variant of BERT, pretrained on a 160 GB corpus using a dynamic masking method; RoBERTa does not use Next Sentence Prediction. To work with multilingual data, XLM-RoBERTa (XLMR) (Conneau et al., 2020) was trained on more than 2 TB of Common Crawl data covering 100 different languages. Transformer-based models have recently been used for fake news detection and censored tweet classification (Mehta et al., 2021; Ahmed and Kumar M., 2021) for better prediction performance. In this paper, BERT, RoBERTa and XLMR are used in all the experiments to learn contextual embeddings of the input sequence. We have also experimented with fine-tuned versions of these models, viz. ToxicBERT,Footnote 8 ToxicRoBERTaFootnote 9 and ToxicXLMR,Footnote 10 to test the effect of fine-tuned transformers on the proposed task.

3 Multi-task learning

Multi-task learning (MTL) utilizes multiple similar tasks that can regularize each other to improve the performance of the target task. MTL has its roots in Caruana's pioneering research (Caruana 1993, 1997) and has subsequently been used in a wide variety of machine learning applications, including computer vision (Long et al., 2017), bioinformatics (Ramsundar et al., 2015), and various subfields of natural language processing (Worsham & Kalita, 2020; McCann et al., 2018). The basic idea behind MTL is to train a model that provides outputs for multiple related tasks based on a single common input. We contrast this with traditional machine learning techniques, in which a model is often a function from a single input space to a single output space. The reasoning behind the MTL concept is that information captured in the training data for one task may help the model generalize better when learning another related task.

From a theoretical viewpoint, MTL has the advantage of acting as a regularizer for a specific task, helping to build generalized models that can handle unseen data. More precisely, as we optimize parameters for many tasks at the same time, the additional information contained in the auxiliary tasks acts as a means to prevent the model from overfitting to the training data. Another benefit of MTL that we have used in our study is its capacity to learn from several related datasets. This implies that we may integrate datasets from a variety of tasks without having to re-annotate the data (to ensure that the label spaces are consistent). MTL is mostly preferred when the target tasks improve performance compared to the single-task model (Standley et al., 2020).

When selecting a set of tasks for MTL, some design considerations for modelling and training data are influenced by the relative priority of the tasks. If we are interested in only a single task, we can simply optimize our model to produce the best possible performance for that particular task. When we are equally interested in good performance across all tasks, our work becomes significantly more difficult, since we need to strike a balance between the performance scores of all the tasks. MTL is not always advantageous for enhancing a classifier's performance (Worsham & Kalita, 2020). Aside from the increased complexity of the model and training time, the relationship between tasks and datasets is crucial for MTL effectiveness. After deciding on the model's design, we must determine the loss function for optimization. The most basic technique is to minimize a linear combination of the loss functions of each task, where every task has its own loss function \(L_{task}\). In our multi-task approach, we simply weigh each loss function and minimize the combined loss function shown in (1), where \(w_i\) is the weight of the \(i\)-th task's loss \(L_i\). The simplest method is uniform weighting (Gong et al., 2019).

$$\min_{\theta}{{\sum}_{i} w_{i} L_{i}\left(\theta\right)}$$
(1)

The construction of a training scheme is the last consideration in MTL training. The conventional MTL training process is to create mini-batches containing samples for a single task and then switch between tasks throughout training. The proportion of mini-batches per task might be the same for all tasks, or it can vary depending on task performance or dataset size (Caruana, 1997; McCann et al., 2018; Ruder, 2017). During the training phase, we may switch between optimization processes (Xiang et al., 2021) by having a suitable loss function for each task. Another issue that must be considered is how a task is expressed in the model. Each task in MTL has its own task-specific output layer for the task-specific outputs (Ruder, 2017). Instead of maintaining a separate model per task and aggregating their results, the MTL model shares parameters and only requires a task-specific head per task, thereby reducing the overall complexity of the model.
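As an illustration of the loss combination in (1) and the interleaved mini-batch scheme described above, the following sketch weighs two task losses and alternates batches between two task-specific data loaders. The function names and the example weights are illustrative assumptions, not our exact training code.

```python
import itertools

import torch


def weighted_mtl_loss(task_losses, task_weights):
    """Linear combination of per-task losses, as in Eq. (1)."""
    return sum(w * loss for w, loss in zip(task_weights, task_losses))


def interleave(loader_a, loader_b):
    """Alternate mini-batches between two task-specific data loaders."""
    for batch_a, batch_b in zip(loader_a, itertools.cycle(loader_b)):
        yield ("task_a", batch_a)
        yield ("task_b", batch_b)


# Example with uniform weighting (w_i = 0.5 for both tasks).
loss = weighted_mtl_loss(task_losses=[torch.tensor(0.7), torch.tensor(1.2)],
                         task_weights=[0.5, 0.5])
```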

4 Model description

We adopt an MTL model that can jointly learn and transfer knowledge between the toxic comment classification and toxic span prediction tasks. While a variety of techniques for MTL have been investigated in the past, the paradigm that has received the most interest in deep learning, and natural language processing (NLP) in particular, is hard parameter sharing. This MTL paradigm operates by sharing a fraction of the model's parameters between multiple tasks. We will compare the performance of these MTL models to that of STL baseline models that do not share any parameters.

Both toxic comment classification and toxic span prediction are related tasks. To reap the benefits of these inter-related tasks, similar to Xiang et al. (2021) and Chen et al. (2019), we built a multi-task learning neural network model that is jointly trained on the sequence classification and toxic span prediction tasks. The MTL model for explainable toxicity label prediction is shown in Fig. 2.

Fig. 2 Multi-task learning model for toxicity classification and rationale extraction

4.1 Problem statement

The social media post is denoted by \(I = \{w_1, w_2, \ldots, w_n\}\), where \(n\) is the sentence length. For any input sequence \(I\), the task is to identify the class label \(c_i \in C\), where \(C = \{\text{non-toxic}, \text{toxic}\}\), and to assign each word \(w_i \in I\) a tag \(y_i^s \in Y^s\), where \(Y^s = \{\text{B-T}, \text{I-T}, \text{O}\}\), to predict the toxic span (rationale) of the input sequence. To predict toxic spans, the BIO tagging scheme is used, where B-T (Begin) represents the first token of a toxic span, I-T (Inside) represents the inside and end tokens of a toxic span, and O represents non-toxic tokens.
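To make the tagging scheme concrete, the sketch below converts character-level toxic offsets (the annotation format of the TSD dataset) into word-level B-T/I-T/O tags. The whitespace tokenization and the function name are simplifying assumptions for illustration; the actual implementation uses the custom tokenizer described in Section 5.2.

```python
def bio_tags(text, toxic_offsets):
    """Assign B-T/I-T/O tags to whitespace tokens given toxic character offsets."""
    toxic = set(toxic_offsets)          # character positions annotated as toxic
    tags, cursor, prev_toxic = [], 0, False
    for token in text.split():
        start = text.index(token, cursor)
        end = start + len(token)
        in_span = any(pos in toxic for pos in range(start, end))
        tags.append(("I-T" if prev_toxic else "B-T") if in_span else "O")
        prev_toxic = in_span
        cursor = end
    return tags


# Offsets 7-17 cover "stupid idea" in the post below.
print(bio_tags("What a stupid idea", list(range(7, 18))))  # ['O', 'O', 'B-T', 'I-T']
```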

4.2 Proposed model

A transformer encoder is utilized to obtain the contextual embeddings of the input. To work with the transformer (e.g., BERT, XLM-RoBERTa), each input sequence is tokenized using the WordPiece tokenizer (Schuster and Nakajima, 2012). The special tokens [CLS] and [SEP] are added at the beginning and end of the input sequence, respectively. [CLS] is used as a classification token and [SEP] is used to represent the end of the input sequence. Let \(C_t\) and \(Y_t\) be the number of distinct classes and span labels, respectively. Given a tokenized input sequence \(I=\{\left[CLS\right], w_1, w_2, \ldots, w_n, \left[SEP\right]\}\), the output of the transformer encoder module is \(O = \{E_{CLS}, E_2, E_3, \ldots, E_N\}\) of size \([N, D_o] \in \mathbb{R}^{N \times D_o}\), where \(N = n + 2\) and \(D_o = 768\) is the final hidden layer dimension of the transformer encoder.

$$O = TransformerEncoder(I)$$
(2)

The output of the transformer encoder contains the contextual embedding of each input token \(w_i\). The special token embedding \(E_{CLS} \in \mathbb{R}^{D_o}\) is used for the classification task, as it contains the contextualized information of the entire input sentence \(I\). To avoid overfitting, we add a dropout layer with dropout rate 0.1 on the output of the transformer encoder. As shown in (3), the \(E_{CLS}\) embedding is fed to a linear layer with Softmax activation to predict the class label \(y^c\) of the input sequence \(I\).

$$y^{c}=Softmax\left({W_{c}^{T}}.E_{CLS}+b_{c}\right)$$
(3)

where \(W_{c} \in \mathbb{R}^{C_{t} \times D_{o}}\) is the weight matrix and \(b_{c} \in \mathbb{R}^{C_{t}}\) is the bias vector.

BiLSTM-CRF models have shown promising results for NER and sequence tagging tasks (Huang et al., 2015; Ma & Hovy, 2016; Zou & Li, 2021). The transformer encoder works in parallel on the entire input sequence. Hence, we use a stacked Bi-LSTM layer on the output of the transformer encoder to obtain position-sensitive embeddings as shown in (4). This is necessary to use the sequence of the input tokens in determining the toxic span.

$$B = BiLSTM(O)$$
(4)

Let \(B = \{B_{1},B_{2},...,B_{N}\} \in \mathbb {R}^{N-1 \times D_{b}}\) denote the output features retrieved by the BiLSTM layer, where Db is the hidden dimension of the BiLSTM layer.

$$S_{i} = LinearLayer(W_{s_{i}}^{T}.B_{i}), \ W_{s_{i}} \in \mathbb{R}^{Y_{t} \times D_{b}}, i \in \{2,{\dots} N-1\}$$
(5)

A linear layer is added on top of the BiLSTM layer to get the label score \(S_{i} \in \mathbb{R}^{Y_{t}}\) of each token \(w_i\). \(W_{s_{i}}\) is the weight matrix and \(S=\{S_{1}, S_{2}, \ldots, S_{N-1}\}\) is the set of label scores of the input sequence. Toxic span prediction is influenced by the surrounding word predictions. The purpose of the CRF layer is to decode the best label chain \(y^{s} = \{{y_{1}^{s}},{y_{2}^{s}},\ldots,{y_{N}^{s}}\}\) from \(S\). As a discriminative graphical model, the CRF benefits from considering the correlations between neighboring labels/tags, and it is extensively utilized in sequence labelling and tagging applications (Ma & Hovy, 2016). The probabilistic CRF model defines a family of conditional probabilities \(p(y|S)\) over all possible label sequences \(y\) given \(S\).
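The sketch below outlines the architecture of Fig. 2 in PyTorch, assuming the Hugging Face transformers library for the encoder and the pytorch-crf package for the CRF layer. Layer sizes and names follow the description above, but the code is an illustrative sketch under these assumptions rather than our exact implementation.

```python
import torch.nn as nn
from torchcrf import CRF                 # pytorch-crf package (assumed dependency)
from transformers import AutoModel


class ToxicMTLModel(nn.Module):
    def __init__(self, encoder_name="xlm-roberta-base", num_classes=2,
                 num_tags=3, lstm_hidden=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        d_o = self.encoder.config.hidden_size             # D_o = 768 for base models
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(d_o, num_classes)      # Eq. (3), applied to E_CLS
        self.bilstm = nn.LSTM(d_o, lstm_hidden, batch_first=True,
                              bidirectional=True)          # Eq. (4)
        self.tag_scorer = nn.Linear(2 * lstm_hidden, num_tags)  # Eq. (5)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden = self.dropout(out.last_hidden_state)        # [batch, N, D_o]
        class_logits = self.classifier(hidden[:, 0])        # [CLS] embedding
        emissions = self.tag_scorer(self.bilstm(hidden)[0])
        mask = attention_mask.bool()
        if tags is not None:                                # training: CRF loss L_TSP
            span_loss = -self.crf(emissions, tags, mask=mask, reduction="mean")
            return class_logits, span_loss
        return class_logits, self.crf.decode(emissions, mask=mask)  # Viterbi decoding
```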

We employ maximum conditional likelihood estimation for CRF training. The log-likelihood function is given by (6) and maximum likelihood training updates the parameters to maximize the log-likelihood L.

$$L = \sum\limits_{i}{log\ p(y\vert S)}$$
(6)

The goal of decoding is to find the label sequence \(y^s\) with the highest conditional probability

$$y^{s} = \arg\!\max_{y \in Y(B)}p(y\vert S)$$
(7)

where \(Y(B)\) represents all possible output sequences for input \(I\). The CRF layer uses the Viterbi algorithm (Viterbi, 2009) to decode the label sequence by determining the most likely sequence of hidden states with the best posterior probability estimates.

4.3 Loss and model training

To jointly learn from both classification and span prediction tasks, the loss function is given by:

$$Loss=\alpha\sum L_{TC}+\left(1-\alpha\right)\sum L_{TSP}$$
(8)

The weight parameter α controls the importance of each task.

The maximization of (6) is converted to a minimization problem by taking the negative log-likelihood (\(-L\)). The model is trained end-to-end by minimizing a weighted loss of both tasks. The cross-entropy loss (\(L_{TC}\)) and the CRF loss (\(L_{TSP} = -L\)) are used for the classification and toxic span prediction tasks, respectively. The overall training process of the model is summarized in Algorithm 1.

Algorithm 1 Training process of the proposed MTL model
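A minimal sketch of the training step summarized in Algorithm 1, assuming the ToxicMTLModel sketch above and batches carrying input_ids, attention_mask, class labels, and BIO tags. The field names and the default α = 0.5 are illustrative assumptions.

```python
import torch.nn.functional as F


def train_step(model, batch, optimizer, alpha=0.5):
    """One end-to-end update minimizing Loss = alpha * L_TC + (1 - alpha) * L_TSP, Eq. (8)."""
    model.train()
    optimizer.zero_grad()
    class_logits, span_loss = model(batch["input_ids"], batch["attention_mask"],
                                    tags=batch["tags"])
    cls_loss = F.cross_entropy(class_logits, batch["labels"])   # L_TC
    loss = alpha * cls_loss + (1.0 - alpha) * span_loss          # joint loss, Eq. (8)
    loss.backward()
    optimizer.step()
    return loss.item()
```

For samples that carry only a class label (no span annotation), only the classification term is back-propagated, as described in Section 5.3.3.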

5 Experiments

We evaluated our models on the curated dataset for classification performance and on the TSD dataset for toxic span prediction performance. To test for domain adaptation, we evaluated our models on the HASOC and OLID English datasets.

5.1 Datasets

We collected four publicly available datasets: two of these are used to curate a new dataset to enable MTL, and the remaining two are used to test the domain adaptation and generalizability of the model on unseen data. Each dataset is described in the following sub-sections.

5.1.1 Kaggle-JigsawFootnote 11

The Civil Comments platform used crowd-sourced moderation and advanced community management tools to bring real-world social cues to comment sections. Civil Comments was the first commenting platform created with the goal of improving how people interact online. It was shut down at the end of 2017.Footnote 12 Its operators provided nearly 2 million public comments in an open archive so that researchers could better understand and promote civil discourse in online interactions. Jigsaw funded the project and had human raters annotate the data for several toxic conversational characteristics: toxic, severe toxic, obscene, threat, insult, and identity hate. We converted this dataset into a binary classification dataset for our training and testing purposes by merging all forms of toxicity into a single class to suit our problem statement.

5.1.2 TSD

The Toxic Span Detection dataset (Pavlopoulos et al., 2021) is a subset of the Jigsaw dataset containing 10k toxic comment samples labelled with toxic spans. Each sample of this dataset is annotated by three individuals, and a span is labelled as toxic only when at least two annotators label it as toxic. The annotation was carried out by the SemEval-2021 organizers, who released the dataset for Task 5.

5.1.3 Curated dataset

The Kaggle-Jigsaw dataset contains only the class labels for whole text sequences, and it contains both toxic and non-toxic sequences. The TSD dataset contains only toxic posts with their toxic span information and hence does not contain class labels, as shown in Fig. 3. As MTL requires both, we curated a dataset that contains class labels and span information from the Jigsaw and TSD datasets. We collected the top 20 words that are part of toxic spans in the TSD dataset, and collected non-toxic posts containing these words from the Jigsaw dataset. The inclusion of non-toxic posts with toxic words makes the model learn the non-toxic usage of toxic words (Xiang et al., 2021). When collecting Jigsaw samples, we removed the samples that overlapped with the TSD dataset. The TSD dataset contains nearly 10k samples. To have balanced non-toxic posts, a total of 19k posts were collected from the 1.8 million posts in the Jigsaw dataset, of which 16k were non-toxic and 3k were toxic. While curating the dataset from Jigsaw, samples with a toxicity score ≤ 0.1 are considered non-toxic and those with a toxicity score ≥ 0.8 are considered toxic posts. 10k posts were collected from the TSD dataset, and all of these posts are treated as toxic.
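The curation procedure can be sketched as follows, assuming the Jigsaw comments are loaded into a pandas DataFrame with a fractional toxicity score column. The file names, column names, and the simple span-word extraction are hypothetical placeholders, not the exact pipeline we used.

```python
import ast
from collections import Counter

import pandas as pd

# Hypothetical file and column names; the released datasets may use different ones.
tsd = pd.read_csv("tsd_train.csv")        # columns: text, spans (character offsets)
jigsaw = pd.read_csv("jigsaw.csv")        # columns: comment_text, toxicity

# Top 20 words appearing inside TSD toxic spans.
span_words = Counter()
for text, spans in zip(tsd["text"], tsd["spans"]):
    offsets = ast.literal_eval(spans)
    span_text = "".join(text[i] for i in offsets)
    span_words.update(span_text.lower().split())
top_words = {w for w, _ in span_words.most_common(20)}

# Jigsaw posts containing those words, split by the toxicity score thresholds.
contains_top = jigsaw["comment_text"].str.lower().apply(
    lambda t: any(w in t.split() for w in top_words))
non_toxic = jigsaw[(jigsaw["toxicity"] <= 0.1) & contains_top]
toxic = jigsaw[(jigsaw["toxicity"] >= 0.8) & contains_top]
```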

Fig. 3 Samples from Jigsaw contain toxic and non-toxic posts and do not contain toxic span ground truth. Samples from the TSD dataset are all toxic and contain toxic span ground truth. The curated dataset is a mixture of the TSD and Jigsaw datasets. The toxic span ground truth is the empty set for non-toxic samples and is unknown for the toxic samples of the Jigsaw dataset. The toxic span is shown in red. Content disclaimer: this figure and some of the subsequent pages contain toxic content which may be disturbing to some readers; the toxic content is used only for illustration purposes

All posts in the TSD dataset have toxic span information. For all non-toxic posts from Jigsaw, the span information is empty, which means there is no toxic span to learn from non-toxic posts. Therefore, these posts are not included in the test and trial sets, as they are not useful for evaluating span prediction performance. To have balanced toxic and non-toxic posts in the training set, we followed the train-test-valid splits shown in Table 1. The training set thus has nearly 11k toxic posts and 12k non-toxic posts.

Table 1 Curated dataset distribution: 10k samples collected from TSD and 19k samples collected from Jigsaw, split into train, test and valid sets

5.1.4 HASOC and OLID

The HASOC and OLID datasets comprise tweets with hierarchical annotations. The first-level annotations contain the labels offensive and not offensive, while the next level annotates the type of offense. In our evaluation, we used the first-level annotations, which indicate whether a tweet is offensive or not. We use the test samples from OLID and HASOC to evaluate model performance on out-of-domain data. From Table 2, it can be observed that the social media post (text) length and the number of words per post of the HASOC and OLID datasets are small compared to the curated dataset.

Table 2 Statistics of three datasets (Curated, HASOC, OLID)

5.2 Pre-processing

The pre-processing stage eliminates URLs and collapses repeated strings into a single character (e.g., “!!!!!!” is changed to “!”). We remove white spaces in toxic offsets (e.g., in the text “you are astupid”, the leading white-space is also marked as part of the toxic span) and singleton offsets (e.g., in the text “ab usive speech,” only ‘b’ is marked as a toxic offset). These are removed as annotator inconsistencies. Following these steps, the text is tokenized with a custom tokenizer to preserve terms like “a$$” rather than tokenizing them as ‘a’, ‘$’, ‘$’. After tokenization, B-T (begin toxic) is assigned to a token if it is at the beginning of a toxic span, I-T (inside toxic) is assigned if it is within a toxic span, and O (other) is assigned otherwise. O is also assigned to all tokens of non-toxic comments. These tokens serve as ground-truth tags for the CRF layer. To get a fixed-length input sequence, we use post-padding for shorter sequences and truncation for longer sequences.
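The cleaning steps above can be sketched as follows; the regular expressions and function names are illustrative assumptions rather than the exact rules used in our pipeline.

```python
import re


def clean_text(text):
    """Remove URLs and collapse repeated characters (e.g., '!!!!!!' -> '!')."""
    text = re.sub(r"https?://\S+", "", text)      # drop URLs
    text = re.sub(r"(.)\1{2,}", r"\1", text)      # collapse runs of 3+ identical chars
    return text.strip()


def clean_offsets(text, offsets):
    """Drop whitespace positions and singleton offsets left by annotator inconsistencies."""
    offsets = [i for i in offsets if i < len(text) and not text[i].isspace()]
    return offsets if len(offsets) > 1 else []


print(clean_text("Shut up!!!!!! http://example.com"))     # "Shut up!"
```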

5.3 Model implementation

For comparison, we built baseline models that perform the classification task alone and the span prediction task alone.

5.3.1 Classification task

For these models, we use the [CLS] token embedding for sequence classification. As shown in Fig. 2, the E[CLS] embedding is fed to the Softmax layer to predict the class label of the input sequence. All transformer-based sequence classification models are trained with the cross-entropy (\(L_{TC}\)) loss function.

5.3.2 Span prediction task

For these models, we use all the token embeddings for span prediction. As shown in Fig. 2, all token embeddings (\(E_i\)) are fed to the CRF layer. All transformer-based span prediction models are trained to minimize the span prediction loss (\(L_{TSP}\)). These models are trained only on samples that contain toxic span information.

5.3.3 Multi-task

The multi-task models are constructed as shown in Fig. 2 and are optimized using the loss function indicated in (8). We use the joint loss to compute the loss for samples containing both post-level and span-level labels, and compute only the classification loss \(L_{TC}\) for samples that contain only a class label for the entire text. During training, we created interleaved batches of both types of samples to ensure a balanced update of the parameters.

5.3.4 BERT-MT

The BERT-MT model proposed by Xiang et al. (2021) is replicated according to their description of the model and hyperparameters. In this approach, the input sequence is tokenized and fed to the BERT layer. On top of BERT, a linear layer with a sigmoid activation function is applied to each token to obtain its toxicity score. The toxicity score of the entire sequence is the maximum toxicity score of its tokens. These scores are used to identify the class label and toxic span of the input sequence. The model is trained end-to-end using a joint MSE loss.

5.4 Hyperparameters

We experimented with the base versions of BERT, RoBERTa, and XLM-RoBERTa and their fine-tuned variants as embedding models. Hyperparameters were tuned using the development set. For all transformer models, the input sequence length is 512 tokens. The Adam optimizer is used with a learning rate of 5e-05. The dropout rate is set to 0.1. The batch size is set to 24, and each model is trained end-to-end for 5 epochs. In the loss function (8), α is set to 0.5 to give equal importance to both tasks. Figure 4 shows the influence of α on the two tasks; the best performance is observed at α = 0.5. The influence of α is low on the classification task, as the class labels can be predicted from the span predictions.
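For reference, the settings listed above can be collected into a single configuration object; the values simply restate those reported in this section.

```python
HYPERPARAMETERS = {
    "max_sequence_length": 512,   # tokens per input sequence
    "optimizer": "Adam",
    "learning_rate": 5e-05,
    "dropout": 0.1,
    "batch_size": 24,
    "epochs": 5,
    "alpha": 0.5,                 # task weight in the joint loss, Eq. (8)
}
```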

Fig. 4 Classification and span prediction F1 score vs. weight factor (α) of the loss function for the ToxicXLMR multi-task model

6 Result analysis

We ran a series of experiments to see how the three commonly used pretrained language models and their fine-tuned versions affected the overall performance of the proposed MTL system. This section provides the results of the proposed MTL and the baseline models for the toxic comment classification and toxic span prediction tasks.

6.1 Classification performance

The experimental results of MTL and STL for classification (STL-C) are shown in Table 3. The Toxic versions of BERT, RoBERTa and XLMR are fine-tuned variants of BERT, RoBERTa and XLMR, respectively, that were trained for the classification task; they are available at Hugging Face.Footnote 13 The ToxicXLMR-based multi-task model achieved the best performance among all the models we evaluated. From the results, it is observed that the ToxicXLMR-based multi-task model improves accuracy by 4% compared to the STL-C model. It is also observed that the ToxicXLMR-based models performed better than the RoBERTa- and BERT-based models, and that the multi-task models yield at least a 1% improvement in classification accuracy over the STL-C models across the transformer variants. The classification performance of BERT-MT (Xiang et al., 2021) is comparable to that of the proposed model using the BERT and ToxicBERT transformers. However, the proposed model with the XLMR transformer has a significantly better accuracy and F1 score (by 4%) than BERT-MT.

Table 3 Model Evaluation Results on Curated dataset

6.2 Span prediction performance

To evaluate our models on the toxic span prediction task (rationale extraction), the TSD (SemEval-2021) dataset was used. The ad-hoc assessment measure proposed by Da San Martino et al. (2019), which was officially used to rank the SemEval-2021 Task 5 submissions, was used to evaluate span prediction performance. This metric gives partial credit to incomplete character matches. For a document \(d\), let \(S_d\) be the set of toxic character offsets predicted by the system and \(G_d\) the set of ground truth annotations. Then the \(F_1^d\) score of the system for document \(d\) is defined as

$${F_{1}^{d}}=\ \frac{2\ast P^{d}\ast\ R^{d}}{P^{d}+\ R^{d}},\ where\ P^{d}=\ \frac{{\vert S}_{d}\ \cap\ G_{d}\vert}{\vert S_{d}\vert}\ and\ \ R^{d}=\ \frac{{\vert S}_{d}\ \cap\ G_{d}\vert}{\vert G_{d}\vert}$$
(9)

When a document has no ground truth annotation (\(G_d = \emptyset\)), or the system outputs no character offset predictions (\(S_d = \emptyset\)), then (10) is used.

$${F_{1}^{d}}= \begin{cases} 1\ when \ G_{d}=S_{d}=\emptyset,\\ 0 \ otherwise \end{cases}$$
(10)

Finally, the model F1 score is the average of the per-document F1 scores over all test samples of an evaluation dataset.
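A minimal implementation of the character-offset F1 in (9) and (10), operating on sets of predicted and ground-truth character offsets for each document.

```python
def span_f1(predicted_offsets, gold_offsets):
    """Character-offset F1 for one document, per Eqs. (9) and (10)."""
    S, G = set(predicted_offsets), set(gold_offsets)
    if not S or not G:
        return 1.0 if S == G else 0.0          # Eq. (10)
    overlap = len(S & G)
    precision, recall = overlap / len(S), overlap / len(G)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def dataset_f1(all_predictions, all_gold):
    """Model F1: average of per-document F1 over the evaluation dataset."""
    return sum(span_f1(s, g) for s, g in zip(all_predictions, all_gold)) / len(all_gold)
```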

The F1 score of the proposed multi-task model is on par with the best ensemble models on the leaderboard of SemEval-2021 Task 5 (Pavlopoulos et al., 2021). As shown in Table 5, its F1 score is 69.38, while the top two ensemble models' F1 scores are 70.83 and 70.77, respectively. HITSZ-HLT (Zhu et al., 2021) and S-NLP (Nguyen et al., 2021) are ensembles of three transformer-based models, which have roughly 3x the inference time of the proposed MTL model. Benchmark 1 (Nguyen et al., 2021) in Pavlopoulos et al. (2021) used a fine-tuned version of RoBERTa and achieved a competitive F1 score, but it can only identify the toxic span and is not applicable to the classification of toxic comments. With Benchmark 1, one would need to create a pipeline of toxic comment classification and toxic span prediction models to identify both the label and the toxic span of the input sequence. In contrast, the proposed single model performs both tasks, with a competitive F1 score for toxic span prediction and toxic comment classification. The span prediction score of the BERT-MT (Xiang et al., 2021) model is very low compared to the ToxicXLMR-MTL model, as shown in Table 5. This result shows the significance of the Bi-LSTM CRF model for span prediction compared to the individual token classification done in Xiang et al. (2021).

6.3 Generalization ability of the model

We are interested in determining the impact of the coverage of the phenomena captured by the models. According to Pamungkas and Patti (2019), a classifier trained on a larger-coverage dataset and tested on a smaller-coverage dataset will give good performance. Hence, we tested the proposed MTL model on the HASOC and OLID datasets, which were not exposed to the model during training. We performed the Mann-Whitney U test (Nachar, 2008) on these three datasets to find the overlap of word distributions. We found that 30% of the words shared between the curated and HASOC datasets, and 23% of those shared between the curated and OLID datasets, have different distributions at 95% confidence. This shows the dissimilarity of the curated, HASOC, and OLID datasets. From the results shown in Table 6, it can be observed that MTL models obtain better results than STL-C models. We observed 3% and 6% improvements in the weighted F1 scores of all the MTL variants compared to the STL-C variants on the HASOC and OLID datasets, respectively. The best published weighted F1 score on the HASOC dataset is 83.95 (Wang et al., 2019), while our best multi-task model (ToxicXLMR-MTL) achieves a weighted F1 score of 80.76. Similarly, the best published macro F1 score on the OLID dataset is 83 (Liu et al., 2019a), while our best multi-task model's score is 77.53.
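The word-distribution comparison described above could be sketched as follows, assuming per-document frequency counts of each shared word and SciPy's Mann-Whitney U test; this is our illustrative reading of the procedure, not necessarily the exact setup used.

```python
from scipy.stats import mannwhitneyu


def fraction_of_differing_words(docs_a, docs_b, shared_vocab, alpha=0.05):
    """Fraction of shared words whose per-document frequencies differ (95% confidence)."""
    differing = 0
    for word in shared_vocab:
        freq_a = [doc.lower().split().count(word) for doc in docs_a]
        freq_b = [doc.lower().split().count(word) for doc in docs_b]
        _, p_value = mannwhitneyu(freq_a, freq_b, alternative="two-sided")
        if p_value < alpha:
            differing += 1
    return differing / len(shared_vocab)
```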

7 Discussion

This section provides a discussion about suitable pretrained language models, the effect of their fine-tuned versions, the performance impact of the MTL models compared to STL models, and the error analysis of the top-performing model.

7.1 Influence of pretrained language models

We conducted extensive experiments with several pretrained language models in order to analyze the performance of the shared parameter mechanism used in the proposed MTL model. From the results, we first show that integrating pretrained language models with MTL improves the effectiveness of toxic span prediction and toxic comment classification. The pretrained models utilize 12 transformer layers built from multi-head attention networks, which assist in extracting contextual information for the tokens. MTL allows the system to share knowledge between tasks and exploits knowledge from the TSP task to boost the TCC task. From the results (Tables 3 and 4), it can be observed that the type of transformer model and the corpus utilized for pre-training and fine-tuning also affect the performance of the proposed model. The fine-tuned versions of XLMR and RoBERTa showed improvements compared to their base versions. For example, ToxicXLMR-MTL has shown a 1.8% and 1% improvement over XLMR-MTL for the TCC and TSP tasks, respectively, in terms of macro F1 score. It can also be seen from Table 6 that ToxicXLMR improved the domain adaptation performance by 6% and 10% compared to BERT when tested on the HASOC and OLID test sets, respectively, in terms of accuracy and F1 score.

Table 4 Evaluation Results for Toxic Span Prediction task on TSD test dataset

7.2 Performance evaluation with baselines

We compared the performance of the proposed MTL system with two baseline models built on the single task learning (STL) methodology. The results suggest that MTL outperformed the STL baseline models in terms of accuracy and F1 score. For instance, the MTL (ToxicXLMR-MTL) model showed 4% and 1.7% improvements in F1 score compared to the STL (ToxicXLMR) models for the TCC and TSP tasks, respectively. ToxicXLMR-MTL showed a 9% improvement on the toxic span prediction task compared to BERT-MT (Xiang et al., 2021). When the models are evaluated on unseen datasets to test domain adaptation ability, the MTL model (ToxicXLMR-MTL) shows 3% and 8% improvements in macro F1 score on HASOC and OLID, respectively, compared to the STL model (ToxicXLMR-STL-C). We also compared the proposed MTL model with the top-ranking ensembles of SemEval-2021 Task 5 for the TSP task; the MTL model (ToxicXLMR-MTL) shows a competitive F1 score (Table 5).

Table 5 Comparing our model with best performing models on the TSD dataset

7.3 Error analysis

Despite its overall performance, the proposed MTL model still makes some errors. In this section, we examine the errors of the best-performing model (ToxicXLMR-MTL).

7.3.1 Classification analysis

Figure 5 shows the confusion matrices of the predicted class labels against the ground truth class labels for both the ToxicXLMR-MTL and ToxicXLMR-STL-C models (these two models perform best, as can be seen in Tables 3 and 6). From Fig. 5, it can be observed that both false negatives (toxic comments predicted as non-toxic) and false positives are higher for ToxicXLMR-STL-C than for ToxicXLMR-MTL.

Fig. 5 Left: confusion matrix for the multi-task model; right: confusion matrix for the classification-only model. In the figure, 0 means non-toxic and 1 means toxic

Table 6 Model Evaluation Results on both HASOC and OLID for domain adaptation

7.3.2 Analysis of false positives

A qualitative analysis of the results of the ToxicXLMR-MTL model revealed that false positives (non-toxic comments predicted as toxic) are due to negative words such as racist, screw, traitor, stupid, troll, etc. The following five comments (a to e) are actually non-toxic; however, the ToxicXLMR-MTL model predicted them as toxic due to the presence of negative words. The toxic spans identified by the model are shown in bold. For example, comment (b) is labelled as toxic due to the presence of the word “pussy” (a negative word). The toxic span identified by the model allows us to pinpoint the cause of this misclassification. The mere presence of negative words does not necessarily make a comment toxic. From Fig. 5, we can observe that the MTL model has fewer false positives (168) than the STL model (182) on the curated dataset. The BiLSTM CRF layer considers the sequence of the text in determining the toxic span, and the joint loss, which gives equal importance to both the classification and span prediction tasks, has a positive influence on the results of both. However, the importance of negative words as learned by the model dominates over the non-toxic usage of toxic words.

  1. a)

    Sounds like racism and xenophobia to me. I personally am not afraid of brown and black people but I am completely nauseated by people like you

  2. b)

    and that’s what you got out of the whole thing - you say pink pussy(cat) hats - what are you.

  3. c)

    morally disgusting

  4. d)

    Getting screwed by a bank. who would have thought

  5. e)

    The Lynch comment refers to a bout of renaming in the Centennial School District. We also have another school district in another state renaming a school from Jefferson Davis High to the name of another traitor, Barry Obama

The false positives and false negatives occur due to doubtful labels, toxicity without swear/profane words, sarcasm, irony, and the usage of swear words in a non-toxic context (Van Aken et al., 2018). We wish to study the effect of semantic embeddings in addressing these challenges in the future.

7.3.3 Analysis of false negatives

The major source of false negatives (toxic comments predicted as non-toxic) can be attributed to a lack of context. As the model is unaware of the context in which a particular sentence is used and of indirect toxicity, it is hard for it to predict that such a sentence is toxic. Sometimes it is even hard for humans to figure out such implicit toxicity without knowledge of the context of the discussion. Current models are far from the human understanding of language.

The following four comments are examples of toxic comments labelled as non-toxic by ToxicXLMR-MTL.

  1. a)

    I do. My father shot at them and I’d be happy to follow in his footsteps. Waste of good lead, though

  2. b)

    I agree. I think we should stop calling them “alt-right” and start calling them Nazis instead

  3. c)

    Ignorance must be bliss for you

  4. d)

    Good job smelly. way to attack the message and not the messenger

7.3.4 Span prediction analysis

An analysis of the spans predicted by ToxicXLMR-MTL against the ground truth toxic spans is provided here, with example comments. Ground truth spans are bold-faced, and model-predicted spans are underlined. For these example comments, the ToxicXLMR-STL-SP model predicted an empty span. From these examples, we can observe that the MTL model is able to identify subtle toxicity based on the context to some extent, though the model does not actually understand the text as humans do. It can be observed that comment (a) is toxic due to the word depraved, but the human annotator tagged a different word, homosexual. This disparity is due to differences in the annotators' opinions when tagging toxic spans.

  1. a)

    Yes they would, then hey could oh so self righteous about how depraved all homosexuals are and this just proves it! (Prediction matches the ground truth.)

  2. b)

    The 2nd Amendment is a white privilege. Black folks are shot on the mere suspicion of being armed. That includes black children (Predicted span is a subset of ground truth.)

  3. c)

    You’re as full of excrement as Trump, your malignant narcissist hero (There is a mismatch between predicted and ground truth spans. As both of these words are actually toxic, the model might have picked one among them.)

  4. d)

    How much longer does this country have to suffer this disgusting moron as president (Predicted span is subset of ground truth.)

  5. e)

    Blah, Blah, Blah. The Obama administration sucked, end of story (Prediction matches the ground truth.)

8 Conclusion and future work

This paper demonstrated a multi-task model that performs toxic comment classification while predicting the toxic spans as rationales. We curated a dataset containing both class label and toxic span information to train the MTL model, as no such dataset exists to date. The multi-task model is built using a transformer (XLMR) based Bi-LSTM CRF architecture. The proposed model has better classification and span prediction performance, as it uses a joint loss function over the related tasks. The model exhibited a competitive F1 score on the SemEval-2021 Task 5 dataset for the toxic span prediction task when compared to the models on the leaderboard. The proposed model showed a 4% improvement in F1 score for the classification task when compared to the STL model. It is observed that the type of transformer model and the corpus utilized for pre-training and fine-tuning also affect the performance of the proposed model. The empirical evidence from testing on the out-of-domain datasets HASOC and OLID shows that the proposed model is effective for both in-domain and out-of-domain evaluation. In the future, we intend to develop models using semantic embeddings that take finer context and the actors in the text into account in order to handle the subtle differences in the usage of toxic keywords.