Benchmarking a large Twitter dataset for Arabic emotion analysis

El-Sayed, Ahmed; Abougabal, Mohamed; Lazem, Shaimaa

doi:10.1007/s42452-023-05437-1


Case Study
Open access
Published: 29 July 2023

Volume 5, article number 224, (2023)

SN Applied Sciences




Abstract

The scarcity of annotated Arabic emotion datasets limits the effectiveness of emotion detection applications. Techniques such as semi-supervised self-learning annotation and transfer learning from models trained on large annotated datasets have been increasingly considered as economical alternatives for researchers working on Arabic sentiment and emotion detection tasks. Examining the quality of data annotated using these techniques is particularly important in applications that require detecting emotions with high granularity, such as mental health applications. This paper contributes an approach to benchmarking a large Arabic emotion dataset annotated with semi-supervised self-learning. By extracting the lexical correlation of each emotion and conducting content analysis, the quality of the annotation approach is demonstrated. Further, using a comprehensive set of experiments, we evidence the effectiveness of transfer learning from the large dataset to smaller datasets in emotion and sentiment classification tasks.

Article highlights

Semi-supervised self-learning annotation is a promising and economically viable approach to annotating large Arabic datasets.
MARBERT model outperforms AraBERT in emotion classification tasks conducted on the large Egyptian dialect dataset ArECTD.
Transfer learning from ArECTD pre-trained model generalizes to smaller datasets in emotion and sentiment classification tasks.


1 Introduction

In recent years, the amount of user content available on the Internet has dramatically grown allowing people to debate and share their thoughts and feelings [1]. Emotions play a crucial role in human communication and decision-making processes. Social media platforms have become an important source for understanding and detecting public opinions using sentiment and emotion analysis [2].

Text-based emotion analysis [3] is regarded as an elaborate form of sentiment analysis, where emotions rather than the polarity or the sentiment of the content are identified and analyzed. Machine and deep learning algorithms are trained to categorize text into different emotions such as happiness, sadness, and anger. Emotion analysis holds great potential for understanding the emotions expressed in textual data. Its importance is evident in applications such as understanding customer reviews, detecting depression from social media posts, and monitoring political reactions on social media [2].

There are three main types of emotional state representation [4]: discrete, dimensional, and componential. Discrete models, such as Paul Ekman’s model [5], classify emotions into six basic categories: anger, disgust, fear, happiness, sadness, and surprise. Dimensional models represent emotions as points in a two-dimensional or three-dimensional space. The pleasure, arousal, and dominance (PAD) model [6] is widely recognized as the most famous three-dimensional emotion model. Componential models consider various factors that contribute to or influence an emotional state, with combinations of basic emotions forming more complex emotions. For example, Robert Plutchik’s model [7] includes eight basic emotions, and other emotions, such as love and trust, are considered combinations of these basic emotions.

The state-of-the-art accuracies of emotion classification tasks in the Arabic language are low compared to English [1]. This is despite the fact that Arabic is the fourth most used language on the Internet and social media [8]. One important reason for the lag in Arabic language processing research, despite the plethora of available social media data, is the cost of data annotation. Emotion-annotated Arabic datasets are scarce [1, 9]. The cost of the annotation process is a major hindrance to the generation of large emotion datasets and, subsequently, to the use of emotion detection in real-life Arabic applications. More and larger datasets are required to advance the research and applications in this domain.

Recent efforts by El-Sayed et al. [10] and Al-Laith et al. [11] have used semi-supervised self-learning techniques to annotate large Arabic datasets. This approach utilizes a manually annotated subset of data to automatically label the remaining unlabeled part of the dataset. The approach holds great potential due to its low cost compared to fully manual annotation. Nonetheless, examining the quality of semi-supervised self-learning annotation approaches is necessary, especially if the target applications attempt to distinguish between fine-granularity emotions such as sadness and anger, as is the case in depression detection applications. To the best of our knowledge, none of the research efforts that used this approach to annotate Arabic datasets have provided details on the quality of the annotated data.

This paper aims to fill this gap by examining the quality of a large Arabic dataset, ArECTD, that was annotated using a semi-supervised self-learning approach [10]. First, the linguistic quality of the annotation is evaluated. Second, transfer learning is used to study the generalization from the ArECTD model to other Arabic datasets. Transfer learning is another practical alternative for researchers who only have access to small annotated Arabic datasets, whereby they could leverage the power of a large specialized pre-trained emotion model instead of training a model from scratch on a small dataset. Our study provides evidence for the quality of the semi-supervised self-learning annotation approach.

The remainder of this paper is structured in the following manner. Section 2 reviews the related work. An overview of the ArECTD dataset and the annotation process is presented in Sect. 3. The methodology for benchmarking the ArECTD dataset is illustrated in Sect. 4. The results are depicted in Sect. 5, and discussed in Sect. 6. Finally, we conclude and discuss possible future directions in Sect. 7.

2 Related work

This section reviews the previous work on Arabic emotion analysis from text, summarizing and comparing the different available emotion datasets.

Alqahtani and Alothaim [1] provided a comprehensive survey of emotion detection in Arabic data. The available models, resources, datasets, and tools for Arabic emotion classification of tweets were presented. Further, the survey discussed the available pre-trained language models that can be fine-tuned for the emotion detection task. One of the reported challenges was the limited availability of dialect resources and of datasets from Arab countries with certain dialects. Developing transformer models that recognize local Arabic dialects would have a positive impact on improving real-life applications of Arabic language models.

Abdul-Mageed et al. [12] proposed two new Arabic-specific BERT models, AraBERT and MARBERT. BERT [13], short for Bidirectional Encoder Representations from Transformers, is a transformer-based language model developed by Google. It is pre-trained on a large corpus of text and can be fine-tuned for various Natural Language Processing (NLP) tasks. AraBERT is a BERT-base architecture that was trained using 61GB of Arabic text. A limited amount of the training data for AraBERT is in the Egyptian dialect, with most of it being in Modern Standard Arabic (MSA). MARBERT is another BERT-base architecture that aimed at increasing the capacity of transformer models to handle dialectal Arabic. The training data was enriched by adding a set of 1 Billion Arabic tweets. The final training dataset had around 128 GB of text, with tweets making up over half of it. MARBERT has been evaluated in sentiment analysis, social meaning prediction, topic categorization, dialect identification, and named entity recognition, among other NLP tasks.

One of the earlier works on Arabic emotion analysis is the research by Al-Khatib and El-Beltagy [14], who collected an Arabic dialectal dataset of tweets originating from Egypt using the “Olympics” hashtags in the period between July 2016 and August 2016. It consisted of 10K tweets and was manually annotated into eight categories: sympathy, joy, love, anger, sadness, fear, surprise, and none. Machine learning classifiers were tested on this dataset, and the best results were achieved by the Complement Naive Bayes classifier with an overall accuracy of 68.12%.

El-Sayed et al. [15] proposed an improved version of the pipeline in [14], adding preprocessing techniques to handle emojis, experimenting with multiple word embeddings, and comparing the performance of machine learning and deep learning models. The combination of AraBERT word embeddings and the AraBERT deep learning model outperformed the other tested techniques with an accuracy of 75.8%. Further, a weighted voting ensemble of Logistic Regression, SVM, AraBERT, and GRU achieved an accuracy of 75.88%.

Alaa Mansy et al. [16] proposed an ensemble deep learning approach for emotion detection in a small dataset. The proposed model tested three deep learning models: Bi-LSTM, Bi-GRU, and MARBERT. The experiments were evaluated using the SemEval-2018-Task1-Ar-Ec dataset, which was collected using emotion words, and annotated using 11 class labels (anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, and trust). It was split into 2,278 tweets for training, 585 tweets for development, and 1,518 tweets for testing. Grid search was used to determine the best threshold and weights for the ensemble model. The best threshold was 0.34 and the best weights were 0.72, 0.1, and 0.18 for MARBERT, BiLSTM, and BiGRU respectively. The experiments achieved macro F-scores of 0.692, 0.653, and 0.664 for MARBERT, BiLSTM, and BiGRU respectively. The ensemble model outperformed the individual models achieving a 0.701 F-score.
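The weighted ensemble with a decision threshold described above can be illustrated with a minimal sketch. The weights and threshold are the values reported for [16]; the combination rule (a weighted average of per-label probabilities followed by thresholding), the function name, the label subset, and the probabilities are assumptions made purely for illustration.

```python
# Weights and threshold as reported for the ensemble in [16].
WEIGHTS = {"MARBERT": 0.72, "BiLSTM": 0.10, "BiGRU": 0.18}
THRESHOLD = 0.34


def ensemble_labels(probs_by_model, labels):
    """Weighted average of per-label probabilities, then thresholding.

    The combination rule is an assumption; [16] may combine models differently.
    """
    selected = []
    for i, label in enumerate(labels):
        score = sum(WEIGHTS[m] * probs_by_model[m][i] for m in WEIGHTS)
        if score >= THRESHOLD:
            selected.append(label)
    return selected


# Hypothetical per-model probabilities for one tweet (multi-label setting).
labels = ["anger", "joy", "sadness"]
probs = {
    "MARBERT": [0.60, 0.05, 0.30],
    "BiLSTM": [0.40, 0.20, 0.50],
    "BiGRU": [0.50, 0.10, 0.45],
}
chosen = ensemble_labels(probs, labels)
print(chosen)
```

Because the task is multi-label, more than one emotion can pass the threshold for the same tweet.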

Al-Laith and Alenezi [17] collected 5.5 million Arabic tweets on the COVID-19 topic in the period from January to August 2020. This large dataset was annotated using combined rule-based and neural network automatic approaches. An LSTM was used to classify all of the tweets into six emotions (anger, disgust, fear, joy, sadness, and surprise) and two tweet types (symptom and non-symptom tweets), achieving an 83% F1-score for emotion classification and a 75% F1-score for symptom classification.

Al-Laith et al. [11] introduced a semi-supervised self-learning method aimed at expanding an Arabic sentiment annotated corpus using unlabeled data. The technique involved training a neural network with a set of models on a manually labeled dataset of 15,000 tweets. These models were then utilized to extend the corpus into a much larger Arabic sentiment corpus which consisted of 4.5 million tweets. To train and evaluate the final corpus, a long-short-term memory (LSTM) deep learning classifier was employed achieving an accuracy of 70%.

El-Sayed et al. [10] collected an Egyptian Arabic COVID-19 Twitter dataset, in the period from January 2020 to May 2021. The dataset contained approximately 78,870 tweets and has been annotated using a hybrid approach that combined manual and semi-supervised methods. The annotation process involved assigning ten commonly observed emotion labels to the tweets, namely sarcasm, sadness, anger, fear, sympathy, joy, hope, surprise, love, and none.

Omara et al. [18] proposed using a transfer learning approach to mitigate the lack of Arabic emotion-annotated datasets. Using transfer learning, the learned knowledge in a sentiment analysis task was used to conduct emotion classification under the hypothesis that both domains share a similar feature space. The sentiment dataset consisted of both MSA and dialectal Arabic. The emotion dataset was constructed from two Arabic datasets. The first was collected in [14], while the second was used in the SemEval-2018 task [19]. Emotion classification with CNN and transfer learning achieved an accuracy of 95.24% compared to the Naive Bayes model (68.12%).

Transfer learning in emotion detection tasks has been further examined in other languages. Demszky et al. [20] introduced a large English manually annotated dataset, GoEmotions, consisting of 58k Reddit comments over 27 emotion categories including a neutral label. BERT-base model was fine-tuned on the dataset achieving F1-score of 0.46. Transfer learning experiments were conducted with existing emotion benchmarks to show that the dataset generalizes well to other domains and different emotion taxonomies. Three different fine-tuning setups, namely Baseline, Freeze, and NoFreeze were conducted. In the Baseline setup, BERT was fine-tuned only on the target dataset. In the Freeze setup, BERT was fine-tuned on GoEmotions, then transfer learning was performed by replacing the final dense layer, freezing all layers besides the last layer, and fine-tuning on the target dataset. The NoFreeze setup is the same as Freeze, except they did not freeze the bottom layers.

Table 1 summarizes and compares the details of the Arabic emotion datasets in the literature. A close study of the table shows the small number of available emotion datasets. Manually annotated datasets are small, containing just under 10K tweets. The larger datasets [10, 11, 17] were not labeled manually, and the quality of their annotation was not studied qualitatively. It is important to evaluate the quality of non-manual annotation approaches, especially if the models are used in real-life emotion detection applications to detect high-granularity emotions. Related work additionally shows that transfer learning is a possible alternative to handle cases where the training data is limited in size. The proposed research attempts to examine the quality of a semi-automatically annotated dataset and to explore the effectiveness of transfer learning from the model generated using this dataset to smaller datasets. We take ArECTD as a case study because it is the only Arabic dataset that is labeled for emotion detection using a semi-supervised self-learning technique.

Table 1 Summary of Arabic Emotion Datasets


3 ArECTD dataset

ArECTD is a large-scale Arabic Egyptian dialect dataset [10] that was collected from Twitter in the period from the 1st of January 2020 to the 30th of May 2021, focusing on COVID-19 Egyptian dialect tweets. TwitterScraper [21] was used to obtain 1,597,939 tweets using COVID-19 hashtags, which were then cleaned by filtering out non-Arabic tweets, tweets from news pages, and non-Egyptian tweets. The cleaned version of the dataset contained 78,870 tweets authored by 56,176 unique users. The tweets of ArECTD were classified into ten labels: sadness, fear, sarcasm, sympathy, anger, surprise, love, joy, hope, and none (neutral).

The dataset was annotated using a combination of manual annotation and a semi-supervised self-learning technique. Around 11,000 tweets were manually annotated by a group of Computer Science undergraduate students. The self-learning process, shown in Fig. 1, used three different classifiers: Logistic Regression, AraBERT, and GRU which were trained on the manually annotated dataset. Each unlabeled tweet was processed by the three classifiers. If the tweet had a prediction probability greater than or equal to a threshold of 0.8 for a certain emotion by the three classifiers, it was assigned to this emotion label, removed from the unlabeled set, and added to the training set. The process was repeated until the unlabeled set was empty. The final distribution of emotions of the dataset after the annotation is shown in Fig. 2.
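The self-learning loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy keyword classifier standing in for Logistic Regression, AraBERT, and GRU, and the example texts are all hypothetical. The stopping guard (`max_rounds` and the no-progress check) is also an assumption, added so that the sketch terminates even when some items never reach the threshold.

```python
THRESHOLD = 0.8  # agreement threshold reported for ArECTD


def self_label(labeled, unlabeled, trainers, threshold=THRESHOLD, max_rounds=50):
    """Grow `labeled` with items that every classifier labels confidently.

    labeled   -- list of (text, label) pairs (the manually annotated seed)
    unlabeled -- list of texts
    trainers  -- functions mapping a labeled list to a predictor;
                 a predictor maps a text to (label, probability)
    """
    labeled, pending = list(labeled), list(unlabeled)
    for _ in range(max_rounds):  # guard against non-termination (assumption)
        if not pending:
            break
        predictors = [train(labeled) for train in trainers]
        still_pending = []
        for text in pending:
            votes = [predict(text) for predict in predictors]
            same_label = len({label for label, _ in votes}) == 1
            confident = all(prob >= threshold for _, prob in votes)
            if same_label and confident:
                labeled.append((text, votes[0][0]))
            else:
                still_pending.append(text)
        if len(still_pending) == len(pending):
            break  # no item was labeled in this round
        pending = still_pending
    return labeled, pending


def keyword_trainer(labeled):
    """Toy stand-in for a real classifier: memorises word-to-label pairs."""
    vocab = {}
    for text, label in labeled:
        for word in text.split():
            vocab.setdefault(word, label)

    def predict(text):
        words = text.split()
        hits = [vocab[w] for w in words if w in vocab]
        if not hits:
            return ("none", 0.0)
        best = max(set(hits), key=hits.count)
        return (best, hits.count(best) / len(words))

    return predict


seed = [("sad loss", "sadness"), ("happy win", "joy")]
pool = ["sad sad loss", "happy day", "happy win happy win happy day", "unknown words"]
labeled, leftover = self_label(seed, pool, [keyword_trainer] * 3)
print(len(labeled), leftover)
```

In this toy run, "happy day" is only labeled in the second round, after the newly added training item has taught the classifiers the word "day". This mirrors how the training set grows between iterations.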

4 Methodology

The methodology used to examine the quality of the annotation technique is described in the following subsections.

4.1 Examining the linguistic quality of the annotation

The linguistic and content quality of the annotation of ArECTD [10] was studied in two steps. First, using the approach proposed in [20], the words that are highly correlated with a certain emotion class in contrast to all other classes were identified by computing the log odds ratio of each word for each emotion [22]. The words were inspected by the authors to flag those that do not conform with the emotion label.
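As an illustration of the first step, the sketch below computes a z-scored log odds ratio for each word of one emotion class against all other classes. It assumes the variant with a Dirichlet prior taken from the overall corpus counts; the exact prior and tokenization used for ArECTD are not stated here, and the function name, `prior_scale` parameter, and toy documents are hypothetical.

```python
import math
from collections import Counter


def z_scored_log_odds(docs, target_label, prior_scale=1.0):
    """docs: list of (tokens, label). Returns {word: z-score} for target vs. rest."""
    target, rest = Counter(), Counter()
    for tokens, label in docs:
        (target if label == target_label else rest).update(tokens)

    # Prior pseudo-counts proportional to overall word frequency (assumption).
    prior = {w: prior_scale * (target[w] + rest[w]) for w in set(target) | set(rest)}
    a0 = sum(prior.values())
    n_t, n_r = sum(target.values()), sum(rest.values())

    scores = {}
    for w, a_w in prior.items():
        y_t, y_r = target[w], rest[w]
        # Difference of smoothed log odds of the word in the two groups.
        delta = (math.log((y_t + a_w) / (n_t + a0 - y_t - a_w))
                 - math.log((y_r + a_w) / (n_r + a0 - y_r - a_w)))
        variance = 1.0 / (y_t + a_w) + 1.0 / (y_r + a_w)
        scores[w] = delta / math.sqrt(variance)
    return scores


# Toy tokenized tweets (hypothetical).
docs = [
    (["stay", "safe", "pray"], "sympathy"),
    (["pray", "recover", "soon"], "sympathy"),
    (["lockdown", "angry", "stay"], "anger"),
    (["angry", "prices", "lockdown"], "anger"),
]
scores = z_scored_log_odds(docs, "sympathy")
top_words = sorted(scores, key=scores.get, reverse=True)[:5]
print(top_words)
```

Words used equally often inside and outside the class (here "stay") score zero, while words concentrated in the class receive positive scores. Such scores can be used to rank the most distinctive words of an emotion class.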

Second, the content of the dataset was examined using a manual comprehensive inspection of the dataset as follows. The process started by identifying and collecting the list of trending topics during the time of dataset collection with help from the website [23]. The tweets in the dataset were manually inspected, and where possible were assigned to the trending topics using judgment sampling. Judgment sampling [24] is a non-probability sampling technique that involves selecting samples based on the researcher’s judgment or expertise. In this sampling technique, the researchers select samples that they believe are representative of the population or which possess the desired characteristics for the study. We leveraged our local knowledge and expertise as well as the date of the tweets to judge whether a tweet should belong to a certain trending topic. Each topic had a number of emotional responses represented by the relevant tweets. We identified the dominant emotion for each of the trending topics and checked them for plausibility (i.e., the emotion was a reasonable response to the topic, for example, it would be odd to have love or joy as a response to a topic such as a lockdown). By examining the distinctive words in each emotion class and the tweets of the dominant emotions in the trending topics, we were able to evidence the quality of the annotation approach.

4.2 Examining ArECTD generalizability using transfer learning

Transfer deep learning is a methodology by which the knowledge that has been learned in a deep learning model from a source domain can be reused in a different but related target domain [18]. Recently, deep transfer learning models have attained state-of-the-art outcomes due to the fact that a deep learning model can learn more domain-invariant features [25]. These models have already been pre-trained on a huge unlabeled corpus, so they can be fine-tuned on available small datasets of labeled data for different NLP applications [25].

Given the scarcity of large Arabic emotion datasets such as ArECTD, we examined how the deep learning models developed for ArECTD might be used in sentiment and emotion analysis tasks where the labeled data in the target domain is limited. Based on the results from the literature [10, 12, 14, 15, 16, 17, 18], we hypothesized that transferring knowledge from ArECTD models to a smaller dataset would increase model accuracy for the target domain. Transfer learning experiments were conducted on two target datasets. Dataset 1 [14] contains about 10K Arabic tweets, mostly in the Egyptian dialect, annotated into eight categories: sympathy, joy, love, anger, sadness, fear, surprise, and none. Dataset 2 [26] is an Arabic sentiment analysis dataset. It consists of about 10K tweets, which were annotated using four labels: objective, subjective positive, subjective negative, and subjective mixed. Objective refers to neutral statements without personal opinions, while subjective positive indicates positive sentiments or opinions. Subjective negative is used for negative sentiments or opinions, and subjective mixed represents a combination of both positive and negative sentiments.

To develop the ArECTD transfer learning model, two state-of-the-art Arabic deep learning transformer models, AraBERT [27] and MARBERT [12], were compared using the pipeline described in [15] and illustrated in Fig. 3. The models were evaluated based on accuracy, precision, recall, and weighted F-score metrics. The dataset was split into 60% training, 20% validation, and 20% testing. The hyper-parameters for the AraBERT and MARBERT models were tuned, and the specific values used in the experiments are shown in Table 2. The results indicated that MARBERT outperformed AraBERT, leading us to select MARBERT for the subsequent transfer learning experiments.
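The 60/20/20 split can be sketched as below. Whether the original split was stratified by emotion or seeded is not stated, so this sketch assumes a plain shuffled split with a fixed seed; the function name is hypothetical and integer indices stand in for the 78,870 tweets.

```python
import random


def split_60_20_20(records, seed=0):
    """Shuffle and split into 60% training, 20% validation, 20% testing."""
    items = list(records)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility (assumption)
    n_train = len(items) * 3 // 5       # integer arithmetic avoids float rounding
    n_val = len(items) // 5
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


train, val, test = split_60_20_20(range(78870))
print(len(train), len(val), len(test))
```

For a dataset of 78,870 records this yields 47,322 training, 15,774 validation, and 15,774 testing records.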

Subsequently, three transfer learning experiments were conducted on the target datasets. In the first experiment, Experiment I, MARBERT was fine-tuned on each of the target datasets independently. This experiment served as a baseline for comparing the results of transfer learning conducted in the following experiments. In the second experiment, Experiment II, MARBERT was first fine-tuned on ArECTD. Then transfer learning was performed by freezing the trainable layers to prevent any updates to the weights and biases and fine-tuning on the target datasets. In the third experiment, Experiment III, MARBERT was first fine-tuned on ArECTD, and transfer learning was carried out by fine-tuning the model on the target datasets without freezing any layers. In the three experiments, the target datasets were tested by varying the amount of training data, ranging from as few as 100 records to the entire dataset. The output models were evaluated based on accuracy, precision, recall, and weighted F-score metrics.
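The evaluation metrics named above can be computed as in the following minimal sketch. Only the F-score is described as weighted; applying the same support weighting to precision and recall is an assumption made here, and the function name and toy labels are hypothetical.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall and F-score."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f_score = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        prec = tp / predicted if predicted else 0.0
        rec = tp / n
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        weight = n / total  # each class counts in proportion to its support
        precision += weight * prec
        recall += weight * rec
        f_score += weight * f1
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    return accuracy, precision, recall, f_score


# Toy gold and predicted labels (hypothetical).
y_true = ["joy", "joy", "anger", "fear"]
y_pred = ["joy", "anger", "anger", "fear"]
accuracy, precision, recall, f_score = weighted_scores(y_true, y_pred)
print(accuracy, precision, recall, f_score)
```

Weighting by support means that frequent emotion classes influence the reported score more than rare ones, which matters for an imbalanced label distribution.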

Table 2 Hyperparameters for AraBERT and MARBERT


5 Results

5.1 Linguistic analysis of ArECTD

The top five significant words for each emotion (shown in Table 3) were determined by calculating the z-scored log odds ratios [20]. The significant words showed a sensible association with the emotion class and evidenced the quality of the annotation approach. It was observed that the emotion labels “sympathy”, “hope”, and “sarcasm” have high association scores compared to the emotions “surprise”, “joy”, and “love”, whereas “anger”, “fear”, “sadness”, and “none” have moderate association ratios. These results suggest that some emotions are more verbally implicit and may require more context to be interpreted, annotated, and analyzed.

Table 3 Top five words associated with each emotion. The rounded z-scored log odds ratios in the parentheses indicate the significance of the association


5.2 Content analysis of ArECTD

Content analysis of ArECTD showed that communication about COVID-19 peaked in April 2020. During that month, tweets expressing the negative emotion of fear dominated (11.7%), closely followed by hope (11.2%) and sarcasm (10.9%). Sadness, hope, and anger were the top three dominant emotions during the first year of the pandemic. The most discussed topics for the dominant emotions of sadness, hope, and anger were derived as described in Sect. 4 and are listed in Tables 4, 5, and 6.

Table 4 Topics of ‘sadness’ emotion


Table 5 Topics of ‘hope’ emotion


Table 6 Topics of ‘anger’ emotion


A monthly breakdown in Table 7 shows that either hope or sarcasm came second to the fear, anger, or sadness emotions. We believe this is a reflection of Egyptian culture, which is rather optimistic and uses humor and sarcasm to express resistance and resilience in the face of tough times. Such cultural traits, especially the prevalence of hope, could be leveraged by the government to communicate important COVID-19 information to the public.

Table 7 The dominant emotions in each month in the period from the 1st of January 2020 to the 30th of May 2021 after excluding the neutral tweets


5.3 ArECTD classification model using deep learning

Applying the pipeline in [15] on the ArECTD dataset, and using AraBERT for embedding and classification resulted in an accuracy of 70.01%, 70.5% precision, 68.9% recall, and 71.7% F-score. Table 8 summarizes the performance of the model on ArECTD. The model obtained the highest F-score for the sadness (0.71) and anger (0.71) emotions. On the other hand, the lowest F-score was obtained for sympathy (0.65) and fear (0.68) emotions.

Replacing AraBERT with MARBERT improved the overall results, achieving 72.5% accuracy, 72.1% precision, 73.0% recall, and 72.8% F-score. We attribute the improved results to the fact that MARBERT was trained on dialectal Arabic. Table 9 shows the performance evaluation for each emotion. The model obtained the highest F-score for sympathy (0.93) and sadness (0.89), which are notably higher than their AraBERT counterparts. The F-score for anger (0.77) was also higher than its AraBERT counterpart. On the other hand, the model obtained the lowest F-score for surprise (0.62) and love (0.64), which were lower than their AraBERT counterparts (0.69 and 0.70, respectively).

Table 8 Emotions performance evaluation for AraBERT model


Table 9 Emotion performance evaluation for MARBERT model


5.4 Transfer learning experiments

Table 10 and Fig. 4 demonstrate the F-score results for transfer learning on Dataset 1, while Table 11 and Fig. 5 demonstrate the F-score results for transfer learning on Dataset 2.

Figure 4 illustrates the comparison of the three experiments in terms of F-score performance on Dataset 1. For dataset sizes up to 120 records, transfer learning with the layers frozen (Experiment II) outperforms fine-tuning without transfer learning (Experiment I). However, for sizes that exceed 120 records, the F-score starts to decline slightly (by between 0.09% and 0.23%). For dataset sizes below 400 records, transfer learning without freezing the layers (Experiment III) yields higher results compared to fine-tuning (Experiment I), with an F-score increase of up to 0.25%. For dataset sizes exceeding 1000 records, the performance of Experiment III and Experiment I becomes comparable, while Experiment II exhibits the lowest performance.

A similar trend can be observed in the results depicted in Fig. 5, which illustrates the comparison of the three experiments in terms of F-score performance on Dataset 2. For dataset sizes below 400 records, Experiment III achieves higher results than Experiment I, with an F-score increase of up to 0.22%. For dataset sizes above 500 records, Experiment III and Experiment I demonstrate comparable performance. Experiment II consistently yields the poorest performance across all dataset sizes.

Table 10 Transfer learning experiments on dataset 1 [14]


Table 11 Transfer learning experiments on dataset 2 [26]


6 Discussion

Overall, the linguistic analysis demonstrated the quality of the annotated dataset. Using content analysis in the case study demonstrated that meaningful insights were drawn from the dataset by identifying the dominant emotions as linked to the trending topics and government decisions. The decisions related to disruptions of the regular daily activities were met with fear such as the case for closing schools, mosques, churches, and gardens as well as the suspension of international flights. These precautionary decisions were necessary and unavoidable. The fear reaction might be handled in the future by communicating the temporary nature of the decisions, the success of those policies to combat pandemics in the past, and raising awareness about the importance of abiding by social distancing decisions. The rumors about the low effectiveness of the Chinese vaccines that were approved for emergency use in Egypt triggered anger on social media, where people requested availing the American vaccines instead. One way to respond to the angry public could have been by explaining the scientific procedure behind creating different types of vaccines and that all vaccines would provide a certain level of protection against COVID-19 complications and help reduce hospitalizations and death rates. Further, the lack of clarity about the mid-term and end-of-semester college exam times has elicited anger reactions. Highlighting the contingency plan at the beginning of the academic year might have helped the students and parents plan better for disruptions during the year.

A comparison of the AraBERT and MARBERT results (Fig. 6) showed that MARBERT outperformed AraBERT for five emotions (sarcasm, sadness, anger, fear, and sympathy) and had a comparable F-score for the joy and hope emotions. AraBERT yielded a better performance for the love and surprise emotions as well as the neutral category (none). The results suggest the superiority of MARBERT in the emotion classification of ArECTD, most likely because dialectal Arabic was used to train MARBERT. Further investigation is needed to understand the variability in performance between the AraBERT and MARBERT models across emotion classes.

The transfer learning results suggest that the ArECTD model generalizes well to other emotion and sentiment classification tasks, especially where training data is rather limited. Fine-tuning the ArECTD pre-trained model on a smaller dataset (10K) yielded better performance for very small training sizes and performed almost the same as training MARBERT on the target dataset for large sizes. This suggests that the ArECTD pre-trained model can be used for smaller datasets with the NoFreeze technique without significant loss of performance. Moreover, this pre-trained model will save time and resources by eliminating the need to train deep-learning transformer models such as MARBERT from scratch.

6.1 Limitation of the study

One limitation of this study is that the ArECTD dataset was collected only during the first year of the COVID-19 pandemic in Egypt. Therefore, the emotions and sentiments expressed in the dataset may not be representative of the current state of public opinion in Egypt towards COVID-19. Additionally, the study only focused on Egyptian Arabic, and the results may not generalize to other Arabic dialects or languages. Furthermore, the study only compared the performance of two deep learning models, MARBERT and AraBERT, for emotion classification tasks. While the results showed that MARBERT outperformed AraBERT for most emotions, it is possible that other models or approaches may yield better performance.

7 Conclusion and future work

Applications that use emotion detection require high-quality annotation in order to distinguish between emotions of fine granularity. A prominent example is distinguishing between emotions such as anger, fear, and sadness which might lead to the early discovery of depression in Arabic text. Semi-supervised self-learning annotation techniques hold great promise as practical and economic approaches that could increase the number of annotated Arabic emotion datasets. This paper contributes an approach to examine the quality of the semi-supervised self-learning annotation in a large Arabic emotion dataset (ArECTD) that was annotated using 10 emotions including neutral. The linguistic quality of the annotation approach was studied by demonstrating the lexical correlation between the words in each of the ten emotion classes. Further, a detailed content analysis of ArECTD evidenced that the annotation approach was reliable in understanding the public emotional reactions during the first year of COVID-19. These results indicate the potential of using semi-supervised self-learning approaches to label Arabic emotion datasets.

Further experiments were conducted to develop two deep-learning classification models for ArECTD achieving accuracies of 72.5% (MARBERT) and 70.01% (AraBERT). Transfer learning from the model developed using MARBERT was effective in the case of smaller datasets in the domains of emotion and sentiment analysis.

Future work should extend existing efforts by exploring the case of multiple emotion labels and by examining the classifiers’ performance using Explainable AI techniques [28]. Additionally, more transformer models such as RoBERTa, GPT, and AraT5 can be tested and compared with the MARBERT results. Transfer learning can also be employed in the annotation process: starting from a small subset of labeled data, a pre-trained model is fine-tuned and then used to extract features from the unlabeled data and annotate them. Further, it would be interesting to compare the findings of this study with similar studies conducted in other languages to see how emotions toward COVID-19 may vary across different cultures and languages.
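
The annotation idea in this paragraph (train on a small labeled subset, then label the unlabeled data) can be sketched as a confidence-thresholded self-training loop. To keep the example self-contained, a tiny hand-written naive Bayes classifier is swapped in for the fine-tuned transformer; the function names, the threshold value, and the toy tokens are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """Fit a multinomial naive Bayes model on (tokens, label) pairs."""
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_proba(model, tokens):
    """Return {label: probability}; out-of-vocabulary tokens are ignored."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)  # add-one smoothing
        score = math.log(n / total)
        for t in tokens:
            if t in vocab:
                score += math.log((word_counts[label][t] + 1) / denom)
        log_scores[label] = score
    top = max(log_scores.values())
    exp = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(exp.values())
    return {label: v / norm for label, v in exp.items()}


def self_train(labeled, unlabeled, threshold=0.9, max_rounds=5):
    """Move confidently predicted items from the pool into the labeled set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model = train_nb(labeled)
        confident, remaining = [], []
        for tokens in pool:
            probs = predict_proba(model, tokens)
            label = max(probs, key=probs.get)
            if probs[label] >= threshold:
                confident.append((tokens, label))
            else:
                remaining.append(tokens)
        if not confident:
            break  # nothing new was labeled, so further rounds cannot help
        labeled.extend(confident)
        pool = remaining
    return labeled, pool


# Toy data (hypothetical): two seed annotations and three unlabeled items.
seed = [(["afraid", "virus"], "fear"), (["happy", "recovered"], "joy")]
unlabeled = [["afraid", "lockdown"], ["happy", "news"], ["lockdown", "scared"]]
labeled, pool = self_train(seed, unlabeled, threshold=0.6)
```

In this toy run the third item can only be labeled in the second round, after "lockdown" has been learned from a pseudo-labeled example. The same mechanism means early mistakes can propagate, which is why the confidence threshold and a manual check of sampled annotations matter.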

Data availability

The datasets generated and/or analyzed during the current study are available at https://huggingface.co/datasets/emotone_ar, https://github.com/mahmoudnabil/ASTD, and https://github.com/Ahmed-elsayed-mahmoud/ArECTD

References

Ghadah Alqahtani, Abdulrahman Alothaim (2022) Emotion analysis of arabic tweets: language models and available resources. Front Artif Intell. https://doi.org/10.3389/frai.2022.843038
Baali Massa, Ghneim Nada (2019) Emotion analysis of Arabic tweets using deep learning approach. J Big Data 6:10. https://doi.org/10.1186/s40537-019-0252-x
Azam Nazish, Tahir Bilal, Mehmood Muhammad Amir (2020) Sentiment and emotion analysis of text: a survey on approaches and resources. Lan Technol 87
Kołakowska Agata, Landowska Agnieszka, Szwoch Mariusz, Szwoch Wioleta, Wróbel Michał (2015) Modeling emotions for affect-aware applications. In: Stanislaw Wrycza (ed) Information Systems Development and Applications. Faculty of Management University of Gdańsk, Poland, pp 55–67
Ekman Paul (1992) An argument for basic emotions. Cogn Emot 6(3–4):169–200
Bakker Iris, Van Der Voordt Theo, Vink Peter, De Boon Jan (2014) Pleasure, arousal, dominance: mehrabian and russell revisited. Current Psychol 33:405–421
Plutchik Robert (1982) A psychoevolutionary theory of emotions
Internet World Stats. Internet world users by language, 2023. https://www.internetworldstats.com/stats7.html
Mazen El-Masri, Nabeela Berardinelli, Hanady Ahmed (2017) Successes and challenges of arabic sentiment analysis research: a literature review. Soc Netw Anal Min 7(22):10. https://doi.org/10.1007/s13278-017-0474-x
El-Sayed Ahmed, Lazem Shaimaa, Abougabal Mohamed (2021) An Arabic Egyptian Dialect COVID-19 Twitter Dataset (ArECTD). 9th International Japan-Africa Conference on Electronics, Communications, and Computations (JAC-ECC), 179–182. https://doi.org/10.1109/JAC-ECC54461.2021.9691451
Ali Al-Laith, Muhammad Shahbaz, Alaskar Hind F, Asim Rehmat (2021) Arasencorpus: a semi-supervised approach for sentiment annotation of a large Arabic text corpus. Appl Sci. https://doi.org/10.3390/app11052434
Abdul-Mageed Muhammad, Elmadany AbdelRahim, Nagoudi ElMoatez Billah (2021) ARBERT & MARBERT: Deep bidirectional transformers for Arabic. In 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 7088–7105. https://doi.org/10.18653/v1/2021.acl-long.551
Devlin Jacob, Chang Ming-Wei, Lee Kenton, Toutanova Kristina (2019) BERT: pre-training of deep bidirectional transformers for language understanding. Association for Computational Linguistics. https://doi.org/10.18653/v1/n19-1423
Al-Khatib Amr, El-Beltagy Samhaa R (2017) Emotional tone detection in Arabic tweets. In CICLing. https://doi.org/10.1007/978-3-319-77116-8_8
El-Sayed Ahmed, Lazem Shaimaa, Abougabal Mohamed (2021) An Improved Emotion-based Analysis of Arabic Twitter Data using Deep Learning. 9th International Japan-Africa Conference on Electronics, Communications, and Computations (JAC-ECC), 175–178 https://doi.org/10.1109/JAC-ECC54461.2021.9691416
Mohamed Alaa, Rady Sherine, Gharib Tarek (2022) An ensemble deep learning approach for emotion detection in arabic tweets. International Journal of Advanced Computer Science and Applications, 13: 01 https://doi.org/10.14569/IJACSA.2022.01304112
Al-Laith Ali, Alenezi Mamdouh (2021) Monitoring people’s emotions and symptoms from Arabic tweets during the covid-19 pandemic. Information 12(2):86. https://doi.org/10.3390/info12020086
Omara Eslam, Mosa Mervat, Ismail Nabil (2019) Emotion analysis in arabic language applying transfer learning. 15th International Computer Engineering Conference (ICENCO), 204–209. https://doi.org/10.1109/ICENCO48310.2019.9027295
Mohammad Saif, Bravo-Marquez Felipe, Salameh Mohammad, Kiritchenko Svetlana (2018) SemEval-2018 task 1: Affect in tweets. In Proceedings of The 12th International Workshop on Semantic Evaluation, 1–17, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. https://doi.org/10.18653/v1/S18-1001
Demszky Dorottya, Movshovitz-Attias Dana, Ko Jeongwoo, Cowen Alan, Nemade Gaurav, Ravi Sujith (2020) Goemotions: A dataset of fine-grained emotions. 58th Annual Meeting of the Association for Computational Linguistics, 4040–4054. https://doi.org/10.18653/v1/2020.acl-main.372
Taspinar Ahmet (2023) Github: Twitterscraper. https://github.com/taspinar/twitterscraper. Accessed 15 April
Monroe Burt, Colaresi Michael, Quinn Kevin (2009) Fightin’ words: lexical feature selection and evaluation for identifying the content of political conflict. Political Anal 16:08. https://doi.org/10.1093/pan/mpn018
Egypt Today. Egypt News, 2020. https://www.egypttoday.com/Tag/4008/Coronavirus. Accessed 15 April 2023
Office of the Comptroller of the Currency (O.C.C) (2020) Comptroller’s Handbook: sampling Methodologies. 1.0 edn. Comptroller of the Currency, Washington
Bensoltane Rajae, Zaki Taher (2021) Towards arabic aspect-based sentiment analysis: a transfer learning-based approach. Soc Netw Anal Min 12(1):7. https://doi.org/10.1007/s13278-021-00794-4
Nabil Mahmoud, Aly Mohamed, Atiya Amir (2015) ASTD: Arabic sentiment tweets dataset. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2515–2519, Lisbon, Portugal. Association for Computational Linguistics. https://doi.org/10.18653/v1/D15-1299
Antoun Wissam, Baly Fady, Hajj Hazem (2020) AraBERT: transformer-based model for Arabic language understanding. In LREC 2020 Workshop Language Resources and Evaluation Conference 11–16 May 2020
Abdelwahab Youmna, Kholief Mohamed, Sedky Ahmed Ahmed Hesham (2022) Justifying arabic text sentiment analysis using explainable ai (xai): Lasik surgeries case study. Information 13(11):536

Acknowledgements

Not Applicable

Funding

Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB). No funds, grants, or other support were received.

Author information

Authors and Affiliations

Computer and Systems Engineering Department, Faculty of Engineering, Alexandria University, Alexandria, Egypt
Ahmed El-Sayed & Mohamed Abougabal
City of Scientific Research and Technological Applications, New Borg El-Arab, Egypt
Shaimaa Lazem

Contributions

AE developed the pipeline, performed the experiments, interpreted the results, and wrote the manuscript under the supervision of SL and MA. All authors made equal contributions to the conception and analysis of the work. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Ahmed El-Sayed.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

Not Applicable

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

El-Sayed, A., Abougabal, M. & Lazem, S. Benchmarking a large Twitter dataset for Arabic emotion analysis. SN Appl. Sci. 5, 224 (2023). https://doi.org/10.1007/s42452-023-05437-1

Received: 18 February 2023
Accepted: 06 July 2023
Published: 29 July 2023
DOI: https://doi.org/10.1007/s42452-023-05437-1
