1 Introduction

In recent years, the amount of user-generated content on the Internet has grown dramatically, allowing people to debate and share their thoughts and feelings [1]. Emotions play a crucial role in human communication and decision-making processes. Social media platforms have become an important source for understanding and detecting public opinions using sentiment and emotion analysis [2].

Text-based emotion analysis [3] is regarded as an elaborate form of sentiment analysis, in which emotions, rather than the polarity or sentiment of the content, are identified and analyzed. Machine and deep learning algorithms are trained to categorize text into different emotions such as happiness, sadness, and anger. Emotion analysis holds great potential for understanding and analyzing emotions expressed in textual data. Its importance is evident in applications such as understanding customer reviews, detecting depression from social media posts, and monitoring political reactions on social media [2].

There are three main types of emotional state representation [4]: discrete, dimensional, and componential. Discrete models, such as Paul Ekman’s model [5], classify emotions into six basic categories: anger, disgust, fear, happiness, sadness, and surprise. Dimensional models represent emotions as points in a two-dimensional or three-dimensional space. The pleasure, arousal, and dominance (PAD) model [6] is widely recognized as the most famous three-dimensional emotion model. Componential models consider various factors that contribute to or influence an emotional state, with combinations of basic emotions forming more complex emotions. For example, Robert Plutchik’s model [7] includes eight basic emotions, and other emotions, such as love (a combination of joy and trust) and optimism (a combination of anticipation and joy), are formed by combining these basic emotions.
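
To make the contrast concrete, the three representation types can be sketched as simple data structures. This is a hypothetical illustration; the class and field names are ours, not taken from [4, 5, 6, 7]:

```python
from dataclasses import dataclass
from enum import Enum, auto

# Discrete (Ekman): an emotion is one of six categorical labels.
class EkmanEmotion(Enum):
    ANGER = auto()
    DISGUST = auto()
    FEAR = auto()
    HAPPINESS = auto()
    SADNESS = auto()
    SURPRISE = auto()

# Dimensional (PAD): an emotion is a point in a continuous 3-D space.
@dataclass
class PADPoint:
    pleasure: float   # displeasure (-1.0) .. pleasure (+1.0)
    arousal: float    # calm (-1.0) .. excited (+1.0)
    dominance: float  # submissive (-1.0) .. dominant (+1.0)

# Componential (Plutchik): complex emotions arise as combinations
# ("dyads") of the eight basic emotions.
PLUTCHIK_DYADS = {
    frozenset({"joy", "trust"}): "love",
    frozenset({"anticipation", "joy"}): "optimism",
}
```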

The state-of-the-art accuracies of emotion classification tasks in the Arabic language are low compared to English [1]. This is despite the fact that Arabic is the fourth most used language on the Internet and social media [8]. One important reason for the lag in Arabic language processing research, despite the plethora of available social media data, is the cost of data annotation. Emotion-annotated Arabic datasets are scarce [1, 9]. The cost of the annotation process is a major hindrance to the generation of large emotion datasets and, subsequently, to using emotion detection in real-life Arabic applications. More and larger datasets are required to advance research and applications in this domain.

Recent efforts by El-Sayed et al. [10] and Al-Laith et al. [11] have used semi-supervised self-learning techniques to annotate large Arabic datasets. This approach utilizes a manually annotated subset of the data to automatically label the remaining unlabeled part of the dataset. The approach holds great potential due to its low cost compared to fully manual annotation. Nonetheless, examining the quality of semi-supervised self-learning annotation is necessary, especially if the target applications attempt to distinguish between fine-granularity emotions such as sadness and anger, as in depression detection applications. To the best of our knowledge, none of the research efforts that used this annotation approach for Arabic datasets have provided details on the quality of the annotated data.

This paper aims to fill this gap by examining the quality of a large Arabic dataset, ArECTD, that was annotated using a semi-supervised self-learning approach [10]. First, the linguistic quality of the annotation is evaluated. Second, transfer learning is used to study the generalization from the ArECTD model to other Arabic datasets. Transfer learning is a practical alternative for researchers who only have access to small annotated Arabic datasets, whereby they can leverage the power of a large specialized pre-trained emotion model instead of training a model from scratch on a small dataset. Our study provides evidence for the quality of the semi-supervised self-learning annotation approach.

The remainder of this paper is structured as follows. Section 2 reviews the related work. An overview of the ArECTD dataset and the annotation process is presented in Sect. 3. The methodology for benchmarking the ArECTD dataset is illustrated in Sect. 4. The results are presented in Sect. 5 and discussed in Sect. 6. Finally, we conclude and discuss possible future directions in Sect. 7.

2 Related work

This section reviews the previous work on Arabic emotion analysis from text, summarizing and comparing the different available emotion datasets.

Alqahtani and Alothaim [1] provided a comprehensive survey of emotion detection in Arabic data, presenting the available models, resources, datasets, and tools for Arabic emotion classification of tweets. Further, the survey discussed the available pre-trained language models that can be fine-tuned for the emotion detection task. One of the reported challenges was the limited availability of dialect resources and datasets for certain Arabic dialects. Developing transformer models that recognize local Arabic dialects would have a positive impact on real-life applications of Arabic language models.

Abdul-Mageed et al. [12] proposed two new Arabic-specific BERT models, ARBERT and MARBERT. BERT [13], short for Bidirectional Encoder Representations from Transformers, is a transformer-based language model developed by Google. It is pre-trained on a large corpus of text and can be fine-tuned for various Natural Language Processing (NLP) tasks. ARBERT is a BERT-base architecture that was trained on 61GB of Arabic text, most of it in Modern Standard Arabic (MSA) with only a limited amount in the Egyptian dialect. MARBERT is another BERT-base architecture, aimed at increasing the capacity of transformer models to handle dialectal Arabic. Its training data was enriched with a set of 1 billion Arabic tweets, yielding a final training set of around 128GB of text, with tweets making up over half of it. MARBERT has been evaluated on sentiment analysis, social meaning prediction, topic categorization, dialect identification, and named entity recognition, among other NLP tasks.

One of the earlier works on Arabic emotion analysis is the research by Al-Khatib and El-Beltagy [14], who collected an Arabic dialectal dataset originating from Egypt using the “Olympics” hashtags between July and August 2016. It consisted of 10K tweets and was manually annotated into eight classes: sympathy, joy, love, anger, sadness, fear, surprise, and none. Machine learning classifiers were tested on this dataset, and the best results were achieved by the Complement Naive Bayes classifier with an overall accuracy of 68.12%.

El-Sayed et al. [15] proposed an improvement over the pipeline in [14], applying additional preprocessing techniques to handle emojis, experimenting with multiple word embeddings, and comparing the performance of machine learning and deep learning models. Using AraBERT for both word embedding and classification outperformed the other tested techniques with an accuracy of 75.8%. Further, a weighted voting ensemble of Logistic Regression, SVM, AraBERT, and GRU achieved an accuracy of 75.88%.

Mansy et al. [16] proposed an ensemble deep learning approach for emotion detection in a small dataset, combining three deep learning models: Bi-LSTM, Bi-GRU, and MARBERT. The experiments were evaluated on the SemEval-2018-Task1-Ar-Ec dataset, which was collected using emotion words and annotated with 11 class labels (anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, and trust). It was split into 2,278 tweets for training, 585 tweets for development, and 1,518 tweets for testing. Grid search was used to determine the best threshold and weights for the ensemble model. The best threshold was 0.34, and the best weights were 0.72, 0.1, and 0.18 for MARBERT, BiLSTM, and BiGRU, respectively. The experiments achieved macro F-scores of 0.692, 0.653, and 0.664 for MARBERT, BiLSTM, and BiGRU, respectively. The ensemble model outperformed the individual models, achieving a 0.701 F-score.

Al-Laith and Alenezi [17] collected 5.5 million Arabic tweets on the COVID-19 topic in the period from January to August 2020. This large dataset was annotated using a combination of rule-based and neural network automatic approaches. LSTM was used to classify all of the tweets into six emotions (anger, disgust, fear, joy, sadness, and surprise) and two tweet types (symptom and non-symptom tweets), achieving an 83% F1-score for emotion classification and a 75% F1-score for symptom classification.

Al-Laith et al. [11] introduced a semi-supervised self-learning method aimed at expanding an Arabic sentiment-annotated corpus using unlabeled data. The technique involved training a set of neural network models on a manually labeled dataset of 15,000 tweets. These models were then utilized to extend the corpus into a much larger Arabic sentiment corpus of 4.5 million tweets. To train and evaluate the final corpus, a long short-term memory (LSTM) deep learning classifier was employed, achieving an accuracy of 70%.

El-Sayed et al. [10] collected an Egyptian Arabic COVID-19 Twitter dataset in the period from January 2020 to May 2021. The dataset contained 78,870 tweets and was annotated using a hybrid approach that combined manual and semi-supervised methods. The annotation process involved assigning ten commonly observed emotion labels to the tweets, namely sarcasm, sadness, anger, fear, sympathy, joy, hope, surprise, love, and none.

Omara et al. [18] proposed a transfer learning approach to mitigate the lack of Arabic emotion-annotated datasets. Using transfer learning, the knowledge learned in a sentiment analysis task was reused for emotion classification under the hypothesis that the two domains share a similar feature space. The sentiment dataset consisted of both MSA and dialectal Arabic. The emotion dataset was constructed from two Arabic datasets: the first was collected in [14], while the second was used in the SemEval-2018 task [19]. Emotion classification with CNN and transfer learning achieved an accuracy of 95.24%, compared to 68.12% for the Naive Bayes model.

Transfer learning for emotion detection has also been examined in other languages. Demszky et al. [20] introduced GoEmotions, a large, manually annotated English dataset consisting of 58k Reddit comments labeled over 27 emotion categories plus a neutral label. A BERT-base model fine-tuned on the dataset achieved an F1-score of 0.46. Transfer learning experiments were conducted with existing emotion benchmarks to show that the dataset generalizes well to other domains and different emotion taxonomies. Three fine-tuning setups, namely Baseline, Freeze, and NoFreeze, were evaluated. In the Baseline setup, BERT was fine-tuned only on the target dataset. In the Freeze setup, BERT was first fine-tuned on GoEmotions, and transfer learning was then performed by replacing the final dense layer, freezing all layers besides the last one, and fine-tuning on the target dataset. The NoFreeze setup is the same as Freeze, except that the bottom layers were not frozen.

Table 1 summarizes and compares the details of the Arabic emotion datasets in the literature. A close study of the table shows the small number of available emotion datasets. The manually annotated datasets are small, each containing just under 10K tweets. The larger datasets [10, 11, 17] were not labeled manually; however, the quality of their annotation has not been studied qualitatively. It is important to evaluate the quality of non-manual annotation approaches, especially if the resulting models are used in real-life applications to detect fine-granularity emotions. Related work additionally shows that transfer learning is a possible alternative for cases where the training data is limited in size. The proposed research examines the quality of a semi-automatically annotated dataset and explores the effectiveness of transfer learning from the model generated using this dataset to smaller datasets. We take ArECTD as a case study because it is the only Arabic dataset labeled for emotion detection using a semi-supervised self-learning technique.

Table 1 Summary of Arabic Emotion Datasets

3 ArECTD dataset

ArECTD is a large-scale Arabic Egyptian dialect dataset [10] that was collected from Twitter in the period from the 1st of January 2020 till the 30th of May 2021, focusing on COVID-19 Egyptian dialect tweets. TwitterScraper [21] was used to obtain 1,597,939 tweets using COVID-19 hashtags, which were then cleaned by filtering out non-Arabic tweets, tweets from news pages, and non-Egyptian tweets. The cleaned version of the dataset contained 78,870 tweets authored by 56,176 unique users. The tweets of ArECTD were classified into ten labels: sadness, fear, sarcasm, sympathy, anger, surprise, love, joy, hope, and none (neutral).

The dataset was annotated using a combination of manual annotation and a semi-supervised self-learning technique. Around 11,000 tweets were manually annotated by a group of Computer Science undergraduate students. The self-learning process, shown in Fig. 1, used three different classifiers: Logistic Regression, AraBERT, and GRU, which were trained on the manually annotated subset. Each unlabeled tweet was processed by the three classifiers. If all three classifiers predicted the same emotion with a probability greater than or equal to a threshold of 0.8, the tweet was assigned that emotion label, removed from the unlabeled set, and added to the training set. The process was repeated until the unlabeled set was empty. The final distribution of emotions in the dataset after annotation is shown in Fig. 2.
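
A minimal sketch of one round of this self-learning loop, assuming scikit-learn-style classifiers that expose fit and predict_proba (in [10] the three classifiers were Logistic Regression, AraBERT, and GRU; the function and variable names here are ours):

```python
import numpy as np

THRESHOLD = 0.8  # minimum prediction probability required from every classifier

def self_learning_round(classifiers, X_train, y_train, X_unlabeled):
    """Retrain all classifiers, then promote the unlabeled tweets on which
    every classifier agrees with probability >= THRESHOLD."""
    for clf in classifiers:
        clf.fit(X_train, y_train)
    probas = [clf.predict_proba(X_unlabeled) for clf in classifiers]

    promoted_idx, promoted_labels = [], []
    for i in range(len(X_unlabeled)):
        votes = [int(np.argmax(p[i])) for p in probas]
        confidences = [float(np.max(p[i])) for p in probas]
        # all classifiers must predict the same emotion, each with prob >= 0.8
        if len(set(votes)) == 1 and min(confidences) >= THRESHOLD:
            promoted_idx.append(i)
            promoted_labels.append(votes[0])
    return promoted_idx, promoted_labels

# The round is repeated, moving the promoted tweets from the unlabeled set
# into the training set, until the unlabeled set is empty.
```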

Fig. 1
figure 1

Semi-supervised self-learning technique used in ArECTD annotation. Replicated from [10]

Fig. 2
figure 2

ArECTD distribution over the ten emotions

4 Methodology

The methodology used to examine the quality of the annotation technique is described in the following subsections.

4.1 Examining the linguistic quality of the annotation

The linguistic and content quality of the ArECTD annotation [10] was studied in two steps. First, using the approach proposed in [20], the words that are highly correlated with a certain emotion class, in contrast to all other classes, were identified using the log odds ratio of each word with respect to each emotion [22]. The authors inspected the resulting words to flag those that do not conform with the emotion label.
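
For illustration, the word-association computation, the z-scored log odds ratio with an informative Dirichlet prior as used in [20, 22], can be sketched as follows (the function and variable names are ours; the corpus-wide word counts serve as the prior):

```python
import numpy as np

def zscored_log_odds(counts_class, counts_rest, prior_counts):
    """Z-scored log odds ratio of each word for one emotion class versus
    all other classes, smoothed with an informative Dirichlet prior.

    counts_class[i]: frequency of word i in the target emotion class
    counts_rest[i]:  frequency of word i in all other classes
    prior_counts[i]: frequency of word i in the whole corpus (the prior)
    """
    a0 = prior_counts.sum()
    n1, n2 = counts_class.sum(), counts_rest.sum()
    # smoothed log odds of each word within each group
    log_odds_class = np.log((counts_class + prior_counts) /
                            (n1 + a0 - counts_class - prior_counts))
    log_odds_rest = np.log((counts_rest + prior_counts) /
                           (n2 + a0 - counts_rest - prior_counts))
    delta = log_odds_class - log_odds_rest
    variance = (1.0 / (counts_class + prior_counts) +
                1.0 / (counts_rest + prior_counts))
    return delta / np.sqrt(variance)  # high z-score => distinctive word
```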

Second, the content of the dataset was examined through a comprehensive manual inspection, as follows. The process started by identifying and collecting the list of trending topics during the time of dataset collection with the help of the website in [23]. The tweets in the dataset were manually inspected and, where possible, assigned to the trending topics using judgment sampling. Judgment sampling [24] is a non-probability sampling technique that involves selecting samples based on the researcher’s judgment or expertise. In this technique, researchers select samples that they believe are representative of the population or that possess the desired characteristics for the study. We leveraged our local knowledge and expertise, as well as the dates of the tweets, to judge whether a tweet should belong to a certain trending topic. Each topic had a number of emotional responses represented by the relevant tweets. We identified the dominant emotion for each trending topic and checked it for plausibility (i.e., whether the emotion was a reasonable response to the topic; for example, it would be odd to have love or joy as a response to a topic such as a lockdown). By examining the distinctive words in each emotion class and the tweets of the dominant emotions in the trending topics, we were able to evidence the quality of the annotation approach.

4.2 Examining ArECTD generalizability using transfer learning

Deep transfer learning is a methodology by which the knowledge learned by a deep learning model in a source domain can be reused in a different but related target domain [18]. Recently, deep transfer learning models have attained state-of-the-art results because deep models can learn more domain-invariant features [25]. Such models have already been pre-trained on a huge unlabeled corpus, so they can be fine-tuned on small labeled datasets for different NLP applications [25].

Given the scarcity of large Arabic emotion datasets such as ArECTD, we examined how the deep learning models developed for ArECTD might be used in sentiment and emotion analysis tasks where the labeled data in the target domain is limited. Based on the results in the literature [10, 12, 14–18], we hypothesized that transferring knowledge from the ArECTD models to a smaller dataset would increase model accuracy for the target domain. Transfer learning experiments were conducted on two target datasets. Dataset 1 [14] contains about 10K Arabic tweets, mostly in Egyptian dialect, annotated into eight categories: sympathy, joy, love, anger, sadness, fear, surprise, and none. Dataset 2 [26] is an Arabic sentiment analysis dataset. It consists of about 10K tweets annotated with four labels: objective, subjective positive, subjective negative, and subjective mixed. Objective refers to neutral statements without personal opinions, while subjective positive indicates positive sentiments or opinions. Subjective negative is used for negative sentiments or opinions, and subjective mixed represents a combination of both positive and negative sentiments.

To develop the ArECTD transfer learning model, two state-of-the-art Arabic transformer models, AraBERT [27] and MARBERT [12], were compared by employing each in the pipeline described in [15] and illustrated in Fig. 3. The models were evaluated using accuracy, precision, recall, and weighted F-score metrics. The dataset was split into 60% training, 20% validation, and 20% testing. The hyper-parameters of the AraBERT and MARBERT models were tuned, and the specific values used in the experiments are shown in Table 2. The results indicated that MARBERT outperformed AraBERT, leading us to select MARBERT for the subsequent transfer learning experiments.
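
A minimal fine-tuning sketch using the publicly available Hugging Face checkpoints (UBC-NLP/MARBERT for MARBERT and aubmindlab/bert-base-arabert for AraBERT); the hyper-parameter values below are placeholders for illustration, not the tuned values of Table 2:

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "UBC-NLP/MARBERT"  # or "aubmindlab/bert-base-arabert"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=10)  # the ten ArECTD emotion labels

def tokenize(batch):
    # truncate/pad tweets to a fixed length before feeding the model
    return tokenizer(batch["text"], truncation=True, max_length=128,
                     padding="max_length")

args = TrainingArguments(
    output_dir="arectd-marbert",
    num_train_epochs=3,              # placeholder; see Table 2
    per_device_train_batch_size=32,  # placeholder; see Table 2
    learning_rate=2e-5,              # placeholder; see Table 2
)

# train_ds and val_ds would be the tokenized 60%/20% splits of ArECTD:
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=val_ds)
# trainer.train()
```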

Subsequently, three transfer learning experiments were conducted on the target datasets, as sketched below. In the first experiment, Experiment I, MARBERT was fine-tuned on each of the target datasets independently. This experiment served as a baseline for comparing the results of the transfer learning conducted in the following experiments. In the second experiment, Experiment II, MARBERT was first fine-tuned on ArECTD; transfer learning was then performed by freezing the pre-trained layers, to prevent any updates to their weights and biases, and fine-tuning on the target datasets. In the third experiment, Experiment III, MARBERT was first fine-tuned on ArECTD, and transfer learning was carried out by fine-tuning the model on the target datasets without freezing any layers. In all three experiments, the target datasets were tested by varying the amount of training data, ranging from as few as 100 records to the entire dataset. The output models were evaluated using accuracy, precision, recall, and weighted F-score metrics.
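
The difference between the three setups reduces to which weights stay trainable. A sketch, assuming the ArECTD fine-tuned model was saved to a hypothetical local checkpoint (the ignore_mismatched_sizes flag lets a freshly initialized classification head replace the ten-way ArECTD head):

```python
from transformers import AutoModelForSequenceClassification

# Experiment I (baseline): load the original MARBERT checkpoint and
# fine-tune directly on the target dataset (no ArECTD step).

# Experiments II and III: start from the model fine-tuned on ArECTD.
model = AutoModelForSequenceClassification.from_pretrained(
    "arectd-marbert/checkpoint-final",  # hypothetical local path
    num_labels=8,                       # e.g., the eight classes of Dataset 1
    ignore_mismatched_sizes=True,       # drop the ten-way ArECTD head
)

# Experiment II (Freeze): freeze the encoder so only the new head trains.
for param in model.base_model.parameters():
    param.requires_grad = False

# Experiment III (NoFreeze): omit the loop above and fine-tune all weights.
```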

Fig. 3
figure 3

The general pipeline of Arabic emotion analysis

Table 2 Hyperparameters for AraBERT and MARBERT

5 Results

5.1 Linguistic analysis of ArECTD

The top five significant words for each emotion (shown in Table 3) were determined by calculating the z-scored log odds ratios [20]. The significant words showed a sensible association with their emotion class and evidenced the quality of the annotation approach. It was observed that the emotion labels “sympathy”, “hope”, and “sarcasm” have high association scores compared to the emotions “surprise”, “joy”, and “love”, whereas “anger”, “fear”, “sadness”, and “none” have moderate association ratios. These results suggest that some emotions are more verbally implicit and may require more context to be interpreted, annotated, and analyzed.

Table 3 Top five words associated with each emotion. The rounded z-scored log odds ratios in the parentheses indicate the significance of the association

5.2 Content analysis of ArECTD

Content analysis of ArECTD showed that communication about COVID-19 peaked in April 2020; during that month, the negative fear emotion dominated the tweets (11.7%), closely followed by hope (11.2%) and sarcasm (10.9%). Sadness, hope, and anger were the top three dominant emotions during the first year of the pandemic. The top discussed topics for the dominant sadness, hope, and anger emotions were derived as described in Sect. 4 and are listed in Tables 4, 5, and 6.

Table 4 Topics of ‘sadness’ emotion
Table 5 Topics of ‘hope’ emotion
Table 6 Topics of ‘anger’ emotion

A monthly breakdown in Table 7 shows that either hope or sarcasm came second to the fear, anger, or sadness emotions. We believe this is a reflection of the Egyptian culture, which is rather optimistic and uses humor and sarcasm to express resistance and resilience in the face of tough times. Such cultural traits, especially the prevalence of hope, could be leveraged by the government to communicate important COVID-19 information to the public.

Table 7 The dominant emotions in each month in the period from the 1st of January 2020 till the 30th of May 2021 after excluding the neutral tweets

5.3 ArECTD classification model using deep learning

Applying the pipeline in [15] to the ArECTD dataset, using AraBERT for embedding and classification, resulted in an accuracy of 70.01%, a precision of 70.5%, a recall of 68.9%, and an F-score of 71.7%. Table 8 summarizes the performance of the model on ArECTD. The model obtained the highest F-scores for the sadness (0.71) and anger (0.71) emotions. On the other hand, the lowest F-scores were obtained for the sympathy (0.65) and fear (0.68) emotions.

Replacing AraBERT with MARBERT improved the overall results, achieving 72.5% accuracy, 72.1% precision, 73.0% recall, and a 72.8% F-score. We attribute the improvement to the fact that MARBERT was trained on dialectal Arabic. Table 9 shows the performance evaluation for each emotion. The model obtained the highest F-scores for sympathy (0.93) and sadness (0.89), which are notably higher than their AraBERT counterparts. The F-score for anger (0.77) was also higher than its AraBERT counterpart. On the other hand, the model obtained the lowest F-scores for surprise (0.62) and love (0.64), which were lower than their AraBERT counterparts (0.69 and 0.70, respectively).

Table 8 Emotions performance evaluation for AraBERT model
Table 9 Emotion performance evaluation for MARBERT model

5.4 Transfer learning experiments

Table 10 and Fig. 4 demonstrate the F-score results for transfer learning on Dataset 1, while Table 11 and Fig. 5 demonstrate the F-score results for transfer learning on Dataset 2.

Figure 4 compares the three experiments in terms of F-score performance on dataset 1. For dataset sizes up to 120 records, transfer learning with the layers frozen (Experiment II) outperforms fine-tuning without transfer learning (Experiment I). For sizes exceeding 120 records, however, its F-score starts to decline slightly (by 0.09% to 0.23%). For dataset sizes below 400 records, transfer learning without freezing the layers (Experiment III) yields higher results than fine-tuning alone (Experiment I), with an F-score increase of up to 0.25%. For dataset sizes exceeding 1000 records, the performance of Experiment III and Experiment I becomes comparable, while Experiment II exhibits the lowest performance.

A similar trend can be observed in the results depicted in Fig. 5 which illustrates the comparison of the three experiments in terms of F-score performance on dataset 2. For dataset sizes below 400 records, Experiment III achieves higher results than Experiment I, with an F-score increase of up to 0.22%. For dataset sizes above 500 records, Experiment III and Experiment I demonstrate comparable performance. Experiment II consistently yields the poorest performance across all dataset sizes.

Table 10 Transfer learning experiments on dataset 1 [14]
Table 11 Transfer learning experiments on dataset 2 [26]
Fig. 4
figure 4

Transfer learning results for dataset 1 in terms of F-scores

Fig. 5
figure 5

Transfer learning results for dataset 2 in terms of F-scores

6 Discussion

Overall, the linguistic analysis demonstrated the quality of the annotated dataset. The content analysis in the case study demonstrated that meaningful insights could be drawn from the dataset by linking the dominant emotions to the trending topics and government decisions. Decisions that disrupted regular daily activities were met with fear, as in the case of closing schools, mosques, churches, and gardens, as well as the suspension of international flights. These precautionary decisions were necessary and unavoidable. The fear reaction might be handled in the future by communicating the temporary nature of the decisions, the success of such policies in combating past pandemics, and raising awareness about the importance of abiding by social distancing decisions. The rumors about the low effectiveness of the Chinese vaccines that were approved for emergency use in Egypt triggered anger on social media, where people requested that the American vaccines be made available instead. One way to respond to the angry public could have been to explain the scientific procedure behind creating different types of vaccines and that all vaccines provide a certain level of protection against COVID-19 complications and help reduce hospitalization and death rates. Further, the lack of clarity about the mid-term and end-of-semester college exam times elicited angry reactions. Highlighting the contingency plan at the beginning of the academic year might have helped students and parents plan better for disruptions during the year.

A comparison of the AraBERT and MARBERT results (Fig. 6) showed that MARBERT outperformed AraBERT for five emotions (sarcasm, sadness, anger, fear, and sympathy) and had a comparable F-score for the joy and hope emotions. AraBERT yielded a better performance for the love and surprise emotions as well as the neutral category (none). The results suggest the superiority of MARBERT in the emotion classification of ArECTD, most likely because MARBERT was trained on dialectal Arabic. Further investigation is needed to understand the variability in performance between the AraBERT and MARBERT models across emotion classes.

The transfer learning results suggest that the ArECTD model generalizes well to other emotion and sentiment classification tasks, especially where training data is limited. Fine-tuning the ArECTD pre-trained model on a smaller dataset (10K) yielded better performance for very small training sizes and performed almost on par with training MARBERT directly on the target dataset for large sizes. This suggests that the ArECTD pre-trained model can be used for smaller datasets with the NoFreeze technique without significant loss of performance. Moreover, this pre-trained model saves time and resources by eliminating the need to train deep-learning transformer models such as MARBERT from scratch.

Fig. 6
figure 6

Comparison between F-score of AraBERT and MARBERT

6.1 Limitation of the study

One limitation of this study is that the ArECTD dataset was collected only during the first year of the COVID-19 pandemic in Egypt. Therefore, the emotions and sentiments expressed in the dataset may not be representative of the current state of public opinion in Egypt towards COVID-19. Additionally, the study only focused on Egyptian Arabic, and the results may not generalize to other Arabic dialects or languages. Furthermore, the study only compared the performance of two deep learning models, MARBERT and AraBERT, for emotion classification tasks. While the results showed that MARBERT outperformed AraBERT for most emotions, it is possible that other models or approaches may yield better performance.

7 Conclusion and future work

Applications that use emotion detection require high-quality annotation in order to distinguish between emotions of fine granularity. A prominent example is distinguishing between emotions such as anger, fear, and sadness, which might lead to the early discovery of depression in Arabic text. Semi-supervised self-learning annotation techniques hold great promise as practical and economical approaches that could increase the number of annotated Arabic emotion datasets. This paper contributes an approach to examining the quality of the semi-supervised self-learning annotation of a large Arabic emotion dataset (ArECTD) that was annotated with ten emotion labels including neutral. The linguistic quality of the annotation approach was studied by demonstrating the lexical correlation between the words in each of the ten emotion classes and their emotion labels. Further, a detailed content analysis of ArECTD evidenced that the annotation approach was reliable in capturing the public emotional reactions during the first year of COVID-19. These results indicate the potential of using semi-supervised self-learning approaches to label Arabic emotion datasets.

Further experiments were conducted to develop two deep-learning classification models for ArECTD achieving accuracies of 72.5% (MARBERT) and 70.01% (AraBERT). Transfer learning from the model developed using MARBERT was effective in the case of smaller datasets in the domains of emotion and sentiment analysis.

Future work should extend existing efforts by exploring the use of multiple emotion labels and examining the classifiers’ performance using Explainable AI techniques [28]. Additionally, more transformer models, such as RoBERTa, GPT, and AraT5, can be tested and compared with the MARBERT results. Transfer learning can also be employed in the annotation process itself, starting from a small subset of labeled data: a pre-trained model is fine-tuned on the labeled data and then used to annotate the unlabeled data. Further, it would be interesting to compare the findings of this study with similar studies conducted in other languages to see how emotions toward COVID-19 may vary across cultures and languages.