1 Introduction

Artificial intelligence (AI) has been used in many classification and recognition tasks, including sentiment analysis. However, in the race to build ever more accurate models, researchers have not always paid attention to uneven system capability development [1]. Increasingly, researchers are realizing that high accuracy on standard benchmark datasets does not guarantee successful deployment in real-world scenarios [2,3,4]. Many demographic factors, such as gender, age, and ethnicity, play a role in real-world scenarios that standard benchmark datasets do not always capture. Because of this issue, many machine learning models are never deployed to production. As researchers become aware of the problem, the reliability, credibility, and fairness of artificial intelligence have attracted attention, especially for human-in-the-loop or human-centered AI, and recent research has raised concerns over fairness implications [5]. A growing number of researchers have realized that the “one-size-fits-all” approach does not fit at all in AI systems and are committed to designing AI algorithms that are trustworthy [6], fair [7], and unbiased [8, 9]. Machine learning and AI tools are meant to make human life easier, with the expectation that these solutions do what humans would have done, only faster, more consistently, and without bias. The irony is that such solutions are frequently observed to perform better for some demographic groups than for others, making biased judgments and escalating inequality [10, 11]. Hence, utilizing as much demographic information as possible could bring a solution closer to how humans would have acted themselves, without bias.

The fairness of AI algorithms is an evolving research area that stems from the general need for decision-making to be free from bias and discrimination [12]. Schmitz et al. experimented with balancing the accuracy and fairness of multimodal architectures for emotion recognition and found that the fairest bimodal model is an audio + video fusion [13]. Ricci Lara et al. discussed the meaning of fairness in medical image processing and commented on potential sources of bias and available strategies to mitigate it [14]. To explore fairness in dialogue systems, Liu et al. constructed a benchmark dataset and proposed quantitative measures for understanding fairness in dialogue models; they found that popular dialogue models exhibited considerable prejudice against different genders and races, and they provided two methods to mitigate bias in dialogue systems [15]. In the context of current issues in healthcare, Chen et al. summarized fairness in machine learning and its intersectional fields, outlining how algorithmic biases arise in existing clinical workflows and the healthcare disparities that result [16]. Although the research in [1, 16,17,18] shows that AI algorithms can be biased against specific populations or groups in various situations, there is a gap in understanding fairness in audio sentiment analysis.

We used 442 audio files with male and female voices, transformed them into spectrograms, and used a bag-of-visual-words feature representation together with machine learning algorithms such as Random Forest (RF), Support Vector Machines (SVM), and K-nearest Neighbors (KNN) to investigate the fairness of audio sentiment analysis models in terms of overall accuracy equality between female and male genders. We found that models generated with the same parameters and algorithms do not perform the same way for different genders’ audio files; hence, using a gender-agnostic model to analyze the sentiments of different genders results in poor accuracy in audio sentiment analysis. In a previous study, a model was generated and tested using only female audio files, achieving an accuracy of 76% [19]. In this study, we generated a gender-agnostic model using both male and female audio files together, and the accuracy decreased to 66%. This notable difference in accuracy prompted us to conduct experiments that revealed gender-related bias within the machine learning algorithms for audio sentiment analysis. We therefore separated the audio files into male and female groups and built gender-specific trained models: the female-model was trained using female audio data, while the male-model used male audio data. This approach aimed to address the poor accuracy of the gender-agnostic model. We then tested the accuracy of each gender-specific model against a gender-specific dataset and a dataset representing both genders. Even after performing hyperparameter optimization of the machine learning algorithms, the model trained with female audio files (female-model), which achieved the best accuracy of 78%, did not perform well with male audio files, attaining an accuracy of 57%. Similarly, the model trained with male audio files (male-model), which achieved an accuracy of 74%, did not perform well with female audio files, attaining an accuracy of 60%. The observed differences in accuracy verify that a personalized approach based on gender is needed to obtain better accuracy.

Our main contributions are the following:

  • We provide three pieces of evidence highlighting the need to address fairness concerns in machine learning models used in audio sentiment analysis tasks.

  • We offer a resolution to mitigate bias by constructing models for audio sentiment analysis that consider the demographic factor of gender.

  • We demonstrate the importance of considering the demographic factor of gender in audio sentiment analysis tasks, providing a valuable insight that could serve as a reference for future researchers.

The rest of this paper is organized as follows. Section 2 provides a literature review of audio sentiment analysis. Section 3 presents the methodology of the study, followed by the results in Sect. 4. Section 5 provides the discussion and conclusion, and Sect. 6 outlines the limitations of the study.

2 Related Work

Numerous studies investigate audio sentiment analysis using voices from both male and female speakers. However, there is a scarcity of follow-up experiments examining the potential impact of gender on the accuracy of a single model employed for both genders [20,21,22]. Jia and SungChu reported an accuracy of 60.1% for the top-performing audio sentiment analysis model in multimodal sentiment analysis before fusion [20]. Accounting for the demographic factor of gender and crafting a gender-specific model could potentially have yielded higher accuracy. Furthermore, the results in [21] substantiate our assertion regarding the significance of gender, even though the authors do not explicitly discuss its impact on their models. They employed three distinct datasets to assess the accuracy of their fully convolutional neural network models. Notably, the Toronto Emotional Speech Set (TESS) [23], an acted dataset comprising 2800 concise audio samples delivered by two actresses, achieved an accuracy rate of 99.03%. Conversely, another dataset, the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [24], containing audio recordings from 24 individuals (12 male and 12 female), achieved an accuracy rate of 75.28%. These accuracy figures highlight a distinct trend: models trained on a single gender’s voice consistently outperform those using voices from both genders. Male and female voices tend to have different pitch ranges, tones, and resonance [25]. Consequently, employing identical techniques or models might not adequately capture sentiment when analyzing voices from both female and male speakers.

The authors of [26] explored the concept of fairness in machine learning algorithms, shedding light on the causes of algorithmic bias and unfairness, as well as common definitions and measures of fairness. Additionally, they examined mechanisms designed to enhance fairness. Among the fairness measures discussed is overall accuracy equality, which requires similar accuracy across diverse groups. In our experiment, attaining similar accuracy between male and female audio datasets using a model trained for a different gender proved to be challenging. Additionally, when employing a gender-agnostic model, the overall accuracy was notably lower compared to gender-specific models, underscoring the unfairness of the audio sentiment analysis models.

Research across diverse domains, including facial recognition, healthcare, business, and computer graphics, extensively reports fairness issues. For instance, facial analysis methods reveal race and gender biases, leading to disproportionate misclassification of dark-skinned females [27]. Additionally, rendering algorithms in computer graphics predominantly favor light-skinned individuals [28]. Historical data biases in healthcare machine learning are evident [29], and despite pulse oximetry’s long-known reduced accuracy for dark skin tones [30], proper guidelines on addressing this bias in clinical practice are still lacking. Furthermore, Amazon discovered that its machine learning hiring system exhibited bias against female candidates, especially for software development and technical roles; one potential explanation is that the historical training data consisted predominantly of male software developers [31]. In a separate instance in advertising, Google’s ad-targeting algorithm was found to suggest higher-paying executive positions more frequently to men than to women [32]. These situations arise from training models on historical data with a higher concentration of data from one gender than the other. Such occurrences imply that the audio sentiment analysis domain might also be subject to unfairness stemming from demographic factors in machine learning models. Our evidence highlights gender bias: the gender-agnostic model struggles to accurately capture sentiments from both genders, and gender-specific models perform poorly when applied to a gender different from their training data.

Several studies [33,34,35] have addressed bias in text-based sentiment analysis. However, we found no research addressing bias in audio-based sentiment analysis, a notable gap in the existing research landscape that warrants further exploration. To gain deeper insights into fairness within the machine learning community, we investigate fairness in machine learning-based audio sentiment analysis for gender-specific and gender-agnostic models.

3 Methodology

We employed the EmoFilm dataset [36], a multilingual corpus comprising 1,115 emotional utterances in English, Spanish, and Italian taken from 43 films and 207 speakers (113 male and 94 female). There were 578 audio files labeled as male and 537 labeled as female. This research used the existing dataset and did not involve direct human participants. To our knowledge, this dataset has not previously been used for sentiment analysis with the combination of spectrograms and the bag-of-visual-words method. Each audio file’s label was included in the filename, making it simple to identify the gender and emotion (fear, disgust, happiness, anger, or sadness) represented by the audio signal. For our binary sentiment classification goal, we chose 204 happiness and 238 sadness audio files, divided equally between male and female speakers.
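
Because the labels are encoded in the filenames, the 442-file subset can be assembled with a short script. The sketch below is illustrative only; the `gender_of` and `emotion_of` parsers, the `_f_` token, and the directory path are hypothetical placeholders that would need to be adapted to the actual EmoFilm naming convention.

```python
# Sketch: select the happiness/sadness subset and split it by gender using labels
# embedded in the filenames. The gender_of()/emotion_of() parsers, the "_f_" token,
# and the directory path are hypothetical placeholders; the real EmoFilm naming
# convention must be substituted here.
from pathlib import Path

def gender_of(filename: str) -> str:
    """Hypothetical parser: return 'female' or 'male' from the filename."""
    return "female" if "_f_" in filename else "male"

def emotion_of(filename: str) -> str:
    """Hypothetical parser: return the emotion label encoded in the filename."""
    for emotion in ("fear", "disgust", "happiness", "anger", "sadness"):
        if emotion in filename:
            return emotion
    return "unknown"

audio_dir = Path("emofilm/audio")            # assumed location of the .wav files
selected = {"happiness": [], "sadness": []}  # binary sentiment task

for wav in sorted(audio_dir.glob("*.wav")):
    emotion = emotion_of(wav.name)
    if emotion in selected:
        selected[emotion].append((wav, gender_of(wav.name)))

print(len(selected["happiness"]), "happiness files;", len(selected["sadness"]), "sadness files")
```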

We adopted the methodology outlined in a prior study [19], which involves generating spectrograms from audio files and then creating histograms through a bag-of-visual-words representation. The histograms were later passed to classifiers such as RF, SVM, and KNN for sentiment classification. Figure 1 shows the pipeline of the experiment. All the steps were the same while developing gender-specific models and gender-agnostic models; only the parameters differed. We conducted our experiment in two parts: first, we generated the histograms from the audio files, and later we passed the histograms to the classifiers. In the first part of the experiment, we transformed the audio files into spectrograms using the short-time Fourier transform (STFT) with the Librosa library in Python, where different sample rates were applied for different genders. The best combinations of parameters are given in the results section. We then used the oriented FAST and rotated BRIEF (ORB) [37] algorithm to extract keypoints and descriptors. Using the descriptors, we generated the histogram for each file and saved it as a .csv file for later analysis. After transforming the raw audio dataset into histogram .csv files, in the second part of the experiment we generated multiple training and test datasets using a random 75:25 split ratio for the gender-specific and gender-agnostic models. We used the training data to train our classifier models and then applied the test data to extract sentiments from audio files, specifically those they were trained for. After achieving optimal accuracy for both the gender-specific and gender-agnostic models, we conducted tests with a separate dataset to evaluate accuracy and examine gender bias in the audio sentiment analysis task. Hyperparameter optimization was done in several stages; the optimizations we performed are listed below:

  1.

    Generating spectrograms We used different sample rates (11,025 Hz, 22,050 Hz, and 44,100 Hz) while generating spectrograms to check which one works best for each gender in our experiment. The sample rate refers to the number of individual data points captured per second of audio. We aimed to investigate whether adjusting the sample rate could improve accuracy. The default sample rate in the Python Librosa library, used for generating the spectrograms, is 22,050 Hz, so we modified the rate by both doubling and halving it. We found that a sample rate of 22,050 Hz works best for female audio and a sample rate of 11,025 Hz works best for male audio.

  2.

    Building a visual dictionary We used the bag-of-visual-words (BOVW) technique to create a visual dictionary. The concept is adapted from the bag-of-words (BOW) model in information retrieval and NLP: instead of counting the words that appear in a document, BOVW treats image features (keypoints and descriptors) as words and builds a frequency histogram. We used the ORB algorithm to extract keypoints, each with a 32-byte binary descriptor, from the spectrograms. We tested different numbers of keypoints (100, 150, 300) to see whether we could improve the accuracy for male and female audio files, and found that 150 keypoints for female audio files and 300 keypoints for male audio files work best.

Fig. 1 Pipeline of the gender-specific audio sentiment analysis model

After extracting the keypoints, we used K-means clustering with different numbers of clusters (5, 10, and 20) to build the visual dictionary and compared the resulting performance for the models customized for females, males, and both. We found that 10 clusters worked best for both male and female audio files.

  3.

    Hyperparameter optimization algorithms We applied hyperparameter optimization using Randomized Search and Grid Search for the classifiers (RF, SVM, and KNN) to find the best parameters for the models customized for males and females, and this optimization allowed us to increase the models’ accuracy. We experimented with 243 scenarios, varying the sample rate, number of keypoints, and number of clusters, to obtain the best hyperparameters for all three models (female-model, male-model, and gender-agnostic-model). Table 1 lists all the hyperparameters used for optimizing the accuracy of the female, male, and gender-agnostic models; a minimal code sketch of the end-to-end pipeline is given after Table 1.

Table 1 Hyperparameters for the optimization of the female, male and gender-agnostic models
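
The sketch below illustrates, under stated assumptions, the feature-extraction part of this pipeline: loading an audio file at a gender-specific sample rate, rendering the dB spectrogram as an 8-bit image, extracting ORB keypoints and descriptors, clustering the descriptors with K-means into a visual vocabulary, and building one histogram per file. The image-conversion step, file handling, and variable names are our assumptions rather than the exact implementation.

```python
# Minimal sketch of the spectrogram -> ORB -> bag-of-visual-words pipeline
# (assumes librosa, OpenCV, scikit-learn, and NumPy are installed; `files` is a
# list of .wav paths, and sample_rate / n_keypoints / n_clusters are set per
# model, e.g. 22,050 Hz / 150 / 10 for the female-model as reported in Table 3).
import cv2
import librosa
import numpy as np
from sklearn.cluster import KMeans

def spectrogram_image(path, sample_rate):
    """Load audio at the chosen sample rate and return its dB spectrogram as an 8-bit image."""
    y, _ = librosa.load(path, sr=sample_rate)
    spec_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    # Rescale to 0-255 because ORB expects an 8-bit grayscale image.
    return cv2.normalize(spec_db, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

def orb_descriptors(image, n_keypoints):
    """Detect ORB keypoints and return their 32-byte binary descriptors."""
    orb = cv2.ORB_create(nfeatures=n_keypoints)
    _, descriptors = orb.detectAndCompute(image, None)
    return descriptors  # shape (n_detected, 32), or None if no keypoints were found

def bovw_histograms(files, sample_rate, n_keypoints, n_clusters):
    """Cluster all descriptors into a visual vocabulary and build one histogram per file."""
    descriptors = [orb_descriptors(spectrogram_image(f, sample_rate), n_keypoints) for f in files]
    descriptors = [d for d in descriptors if d is not None]  # skip files without keypoints
    vocabulary = KMeans(n_clusters=n_clusters, random_state=0).fit(np.vstack(descriptors))
    return np.array([np.bincount(vocabulary.predict(d), minlength=n_clusters) for d in descriptors])

# Example (female-model settings): histograms = bovw_histograms(files, 22050, 150, 10)
```

The resulting histograms can then be written to .csv files and split 75:25 into training and test sets, as described above.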

4 Results

We analyzed our results in two phases. First, based on accuracy, we established the best gender-agnostic-model, female-model, and male-model by performing hyperparameter optimization. Second, we investigated whether the optimal model for a given gender retains its accuracy when applied to the opposite gender, in order to assess the algorithms’ gender-related bias.

Table 2 shows 27 experiment scenarios developed for the male-model with a constant sample rate of 11,025 Hz. For instance, scenario #5 involves a sample rate of 11,025 Hz, 100 keypoints, 10 clusters, and the SVM classification algorithm. We employed two additional sample rates, 22,050 Hz and 44,100 Hz, resulting in a total of 81 distinct experiment scenarios constructed from the male audio samples (3 sample rates for each of the 27 variations). We employed the same methodology to derive the optimal female-model from the female audio samples, as well as the gender-agnostic-model from audio samples of both genders. Hence, a total of 243 (= 3 × 81) scenarios were generated by employing three distinct models, each comprising 81 unique scenarios; a short sketch of this scenario grid follows Table 3. In the interest of brevity, not all scenarios are presented in the tables. Table 3 shows the combinations of parameters that generated the best models for the three types of audio samples (female, male, and both).

Table 2 Hyperparameter optimization for male-models
Table 3 Best parameters for each gender-specific model and gender-agnostic model
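
For reference, the 243 scenarios follow from a full factorial grid over the three sample rates, three keypoint counts, three cluster counts, and three classifiers, repeated for the three model types; the snippet below merely enumerates that grid to make the counts explicit.

```python
# Enumerate the experiment grid: 3 x 3 x 3 x 3 = 81 scenarios per model type,
# and three model types (female, male, gender-agnostic) give 243 scenarios overall.
from itertools import product

sample_rates = [11025, 22050, 44100]        # Hz
keypoint_counts = [100, 150, 300]           # ORB keypoints
cluster_counts = [5, 10, 20]                # K-means vocabulary sizes
classifiers = ["RF", "SVM", "KNN"]
model_types = ["female", "male", "gender-agnostic"]

scenarios = list(product(model_types, sample_rates, keypoint_counts, cluster_counts, classifiers))
print(len(scenarios))                        # 243
print(len(scenarios) // len(model_types))    # 81 per model (27 per fixed sample rate)
```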

Table 4 reports each classifier’s accuracy, macro-average precision, and recall for the three models. The SVM classifier achieved an accuracy of 72.5% (precision 73.06%, recall 73.96%) for the female-model, 72% (precision 72.50%, recall 71.63%) for the male-model, and 56% (precision 59.23%, recall 56.54%) for the gender-agnostic-model. The KNN classifier achieved an accuracy of 72% (precision 75.69%, recall 75.03%) for the female-model, 72% (precision 73.26%, recall 71.47%) for the male-model, and 59% (precision 62.32%, recall 59.48%) for the gender-agnostic-model. RF performed best among the three classifiers for all three models: 78% accuracy (precision 78%, recall 78.41%) for the female-model, 73.91% accuracy (precision 73.87%, recall 73.99%) for the male-model, and 65.77% accuracy (precision 67.78%, recall 66.43%) for the gender-agnostic-model. In summary, Table 4 shows that RF outperformed SVM and KNN across all three models in terms of accuracy, precision, and recall, while SVM and KNN showed varying performance across the different models.

Table 4 Different models’ accuracy, precision, and recall for different algorithms
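
The metrics in Table 4 correspond to accuracy plus macro-averaged precision and recall on a held-out 25% test split. A rough scikit-learn sketch is shown below; the synthetic `X` and `y` arrays stand in for the bag-of-visual-words histograms and sentiment labels, and the grid-search parameters at the end are illustrative only (the actual search space is listed in Table 1).

```python
# Sketch: train RF, SVM, and KNN on bag-of-visual-words histograms and report
# accuracy, macro precision, and macro recall, mirroring the layout of Table 4.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score

rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(442, 10))   # stand-in for 442 histograms over 10 visual words
y = rng.integers(0, 2, size=442)          # stand-in for happiness/sadness labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

classifiers = {
    "RF": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
}

for name, clf in classifiers.items():
    y_pred = clf.fit(X_train, y_train).predict(X_test)
    print(name,
          f"accuracy={accuracy_score(y_test, y_pred):.2%}",
          f"macro precision={precision_score(y_test, y_pred, average='macro'):.2%}",
          f"macro recall={recall_score(y_test, y_pred, average='macro'):.2%}")

# Hyperparameters can then be tuned per model, e.g. with an (illustrative) grid search:
rf_search = GridSearchCV(RandomForestClassifier(random_state=42),
                         {"n_estimators": [100, 300], "max_depth": [None, 10]}, cv=5)
rf_search.fit(X_train, y_train)
```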

Gender-agnostic-model This model was built using both male and female audio files. We performed hyperparameter optimization, and the best combination was a sample rate of 44,100 Hz, 100 keypoints (using an ORB extractor), 5 clusters while building bag-of-visual-words, and an RF algorithm as the classifier.

Female-model This model was customized for female audio files. We performed hyperparameter optimization, and the best combination was a sample rate of 22,050 Hz, 150 keypoints (using an ORB extractor), 10 clusters while building bag-of-visual-words, and an RF algorithm as the classifier.

Male-model This model was customized for male audio files. We performed hyperparameter optimization, and the best combination was a sample rate of 11,025 Hz, 300 keypoints (using an ORB extractor), 10 clusters while building bag-of-visual-words, and an RF algorithm as the classifier.

In the second phase, our objective was to investigate whether employing a single model for different genders fails to provide an impartial solution for audio sentiment analysis, which would imply that machine learning models can exhibit gender biases.

Table 5 presents the accuracy results of three distinct models—Gender-agnostic-Model, Female-Model, and Male-Model—based on the gender of the audio files they were tested on. The accuracy percentages reflect how well each model performed under different gender scenarios:

Table 5 Accuracy of models based on the gender of audio files

Both genders’ audio files The gender-agnostic-model achieved an accuracy of 66%, the female-model scored 62%, and the male-model attained 64% accuracy when tested on audio files containing speech from both male and female speakers.

Female audio files When tested on audio files with female speech, the gender-agnostic-model exhibited an accuracy of 62%, the female-model demonstrated higher accuracy at 78%, and the male-model had an accuracy of 60%.

Male audio files When evaluated on audio files containing male speech, the gender-agnostic-model achieved an accuracy of 61%, the female-model scored 57%, and the male-model displayed the highest accuracy at 74%.

In summary, the table provides insights into how each model’s accuracy varies across different gender-specific scenarios, indicating their performance on audio files with both male and female speech, exclusively female speech, and exclusively male speech.

5 Discussion and conclusion

Our dataset has a good representation of diverse geographies and cultures: we employed multilingual audio files, comprising 1,115 emotional utterances in English, Spanish, and Italian, with both male and female speakers, for audio sentiment analysis. In our experiments, we observed that machine learning models do not consistently exhibit fairness, where fairness entails treating various demographic groups equally. The gender-specific models struggled to capture the sentiment of the gender they were not trained on, and the gender-agnostic model showed poor performance in accurately classifying sentiments for both female and male audio samples, as evidenced by the accuracy results. The performance of our best models shows that the spectrogram representation of audio data captures the salient features of female voices better, resulting in higher accuracy for the female sentiment analysis model (78%) than for the male-model (74%). The models further show biased performance when the audio samples used to train a sentiment analysis model differ in gender representation from those used for testing. This bias can result from voice features differing between genders.

A study in [38] explores the role of gender in audio emotion analysis. The researchers conducted experiments with various features to investigate how these features differ by gender. For instance, they observed that pitch, determined by the fundamental frequency, tends to be lower in adult male voices than in female voices, while amplitude, which determines loudness, is slightly higher in males than in females. The spectrum, reflecting the energy distribution of the voice in the frequency domain and characterizable using vocal jitter and shimmer, also exhibits gender-related variations, and other features such as vocal tract length, harmonic structure, and speech rate likewise differ between genders. These gender-related differences in audio features highlight the importance of not using the same parameters when building models, as doing so can introduce bias. Consequently, in our experiments we chose to construct personalized models instead of gender-agnostic models to address these disparities. Models designed for specific genders in audio sentiment analysis are trained on gender-segregated data, potentially allowing them to capture gender-specific nuances and patterns more effectively and thereby improve accuracy. Enhancing accuracy in audio sentiment analysis can have widespread benefits across domains such as healthcare, business, education, and social media.
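
As an illustration of how such feature differences might be inspected, the short sketch below estimates mean fundamental frequency (pitch) and RMS energy (loudness) per group with librosa; the directory layout is hypothetical, and this snippet is not part of our experimental pipeline.

```python
# Sketch: compare mean fundamental frequency (pitch) and RMS energy (loudness)
# between groups of audio files, illustrating the gender-related differences
# discussed above. The "audio/female" and "audio/male" directories are hypothetical.
from pathlib import Path
import librosa
import numpy as np

def mean_pitch_and_loudness(path):
    y, sr = librosa.load(path, sr=None)
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
    return np.nanmean(f0), float(np.mean(librosa.feature.rms(y=y)))

for group in ("female", "male"):
    files = sorted(Path(f"audio/{group}").glob("*.wav"))
    stats = np.array([mean_pitch_and_loudness(f) for f in files])
    print(group, "mean F0 (Hz):", round(stats[:, 0].mean(), 1), "mean RMS:", round(stats[:, 1].mean(), 4))
```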

The three pieces of evidence shown in the results section clearly establish that using AI algorithms with the same parameters to build a one-size-fits-all audio sentiment analysis model is not fair to audio files from the two genders, male and female. In other words, features and modeling techniques can introduce biases. We observed the following three pieces of evidence in our experiments:

  1.

    The gender-agnostic-model showed poor performance on the audio samples from both female and male voices. Additionally, neither of the gender-specific models (i.e., female-model and male-model) achieved overall accuracy equality when tested with the audio samples from both genders.

  2.

    The model that performed best with female audio samples performed significantly worse when tested with male audio samples.

  3.

    The model that performed best with male audio samples performed significantly worse when tested with female audio samples.

The results displayed in Table 5 substantiate the need for gender-specific personalized models or algorithms that can handle these differences and still provide better results. Another valuable insight we inferred from the results is that the male-model performed better than the female-model in a general setting, i.e., when both genders’ audio samples were used (male-model 64% vs. female-model 62%). Additionally, the male-model performed slightly better than the female-model when the opposite gender’s audio samples were used, i.e., female audio samples for the male-model (60%) and male audio samples for the female-model (57%). However, the best model overall in terms of accuracy is the female-model, with 78% accuracy when tested against female audio samples. These comparisons indicate that ML algorithms exhibit gender biases in audio sentiment analysis tasks: performance serves as one indicator of algorithmic bias, and these accuracy comparisons demonstrate bias in the AI algorithms employed for audio sentiment analysis.

In this experiment, we have tried to showcase a scenario that might occur in real-world settings. If a model is trained on male-voiced audio samples during development but later encounters female-voiced samples at deployment, or vice versa, the model will perform unfairly. From the findings in Table 5, it is apparent that utilizing the male-model for female audio samples results in a low accuracy of 60%. Additionally, the gender-specific models exhibit subpar performance on audio datasets containing both male and female audio. This is explicitly demonstrated in Table 5: the accuracy of the male-model is 74%, but when applied to both genders’ audio datasets it drops to 64%; similarly, the accuracy of the female-model is 78%, but when applied to both genders’ audio datasets it decreases to 62%. Hence, in the real world, implementing gender-specific models trained for each gender can enhance audio sentiment analysis performance. For example, in call centers, analyzing the sentiment of customer calls proves invaluable for issue identification, customer satisfaction evaluation, and improving service quality. It is essential to note that if the training data consists primarily of male audio and the model is deployed in situations where both male and female voices are present, it might struggle to effectively capture sentiments. Another area that could benefit from gender-specific audio sentiment analysis models is the entertainment industry, where understanding the diverse responses of different genders to content such as movies and TV shows can yield crucial insights that can be used to tailor content to specific target audiences, enhancing engagement and reception. Numerous other fields could also benefit from gender-specific audio sentiment analysis models, including healthcare, voice-based security systems, social media monitoring, and online learning platforms. Nonetheless, considering demographic factors in audio sentiment analysis models requires careful attention to ethical considerations. It is important to minimize any potential unfair advantages or disadvantages to any demographic group. Additionally, obtaining users’ consent and transparently communicating the purpose behind using gender-specific models are critical for ensuring informed decision-making. Achieving a balance between the benefits of such models and upholding privacy, fairness, and inclusivity is critical for promoting ethical model deployment in real-world applications.

We developed gender-specific trained models to treat both genders’ audio files fairly. Such models could be deployed in various real-world scenarios, including but not limited to call centers, the entertainment industry, and healthcare. The dataset we used included demographic information such as gender in the filename, which made it easy to separate the files by gender and develop the gender-specific models. In many cases, however, datasets do not provide such demographic information. In such a scenario, one solution could be to first apply a gender recognition model and then pass each audio file to the corresponding gender-specific personalized model. Personalizing the model in this way could increase its accuracy for both genders: averaging the results of the two gender-specific models (the male-model achieved 74% accuracy on male audio input and the female-model 78% on female audio input) gives an accuracy of 76%, surpassing the 66% obtained when both genders’ audio samples were used together without gender consideration. This is the resolution we offer to mitigate bias: constructing audio sentiment analysis models that consider the demographic factor of gender. Developing an ensemble of a gender recognition model and gender-specific models and combining the results provides a personalized touch instead of a one-size-fits-all approach, as sketched below.
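
A minimal sketch of this two-stage ensemble follows, assuming a pre-trained gender recognizer and the two gender-specific sentiment classifiers are available; `gender_recognizer`, `female_model`, `male_model`, and `extract_histogram` are hypothetical placeholders (the latter could be built from the bag-of-visual-words pipeline sketched in Sect. 3), and the per-gender settings are those reported in Table 3.

```python
# Sketch of the proposed ensemble: recognize the speaker's gender first, then apply
# the matching gender-specific sentiment model with that gender's best settings.
# `gender_recognizer`, `female_model`, `male_model`, and `extract_histogram` are
# hypothetical pre-trained components, not part of a released implementation.

def predict_sentiment(audio_path):
    # Stage 1: gender recognition (any off-the-shelf gender-ID model could be used here).
    gender = gender_recognizer.predict(audio_path)  # returns "female" or "male"

    # Stage 2: feature extraction with the recognized gender's best parameters (Table 3),
    # followed by the corresponding gender-specific sentiment classifier.
    if gender == "female":
        hist = extract_histogram(audio_path, sample_rate=22050, n_keypoints=150, n_clusters=10)
        return female_model.predict([hist])[0]
    hist = extract_histogram(audio_path, sample_rate=11025, n_keypoints=300, n_clusters=10)
    return male_model.predict([hist])[0]

# Usage: sentiment = predict_sentiment("customer_call.wav")  # "happiness" or "sadness"
```

Note that errors in the gender recognition stage propagate to the sentiment stage, so the recognizer’s accuracy bounds the gain such an ensemble can deliver in practice.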

In conclusion, similar to other domains such as facial recognition, healthcare, and business, our findings indicate that audio sentiment analysis is subject to gender bias. The evident disparity in model accuracy highlights this bias: the female-specific model attains 78% accuracy on female audio datasets, but its accuracy decreases to 62% on audio datasets covering both genders; similarly, the male-specific model achieves 74% accuracy on male audio datasets, but its accuracy declines to 64% on datasets covering both genders. Furthermore, the gender-agnostic model performs notably worse than the gender-specific models, yielding an accuracy of only 66%. We propose an ensemble approach comprising a gender recognition model and gender-specific models to raise the overall accuracy of audio sentiment analysis from 66% to 76%. This ensemble approach holds potential advantages in various real-world applications, such as call centers, advertising, and the entertainment industry; however, careful ethical evaluation is necessary to ensure responsible deployment. Researchers increasingly acknowledge and underscore the significance of fairness in machine learning and AI systems, yet we found no existing studies that specifically investigate the fairness of machine learning algorithms with respect to gender bias in audio sentiment analysis. This paper employs three popular machine learning algorithms (RF, SVM, and KNN) to demonstrate the significant impact that gender differences can have on model accuracy.

6 Limitations

In our experimental study, we were able to detect the presence of bias related to demographic factors such as gender within machine learning models. With a more diverse dataset covering additional demographic characteristics, including race, age, and ethnicity, we could potentially develop more equitable models that still exhibit high accuracy. The core idea we emphasize is the importance of recognizing demographic representations, whether related to gender, race, or ethnicity, within datasets; instead of pursuing a one-size-fits-all approach, adopting a demographic segmentation strategy is essential. Our focus was on binary genders because we only had data from individuals of these genders; further research is needed to determine whether our claim holds for people who identify beyond the binary gender spectrum. Many factors may contribute to biases in audio sentiment analysis, and our research highlights gender as one such factor. Our findings are drawn from the dataset used in this study and have not been validated on alternative datasets; we plan to test the reproducibility of our findings on diverse datasets in future research.