1 Introduction

Artificial intelligence (AI) has been used in many classification and recognition tasks, including sentiment analysis. However, in the race to build ever more accurate models, researchers have not always paid attention to uneven system capability development [1]. Increasingly, researchers are realizing that high accuracy on standard benchmark datasets does not guarantee successful deployment in real-world scenarios [2,3,4]. Many demographic factors, such as gender, age, and ethnicity, play a role in real-world scenarios that standard benchmark datasets do not always capture. Because of this issue, many machine learning models are never deployed to production. As researchers become aware of the problem, the reliability, credibility, and fairness of artificial intelligence have attracted attention, especially for human-in-the-loop or human-centered AI, and recent research has raised concerns over fairness implications [5]. A growing number of researchers have realized that the “one-size-fits-all” approach does not fit at all in AI systems and are committed to designing AI algorithms that are trustworthy [6], fair [7], and unbiased [8, 9]. Machine learning and AI tools are meant to make human life easier, with the expectation that these solutions do what humans would have done, only faster, more consistently, and without bias. The irony is that such solutions are frequently observed to perform better for some demographic groups than for others, making biased judgments and escalating inequality [10, 11]. Hence, utilizing as much demographic information as possible could bring a solution closer to how humans would have acted themselves, without bias.

The fairness of AI algorithms is an evolving research area that stems from the general need for decision-making to be free from bias and discrimination [12]. Schmitz et al. experimented with balancing the accuracy and fairness of multimodal architectures for emotion recognition and found that the fairest bimodal model is an audio + video fusion [13]. Ricci Lara et al. discussed the meaning of fairness in medical image processing and commented on potential sources of bias and available strategies to mitigate it [14]. To explore fairness in dialogue systems, Liu et al. constructed a benchmark dataset and proposed quantitative measures for understanding fairness in dialogue models; they found that popular dialogue models exhibited considerable prejudice against different genders and races, and they provided two methods to mitigate bias in dialogue systems [15]. In the context of current issues in healthcare, Chen et al. summarized fairness in machine learning and its intersectional fields, outlining how algorithmic biases arise in existing clinical workflows and the healthcare disparities that result [16]. Although the research in [1, 16,17,18] shows that AI algorithms can be biased against specific populations or groups in various situations, there is a gap in understanding fairness in audio sentiment analysis.

We used 442 audio files with male and female voices, transformed them into spectrograms, and used a bag-of-visual-words feature representation together with machine learning algorithms such as Random Forest (RF), Support Vector Machines (SVM), and K-nearest Neighbors (KNN) to investigate the fairness of audio sentiment analysis models in terms of overall accuracy equality between female and male genders. We found that models generated with the same parameters and algorithms do not perform the same way for different genders’ audio files; hence, using a gender-agnostic model to analyze the sentiments of different genders results in poor accuracy in audio sentiment analysis. In a previous study, a model was generated and tested using only female audio files, achieving an accuracy of 76% [19]. In this study, we generated a gender-agnostic model using both male and female audio files together, and the accuracy decreased to 66%. This notable difference in accuracy prompted us to conduct experiments that revealed gender-related bias within the machine learning algorithms for audio sentiment analysis. We therefore separated the audio files into male and female groups and built gender-specific trained models: the female-model was trained using female audio data, while the male-model used male audio data. This approach aimed to address the poor accuracy of the gender-agnostic model. We then tested the accuracy of each gender-specific model against a gender-specific dataset and a dataset representing both genders. Even after performing hyperparameter optimization of the machine learning algorithms, the model trained with female audio files (female-model), which achieved the best accuracy of 78%, did not perform well with male audio files, attaining an accuracy of 57%. Similarly, the model trained with male audio files (male-model), which achieved an accuracy of 74%, did not perform well with female audio files, attaining an accuracy of 60%. The observed differences in accuracy verify that a personalized approach based on gender is needed to obtain better accuracy.

Our main contributions are the following:

  • We provide three pieces of evidence highlighting the need to address fairness concerns in machine learning models used in audio sentiment analysis tasks.

  • We offer a resolution to mitigate bias by constructing models for audio sentiment analysis that consider the demographic factor of gender.

  • We demonstrate the importance of considering the demographic factor of gender in audio sentiment analysis tasks, providing a valuable insight that could serve as a reference for future researchers.

The rest of this paper is organized as follows. Section 2 provides a literature review of audio sentiment analysis. Section 3 presents the methodology of the study, followed by the results in Sect. 4. Section 5 provides the discussion and conclusion, and Sect. 6 outlines the limitations of the study.

2 Related Work

Numerous studies investigate audio sentiment analysis using voices from both male and female speakers. However, there is a scarcity of follow-up experiments examining the potential impact of gender on the accuracy of a single model employed for both genders [20,21,22]. Jia and SungChu reported an accuracy of 60.1% for the top-performing audio sentiment analysis model in multimodal sentiment analysis before fusion [20]. Accounting for the demographic factor of gender and crafting a gender-specific model could potentially have yielded higher accuracy. Furthermore, the results in [21] substantiate our assertion regarding the significance of gender, even though the authors do not explicitly discuss its impact on their models. They employed three distinct datasets to assess the accuracy of their fully convolutional neural network models. Notably, the Toronto Emotional Speech Set (TESS) [23], an acted dataset comprising 2800 concise audio samples delivered by two actresses, achieved an accuracy rate of 99.03%. Conversely, another dataset, the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [24], containing audio recordings from 24 individuals (12 male and 12 female), achieved an accuracy rate of 75.28%. These accuracy figures highlight a distinct trend: models trained on a single gender’s voice consistently outperform those using voices from both genders. Male and female voices tend to have different pitch ranges, tones, and resonance [25]. Consequently, employing identical techniques or models might not adequately capture sentiment when analyzing voices from both female and male speakers.

The authors of [26] explored the concept of fairness in machine learning algorithms, shedding light on the causes of algorithmic bias and unfairness, as well as common definitions and measures of fairness. Additionally, they examined mechanisms designed to enhance fairness. Among the fairness measures discussed is overall accuracy equality, which requires similar accuracy across diverse groups. In our experiment, attaining similar accuracy between male and female audio datasets using a model trained for a different gender proved to be challenging. Additionally, when employing a gender-agnostic model, the overall accuracy was notably lower compared to gender-specific models, underscoring the unfairness of the audio sentiment analysis models.

Research across diverse domains, including facial recognition, healthcare, business, and computer graphics, extensively reports fairness issues. For instance, facial analysis methods reveal race and gender biases, leading to disproportionate misclassification of dark-skinned females [27]. Additionally, rendering algorithms in computer graphics predominantly favor light-skinned individuals [28]. Historical data biases in healthcare machine learning are evident [29], and despite pulse oximetry’s long-known reduced accuracy for dark skin tones [30], proper guidelines on addressing this bias in clinical practice are still lacking. Furthermore, Amazon discovered that its machine learning hiring system exhibited bias against female candidates, especially for software development and technical roles; one potential explanation is that the historical training data consisted predominantly of male software developers [31]. In a separate instance in advertising, Google’s ad-targeting algorithm was found to suggest higher-paying executive positions more frequently to men than to women [32]. These situations arise from training models on historical data with a higher concentration of data from one gender than the other. Such occurrences imply that the audio sentiment analysis domain might also be subject to unfairness stemming from demographic factors in machine learning models. Our evidence highlights gender bias: the gender-agnostic model struggles to accurately capture sentiments from both genders, and gender-specific models perform poorly when applied to a gender different from their training data.

Several studies [33,34,35] have addressed bias in text-based sentiment analysis. However, we found no research addressing bias in audio-based sentiment analysis, a notable gap in the existing research landscape that warrants further exploration. To gain deeper insights into fairness within the machine learning community, we investigate fairness in machine learning-based audio sentiment analysis for gender-specific and gender-agnostic models.

3 Methodology

We employed the EmoFilm dataset [36], a multilingual corpus comprising 1,115 emotional utterances in English, Spanish, and Italian taken from 43 films and 207 speakers (113 male and 94 female). There were 578 audio files labeled as male and 537 labeled as female. This research used the existing dataset and did not involve direct human participants. To our knowledge, this dataset has not previously been used for sentiment analysis with the combination of spectrograms and the bag-of-visual-words method. Each audio file’s label was included in the filename, making it simple to identify the gender and emotion (fear, disgust, happiness, anger, or sadness) represented by the audio signal. For our binary sentiment classification goal, we chose 204 happiness and 238 sadness audio files, divided equally between male and female speakers.
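
Because the labels are encoded in the filenames, the 442-file subset can be assembled with a short script. The sketch below is illustrative only; the `gender_of` and `emotion_of` parsers, the `_f_` token, and the directory path are hypothetical placeholders that would need to be adapted to the actual EmoFilm naming convention.

```python
# Sketch: select the happiness/sadness subset and split it by gender using labels
# embedded in the filenames. The gender_of()/emotion_of() parsers, the "_f_" token,
# and the directory path are hypothetical placeholders; the real EmoFilm naming
# convention must be substituted here.
from pathlib import Path

def gender_of(filename: str) -> str:
    """Hypothetical parser: return 'female' or 'male' from the filename."""
    return "female" if "_f_" in filename else "male"

def emotion_of(filename: str) -> str:
    """Hypothetical parser: return the emotion label encoded in the filename."""
    for emotion in ("fear", "disgust", "happiness", "anger", "sadness"):
        if emotion in filename:
            return emotion
    return "unknown"

audio_dir = Path("emofilm/audio")            # assumed location of the .wav files
selected = {"happiness": [], "sadness": []}  # binary sentiment task

for wav in sorted(audio_dir.glob("*.wav")):
    emotion = emotion_of(wav.name)
    if emotion in selected:
        selected[emotion].append((wav, gender_of(wav.name)))

print(len(selected["happiness"]), "happiness files;", len(selected["sadness"]), "sadness files")
```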

We adopted the methodology outlined in a prior study [19], which involves generating spectrograms from audio files and then creating histograms through a bag-of-visual-words representation. The histograms were later passed to classifiers such as RF, SVM, and KNN for sentiment classification. Figure 1 shows the pipeline of the experiment. All the steps were the same while developing gender-specific models and gender-agnostic models; only the parameters differed. We conducted our experiment in two parts: first, we generated the histograms from the audio files, and later we passed the histograms to the classifiers. In the first part of the experiment, we transformed the audio files into spectrograms using the short-time Fourier transform (STFT) with the Librosa library in Python, where different sample rates were applied for different genders. The best combinations of parameters are given in the results section. We then used the oriented FAST and rotated BRIEF (ORB) [37] algorithm to extract keypoints and descriptors. Using the descriptors, we generated the histogram for each file and saved it as a .csv file for later analysis. After transforming the raw audio dataset into histogram .csv files, in the second part of the experiment we generated multiple training and test datasets using a random 75:25 split ratio for the gender-specific and gender-agnostic models. We used the training data to train our classifier models and then applied the test data to extract sentiments from audio files, specifically those they were trained for. After achieving optimal accuracy for both the gender-specific and gender-agnostic models, we conducted tests with a separate dataset to evaluate accuracy and examine gender bias in the audio sentiment analysis task. Hyperparameter optimization was done in several stages; the optimizations we performed are listed below:

  1.

    Generating spectrograms We used different sample rates (11,025 Hz, 22,050 Hz, and 44,100 Hz) while generating spectrograms to check which one works best for each gender in our experiment. The sample rate refers to the number of individual data points captured per second of audio. We aimed to investigate whether adjusting the sample rate could improve accuracy. The default sample rate in the Python Librosa library, used for generating the spectrograms, is 22,050 Hz, so we modified the rate by both doubling and halving it. We found that a sample rate of 22,050 Hz works best for female audio and a sample rate of 11,025 Hz works best for male audio.

  2.

    Building a visual dictionary We used the bag-of-visual-words (BOVW) technique to create a visual dictionary. The concept is adapted from the bag-of-words (BOW) model in information retrieval and NLP: instead of counting the words that appear in a document, BOVW treats image features (keypoints and descriptors) as words and builds a frequency histogram. We used the ORB algorithm to extract keypoints, each with a 32-byte binary descriptor, from the spectrograms. We tested different numbers of keypoints (100, 150, 300) to see whether we could improve the accuracy for male and female audio files, and found that 150 keypoints for female audio files and 300 keypoints for male audio files work best.

Fig. 1 Pipeline of the gender-specific audio sentiment analysis model

After extracting the keypoints, we used K-means clustering with different numbers of clusters (5, 10, and 20) to build the visual dictionary and compared the resulting performance for the models customized for females, males, and both. We found that 10 clusters worked best for both male and female audio files.

  3.

    Hyperparameter optimization algorithms We applied hyperparameter optimization using Randomized Search and Grid Search for the classifiers (RF, SVM, and KNN) to find the best parameters for the models customized for males and females, and this optimization allowed us to increase the models’ accuracy. We experimented with 243 scenarios, varying the sample rate, number of keypoints, and number of clusters, to obtain the best hyperparameters for all three models (female-model, male-model, and gender-agnostic-model). Table 1 lists all the hyperparameters used for optimizing the accuracy of the female, male, and gender-agnostic models; a minimal code sketch of the end-to-end pipeline is given after Table 1.

Table 1 Hyperparameters for the optimization of the female, male and gender-agnostic models
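
The sketch below illustrates, under stated assumptions, the feature-extraction part of this pipeline: loading an audio file at a gender-specific sample rate, rendering the dB spectrogram as an 8-bit image, extracting ORB keypoints and descriptors, clustering the descriptors with K-means into a visual vocabulary, and building one histogram per file. The image-conversion step, file handling, and variable names are our assumptions rather than the exact implementation.

```python
# Minimal sketch of the spectrogram -> ORB -> bag-of-visual-words pipeline
# (assumes librosa, OpenCV, scikit-learn, and NumPy are installed; `files` is a
# list of .wav paths, and sample_rate / n_keypoints / n_clusters are set per
# model, e.g. 22,050 Hz / 150 / 10 for the female-model as reported in Table 3).
import cv2
import librosa
import numpy as np
from sklearn.cluster import KMeans

def spectrogram_image(path, sample_rate):
    """Load audio at the chosen sample rate and return its dB spectrogram as an 8-bit image."""
    y, _ = librosa.load(path, sr=sample_rate)
    spec_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    # Rescale to 0-255 because ORB expects an 8-bit grayscale image.
    return cv2.normalize(spec_db, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

def orb_descriptors(image, n_keypoints):
    """Detect ORB keypoints and return their 32-byte binary descriptors."""
    orb = cv2.ORB_create(nfeatures=n_keypoints)
    _, descriptors = orb.detectAndCompute(image, None)
    return descriptors  # shape (n_detected, 32), or None if no keypoints were found

def bovw_histograms(files, sample_rate, n_keypoints, n_clusters):
    """Cluster all descriptors into a visual vocabulary and build one histogram per file."""
    descriptors = [orb_descriptors(spectrogram_image(f, sample_rate), n_keypoints) for f in files]
    descriptors = [d for d in descriptors if d is not None]  # skip files without keypoints
    vocabulary = KMeans(n_clusters=n_clusters, random_state=0).fit(np.vstack(descriptors))
    return np.array([np.bincount(vocabulary.predict(d), minlength=n_clusters) for d in descriptors])

# Example (female-model settings): histograms = bovw_histograms(files, 22050, 150, 10)
```

The resulting histograms can then be written to .csv files and split 75:25 into training and test sets, as described above.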

4 Results

We analyzed our results in two phases. First, based on accuracy, we established the best gender-agnostic-model, female-model, and male-model by performing hyperparameter optimization. Second, we investigated whether the optimal model for a given gender retains its accuracy when applied to the opposite gender, in order to assess the algorithms’ gender-related bias.

Table 2 shows 27 experiment scenarios developed for the male-model with a constant sample rate of 11,025 Hz. For instance, scenario #5 involves a sample rate of 11,025 Hz, 100 keypoints, 10 clusters, and the SVM classification algorithm. We employed two additional sample rates, 22,050 Hz and 44,100 Hz, resulting in a total of 81 distinct experiment scenarios constructed from the male audio samples (3 sample rates for each of the 27 variations). We employed the same methodology to derive the optimal female-model from the female audio samples, as well as the gender-agnostic-model from audio samples of both genders. Hence, a total of 243 (= 3 × 81) scenarios were generated by employing three distinct models, each comprising 81 unique scenarios; a short sketch of this scenario grid follows Table 3. In the interest of brevity, not all scenarios are presented in the tables. Table 3 shows the combinations of parameters that generated the best models for the three types of audio samples (female, male, and both).

Table 2 Hyperparameter optimization for male-models
Table 3 Best parameters for each gender-specific model and gender-agnostic model
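
For reference, the 243 scenarios follow from a full factorial grid over the three sample rates, three keypoint counts, three cluster counts, and three classifiers, repeated for the three model types; the snippet below merely enumerates that grid to make the counts explicit.

```python
# Enumerate the experiment grid: 3 x 3 x 3 x 3 = 81 scenarios per model type,
# and three model types (female, male, gender-agnostic) give 243 scenarios overall.
from itertools import product

sample_rates = [11025, 22050, 44100]        # Hz
keypoint_counts = [100, 150, 300]           # ORB keypoints
cluster_counts = [5, 10, 20]                # K-means vocabulary sizes
classifiers = ["RF", "SVM", "KNN"]
model_types = ["female", "male", "gender-agnostic"]

scenarios = list(product(model_types, sample_rates, keypoint_counts, cluster_counts, classifiers))
print(len(scenarios))                        # 243
print(len(scenarios) // len(model_types))    # 81 per model (27 per fixed sample rate)
```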

Table 4 reports each classifier’s accuracy, macro-average precision, and recall for the three models. The SVM classifier achieved an accuracy of 72.5% (precision 73.06%, recall 73.96%) for the female-model, 72% (precision 72.50%, recall 71.63%) for the male-model, and 56% (precision 59.23%, recall 56.54%) for the gender-agnostic-model. The KNN classifier achieved an accuracy of 72% (precision 75.69%, recall 75.03%) for the female-model, 72% (precision 73.26%, recall 71.47%) for the male-model, and 59% (precision 62.32%, recall 59.48%) for the gender-agnostic-model. RF performed best among the three classifiers for all three models: 78% accuracy (precision 78%, recall 78.41%) for the female-model, 73.91% accuracy (precision 73.87%, recall 73.99%) for the male-model, and 65.77% accuracy (precision 67.78%, recall 66.43%) for the gender-agnostic-model. In summary, Table 4 shows that RF outperformed SVM and KNN across all three models in terms of accuracy, precision, and recall, while SVM and KNN showed varying performance across the different models.

Table 4 Different models’ accuracy, precision, and recall for different algorithms
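
The metrics in Table 4 correspond to accuracy plus macro-averaged precision and recall on a held-out 25% test split. A rough scikit-learn sketch is shown below; the synthetic `X` and `y` arrays stand in for the bag-of-visual-words histograms and sentiment labels, and the grid-search parameters at the end are illustrative only (the actual search space is listed in Table 1).

```python
# Sketch: train RF, SVM, and KNN on bag-of-visual-words histograms and report
# accuracy, macro precision, and macro recall, mirroring the layout of Table 4.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score

rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(442, 10))   # stand-in for 442 histograms over 10 visual words
y = rng.integers(0, 2, size=442)          # stand-in for happiness/sadness labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

classifiers = {
    "RF": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
}

for name, clf in classifiers.items():
    y_pred = clf.fit(X_train, y_train).predict(X_test)
    print(name,
          f"accuracy={accuracy_score(y_test, y_pred):.2%}",
          f"macro precision={precision_score(y_test, y_pred, average='macro'):.2%}",
          f"macro recall={recall_score(y_test, y_pred, average='macro'):.2%}")

# Hyperparameters can then be tuned per model, e.g. with an (illustrative) grid search:
rf_search = GridSearchCV(RandomForestClassifier(random_state=42),
                         {"n_estimators": [100, 300], "max_depth": [None, 10]}, cv=5)
rf_search.fit(X_train, y_train)
```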

Gender-agnostic-model This model was built using both male and female audio files. We performed hyperparameter optimization, and the best combination was a sample rate of 44,100 Hz, 100 keypoints (using an ORB extractor), 5 clusters while building bag-of-visual-words, and an RF algorithm as the classifier.

Female-model This model was customized for female audio files. We performed hyperparameter optimization, and the best combination was a sample rate of 22,050 Hz, 150 keypoints (using an ORB extractor), 10 clusters while building bag-of-visual-words, and an RF algorithm as the classifier.

Male-model This model was customized for male audio files. We performed hyperparameter optimization, and the best combination was a sample rate of 11,025 Hz, 300 keypoints (using an ORB extractor), 10 clusters while building bag-of-visual-words, and an RF algorithm as the classifier.

In the second phase, our objective was to investigate whether employing a single model for different genders fails to provide an impartial solution for audio sentiment analysis, which would imply that machine learning models can exhibit gender biases.

Table 5 presents the accuracy results of three distinct models—Gender-agnostic-Model, Female-Model, and Male-Model—based on the gender of the audio files they were tested on. The accuracy percentages reflect how well each model performed under different gender scenarios:

Table 5 Accuracy of models based on the gender of audio files

Both genders’ audio files The gender-agnostic-model achieved an accuracy of 66%, the female-model scored 62%, and the male-model attained 64% accuracy when tested on audio files containing speech from both male and female speakers.

Female audio files When tested on audio files with female speech, the gender-agnostic-model exhibited an accuracy of 62%, the female-model demonstrated higher accuracy at 78%, and the male-model had an accuracy of 60%.

Male audio files When evaluated on audio files containing male speech, the gender-agnostic-model achieved an accuracy of 61%, the female-model scored 57%, and the male-model displayed the highest accuracy at 74%.

In summary, the table provides insights into how each model’s accuracy varies across different gender-specific scenarios, indicating their performance on audio files with both male and female speech, exclusively female speech, and exclusively male speech.

5 Discussion and conclusion

Our dataset has a good representation of diverse geographies and cultures: we employed multilingual audio files, comprising 1,115 emotional utterances in English, Spanish, and Italian, with both male and female speakers, for audio sentiment analysis. In our experiments, we observed that machine learning models do not consistently exhibit fairness, where fairness entails treating various demographic groups equally. The gender-specific models struggled to capture the sentiment of the gender they were not trained on, and the gender-agnostic model showed poor performance in accurately classifying sentiments for both female and male audio samples, as evidenced by the accuracy results. The performance of our best models shows that the spectrogram representation of audio data captures the salient features of female voices better, resulting in higher accuracy for the female sentiment analysis model (78%) than for the male-model (74%). The models further show biased performance when the audio samples used to train a sentiment analysis model differ in gender representation from those used for testing. This bias can result from voice features differing between genders.

A study in [38] explores the role of gender in audio emotion analysis. The researchers conducted experiments with various features to investigate how these features differ by gender. For instance, they observed that pitch, determined by the fundamental frequency, tends to be lower in adult male voices than in female voices, while amplitude, which determines loudness, is slightly higher in males than in females. The spectrum, reflecting the energy distribution of the voice in the frequency domain and characterizable using vocal jitter and shimmer, also exhibits gender-related variations, and other features such as vocal tract length, harmonic structure, and speech rate likewise differ between genders. These gender-related differences in audio features highlight the importance of not using the same parameters when building models, as doing so can introduce bias. Consequently, in our experiments we chose to construct personalized models instead of gender-agnostic models to address these disparities. Models designed for specific genders in audio sentiment analysis are trained on gender-segregated data, potentially allowing them to capture gender-specific nuances and patterns more effectively and thereby improve accuracy. Enhancing accuracy in audio sentiment analysis can have widespread benefits across domains such as healthcare, business, education, and social media.
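
As an illustration of how such feature differences might be inspected, the short sketch below estimates mean fundamental frequency (pitch) and RMS energy (loudness) per group with librosa; the directory layout is hypothetical, and this snippet is not part of our experimental pipeline.

```python
# Sketch: compare mean fundamental frequency (pitch) and RMS energy (loudness)
# between groups of audio files, illustrating the gender-related differences
# discussed above. The "audio/female" and "audio/male" directories are hypothetical.
from pathlib import Path
import librosa
import numpy as np

def mean_pitch_and_loudness(path):
    y, sr = librosa.load(path, sr=None)
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
    return np.nanmean(f0), float(np.mean(librosa.feature.rms(y=y)))

for group in ("female", "male"):
    files = sorted(Path(f"audio/{group}").glob("*.wav"))
    stats = np.array([mean_pitch_and_loudness(f) for f in files])
    print(group, "mean F0 (Hz):", round(stats[:, 0].mean(), 1), "mean RMS:", round(stats[:, 1].mean(), 4))
```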

The three pieces of evidence shown in the results section clearly establish that using AI algorithms with the same parameters to build a one-size-fits-all audio sentiment analysis model is not fair to audio files from the two genders, male and female. In other words, features and modeling techniques can introduce biases. We observed the following three pieces of evidence in our experiments:

  1.

    The gender-agnostic-model showed poor performance on the audio samples from both female and male voices. Additionally, neither of the gender-specific models (i.e., female-model and male-model) achieved overall accuracy equality when tested with the audio samples from both genders.

  2.

    The model that performed best with female audio samples performed significantly worse when tested with male audio samples.

  3.

    The model that performed best with male audio samples performed significantly worse when tested with female audio samples.

The results displayed in Table 5 substantiate the need for gender-specific personalized models or algorithms that can handle these differences and still provide better results. Another valuable insight we inferred from the results is that the male-model performed better than the female-model in a general setting, i.e., when both genders’ audio samples were used (male-model 64% vs. female-model 62%). Additionally, the male-model performed slightly better than the female-model when the opposite gender’s audio samples were used, i.e., female audio samples for the male-model (60%) and male audio samples for the female-model (57%). However, the best model overall in terms of accuracy is the female-model, with 78% accuracy when tested against female audio samples. These comparisons indicate that ML algorithms exhibit gender biases in audio sentiment analysis tasks: performance serves as one indicator of algorithmic bias, and these accuracy comparisons demonstrate bias in the AI algorithms employed for audio sentiment analysis.

In this experiment, we have tried to showcase a scenario that might occur in real-world settings. If a model is trained on male-voiced audio samples during development but later encounters female-voiced samples at deployment, or vice versa, the model will perform unfairly. From the findings in Table 5, it is apparent that utilizing the male-model for female audio samples results in a low accuracy of 60%. Additionally, the gender-specific models exhibit subpar performance on audio datasets containing both male and female audio. This is explicitly demonstrated in Table 5: the accuracy of the male-model is 74%, but when applied to both genders’ audio datasets it drops to 64%; similarly, the accuracy of the female-model is 78%, but when applied to both genders’ audio datasets it decreases to 62%. Hence, in the real world, implementing gender-specific models trained for each gender can enhance audio sentiment analysis performance. For example, in call centers, analyzing the sentiment of customer calls proves invaluable for issue identification, customer satisfaction evaluation, and improving service quality. It is essential to note that if the training data consists primarily of male audio and the model is deployed in situations where both male and female voices are present, it might struggle to effectively capture sentiments. Another area that could benefit from gender-specific audio sentiment analysis models is the entertainment industry, where understanding the diverse responses of different genders to content such as movies and TV shows can yield crucial insights that can be used to tailor content to specific target audiences, enhancing engagement and reception. Numerous other fields could also benefit from gender-specific audio sentiment analysis models, including healthcare, voice-based security systems, social media monitoring, and online learning platforms. Nonetheless, considering demographic factors in audio sentiment analysis models requires careful attention to ethical considerations. It is important to minimize any potential unfair advantages or disadvantages to any demographic group. Additionally, obtaining users’ consent and transparently communicating the purpose behind using gender-specific models are critical for ensuring informed decision-making. Achieving a balance between the benefits of such models and upholding privacy, fairness, and inclusivity is critical for promoting ethical model deployment in real-world applications.

We developed gender-specific trained models to treat both genders’ audio files fairly. Such models could be deployed in various real-world scenarios, including but not limited to call centers, the entertainment industry, and healthcare. The dataset we used included demographic information such as gender in the filename, which made it easy to separate the files by gender and develop the gender-specific models. In many cases, however, datasets do not provide such demographic information. In such a scenario, one solution could be to first apply a gender recognition model and then pass each audio file to the corresponding gender-specific personalized model. Personalizing the model in this way could increase its accuracy for both genders: averaging the results of the two gender-specific models (the male-model achieved 74% accuracy on male audio input and the female-model 78% on female audio input) gives an accuracy of 76%, surpassing the 66% obtained when both genders’ audio samples were used together without gender consideration. This is the resolution we offer to mitigate bias: constructing audio sentiment analysis models that consider the demographic factor of gender. Developing an ensemble of a gender recognition model and gender-specific models and combining the results provides a personalized touch instead of a one-size-fits-all approach, as sketched below.
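
A minimal sketch of this two-stage ensemble follows, assuming a pre-trained gender recognizer and the two gender-specific sentiment classifiers are available; `gender_recognizer`, `female_model`, `male_model`, and `extract_histogram` are hypothetical placeholders (the latter could be built from the bag-of-visual-words pipeline sketched in Sect. 3), and the per-gender settings are those reported in Table 3.

```python
# Sketch of the proposed ensemble: recognize the speaker's gender first, then apply
# the matching gender-specific sentiment model with that gender's best settings.
# `gender_recognizer`, `female_model`, `male_model`, and `extract_histogram` are
# hypothetical pre-trained components, not part of a released implementation.

def predict_sentiment(audio_path):
    # Stage 1: gender recognition (any off-the-shelf gender-ID model could be used here).
    gender = gender_recognizer.predict(audio_path)  # returns "female" or "male"

    # Stage 2: feature extraction with the recognized gender's best parameters (Table 3),
    # followed by the corresponding gender-specific sentiment classifier.
    if gender == "female":
        hist = extract_histogram(audio_path, sample_rate=22050, n_keypoints=150, n_clusters=10)
        return female_model.predict([hist])[0]
    hist = extract_histogram(audio_path, sample_rate=11025, n_keypoints=300, n_clusters=10)
    return male_model.predict([hist])[0]

# Usage: sentiment = predict_sentiment("customer_call.wav")  # "happiness" or "sadness"
```

Note that errors in the gender recognition stage propagate to the sentiment stage, so the recognizer’s accuracy bounds the gain such an ensemble can deliver in practice.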

In conclusion, similar to other domains such as facial recognition, healthcare, and business, our findings indicate that audio sentiment analysis is subject to gender bias. The evident disparity in model accuracy highlights this bias: the female-specific model attains 78% accuracy on female audio datasets, but its accuracy decreases to 62% on audio datasets covering both genders; similarly, the male-specific model achieves 74% accuracy on male audio datasets, but its accuracy declines to 64% on datasets covering both genders. Furthermore, the gender-agnostic model performs notably worse than the gender-specific models, yielding an accuracy of only 66%. We propose an ensemble approach comprising a gender recognition model and gender-specific models to raise the overall accuracy of audio sentiment analysis from 66% to 76%. This ensemble approach holds potential advantages in various real-world applications, such as call centers, advertising, and the entertainment industry; however, careful ethical evaluation is necessary to ensure responsible deployment. Researchers increasingly acknowledge and underscore the significance of fairness in machine learning and AI systems, yet we found no existing studies that specifically investigate the fairness of machine learning algorithms with respect to gender bias in audio sentiment analysis. This paper employs three popular machine learning algorithms (RF, SVM, and KNN) to demonstrate the significant impact that gender differences can have on model accuracy.

6 Limitations

In our experimental study, we were able to detect the presence of bias related to demographic factors such as gender within machine learning models. With a more diverse dataset covering additional demographic characteristics, including race, age, and ethnicity, we could potentially develop more equitable models that still exhibit high accuracy. The core idea we emphasize is the importance of recognizing demographic representations, whether related to gender, race, or ethnicity, within datasets; instead of pursuing a one-size-fits-all approach, adopting a demographic segmentation strategy is essential. Our focus was on binary genders because we only had data from individuals of these genders; further research is needed to determine whether our claim holds for people who identify beyond the binary gender spectrum. Many factors may contribute to biases in audio sentiment analysis, and our research highlights gender as one such factor. Our findings are drawn from the dataset used in this study and have not been validated on alternative datasets; we plan to test the reproducibility of our findings on diverse datasets in future research.