1 Introduction

Speech emotion recognition (SER) is a branch of automatic emotion recognition and automatic speech recognition [1]. It identifies the emotional state of a speaker by analyzing the acoustic features and linguistic content of speech, and is currently applied to multimodal generation tasks [2], assisted psychotherapy [3], video games [4], and telephone services [5]. The speech emotion recognition task comprises two main phases: feature extraction and emotion classification. The speech signal is first processed according to its time-domain and frequency-domain characteristics to quantize the raw waveform. The processed data are then fed into deep learning models for emotion classification. The most popular models are convolutional neural networks (CNNs) [6], recurrent neural networks (RNNs) [7], long short-term memory networks (LSTMs) [8], and large-scale speech recognition models [9]. However, vocal state and emotional expression vary over time, and accurately identifying the emotional state within a short time window remains a great challenge.

Graph neural networks (GNNs) extend convolutional networks to non-Euclidean data; their core idea is to build interpretable feature representations from the associations within the data [10]. They have been successfully applied to computer vision and natural language processing tasks. Because speech is a linear sequence, it is difficult to convert into irregular non-Euclidean data, so the application of graph neural networks in speech signal processing has been limited. In recent years, researchers have treated linear sequences as a special case of graphs and applied graph convolutions as encoders via transformations such as line graphs, cycle graphs [11, 12], and complete graphs [13], building lightweight architectures with excellent performance. However, the relational structures of these constructions follow a single pattern; the graph convolution is constrained by the graph topology, which limits its flexibility and results in poor generalization ability in complex scenes.

This paper focuses on sentence-level speech emotion classification. Individual frames are treated as nodes within the framework. The backbone of the graph is a cycle graph, and the feature similarity between speech frames is computed to determine additional connections between nodes; specifically, the K edges with the highest weights are selected. To handle the resulting complex topology, we adopt the message passing neural network (MPNN), a spatial-domain convolution framework, to design a more flexible classification model.

Our contributions are as follows.

  1) A more adaptable directed speech graph is constructed by leveraging feature similarity, allowing greater flexibility in representing speech.

  2) A graph neural network architecture based on an LSTM aggregator is introduced; it employs a message passing mechanism to capture input dependencies and enables accurate recognition of speech emotion, particularly on graphs of higher complexity.

  3) A weighted graph pooling operation for graph-level classification tasks is proposed to extract global features. Experimental results show that the weighted pooling effectively removes redundant information and leads to a more stable convergence trend.

2 Related work

2.1 SER based on deep learning

Currently, SER classifiers can be categorized into two types: traditional classifiers and deep learning classifiers. Traditional classifiers include Gaussian mixture models (GMMs), hidden Markov models (HMMs), and support vector machines (SVMs) [14], which rely heavily on preprocessing and precise feature engineering [15]. With the development of deep learning, the performance of SER has improved significantly. Some studies combine deep neural networks (DNNs) with traditional classifiers; e.g., [16] proposes a DNN-decision-tree SVM model that captures more distinctive emotion features than traditional SVM and DNN-SVM.

Most neural-network-based recognition frameworks use CNNs, LSTMs, and combinations of the two [17, 18]. For example, [19] modifies the initial model with an incremental approach and feeds multiple acoustic features to a 1D CNN, improving accuracy. [20] builds a robust and effective recognition model on key sequence fragments by combining a CNN and a BiLSTM. The attention mechanism is another key tool for deep learning recognizers to deal with hidden information. Attention-based DNNs can mine unevenly distributed features in speech and emphasize salient emotional information, adapting better to changes in speech emotion [21]. By directing self-attention to missing and hidden information, the more robust structure in [22] achieves satisfactory performance. A further challenge of building neural SER systems is poor generalization caused by data mismatch. To address this problem, [23, 24] make significant progress on generalization by sharing feature representations among auxiliary tasks through multi-task learning. However, traditional deep learning recognition systems have complex structures and weak interpretability of speech features. The graph has therefore been introduced into speech tasks as a compact and efficient representation, and the superiority of GNNs in graph processing has attracted widespread attention.

2.2 SER based on GNNs

At present, the application of graph neural networks in speech technology still has some limitations [25], but several studies have verified the advantages of graph convolution in this field and its potential for wide use, e.g., conversational speech recognition [26], sentence-level [27] and conversation-level speech emotion recognition [28], speech enhancement [29], and Q&A rewriting [30]. Graph construction methods can be divided into sample-point-based, frame-based, speech-channel-based, and historical-dialogue-based approaches, as shown in Fig. 1. In addition, graph neural networks perform well in low-resource speech emotion recognition; for example, [31] uses transductive integrated learning algorithms based on graph neural networks to address Portuguese speech emotion classification.

Fig. 1

Four examples of graph construction used in the above studies. The nodes of these graphs are frames, sample points, speech channels, and dialogues

In current studies, researchers mostly use frame-based construction, where each frame is treated as a node, and down-sampling is applied to reduce the number of frames and simplify the structure. For example, [11] models the speech signal as a frame-based cycle graph and constructs a lightweight yet precise graph convolution architecture, achieving performance comparable to existing techniques. The studies [10, 12, 25] extend the context acceptance range by connecting neighbors within specific time spans on deep frame-level features obtained by recurrent neural networks. Similarly, [32] extends the approach to conversational speech emotion recognition by introducing a CNN-BiLSTM to extract conversation features and constructing edges through a fixed past-context window. These studies depend heavily on the feature-processing capability of sequence models, and their connections are relatively fixed. [33] proposes an idealized graph structure based on cosine similarity and constructs a graph convolutional network with better robustness. In practical applications, however, speech sequences suffer from high feature similarity and feature instability, so a threshold-based approach is not suitable for realistic scenarios.

To address the inflexible graph structures and poor generalization ability of the above studies, this paper proposes a graph neural network based on an LSTM aggregator and weighted pooling, transforming the speech emotion recognition task into a graph classification task.

3 Proposed approach

In this section, we discuss each component of the Graph-LSTM neural network (GLNN) in detail.

3.1 Graph construction

Inspired by studies [11, 12], the speech signal is split into frames, and each frame is treated as a node. To preserve feature integrity and build a scalable graph, down-sampling and fixed-length cutting are discarded. Speech with a variable number of frames is transformed into a graph based on temporal relationships and feature similarity; the resulting speech graph is therefore heterogeneous.

The graph is represented as \(G = (V,E)\), where V is the set of nodes and E is the set of edges. The node feature matrix is denoted as X, \(X \in R^{n \times D}\), where n is the number of nodes, D is the feature dimension, and \(x_{i}\) is the feature vector of the i-th node, composed of a set of low-level descriptors extracted by openSMILE 3.0. The edges fall into two categories. The first consists of directed edges constructed from the temporal relationship: the one-way edges \(\{ v_{i}\rightarrow v_{i + 1}\}_{i = 1}^{n - 1}\) are built depending only on time, and the loop is closed by \(v_{n}\rightarrow v_{1}\). This directed cycle graph serves as the backbone and improves the stability of the graph structure. The second category consists of directed edges obtained from the feature similarity calculation. To reduce computational complexity, the dot-product similarity is used as follows:

$$\begin{aligned} X = \frac{X}{\left\| X \right\| _{2}},\qquad weights = X \cdot X^{T} \end{aligned}$$
(1)
$$\begin{aligned} edges = \left\{ e_{ji} \mid j \in TopK(weights,k) \right\} _{i = 1}^{n} \end{aligned}$$
(2)

where \(X \in R^{n \times D}\) is the node feature matrix, n is the number of nodes, and D is the feature dimension. X is normalized, and the dot-product similarity between nodes is computed to obtain the similarity weights. edges denotes the set of constructed edges, j is the index of a neighbor of the i-th node selected by the TopK function, and \(e_{ji}\) is the edge built between the i-th and j-th nodes, pointing from the j-th node to the i-th node.

The heat map of the weights is shown in Fig. 2. It shows that feature similarity between nodes is greatest in the region centered on the diagonal and remains high within a small neighborhood, which is consistent with the temporal characteristics of speech. To screen out redundant information and select the edges with the highest correlation, the TopK algorithm [34] selects the k nodes most similar to the target node \(v_{i}\). Experimental verification showed that setting k to 10 yields more stable convergence of the model.
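To make the construction concrete, the following is a minimal sketch of Eqs. (1)-(2); it is illustrative only (the function name, the use of PyTorch, and the row-wise L2 normalization as a reading of Eq. (1) are our assumptions, not the authors' released code):

```python
import torch

def build_speech_graph(X: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Return edge_index (2, E): a directed cycle backbone plus
    top-k feature-similarity edges (Eqs. 1-2). X is an (n, D) frame-feature matrix."""
    n = X.size(0)

    # Backbone: v_i -> v_{i+1}, with v_n -> v_1 closing the cycle.
    src = torch.arange(n)
    dst = torch.roll(src, shifts=-1)
    backbone = torch.stack([src, dst])

    # Similarity edges: row-wise normalization, dot-product similarity, keep k best per node.
    Xn = X / X.norm(dim=1, keepdim=True).clamp_min(1e-8)
    weights = Xn @ Xn.t()                      # (n, n) similarity matrix
    weights.fill_diagonal_(float('-inf'))      # exclude self-loops
    topk = weights.topk(min(k, n - 1), dim=1).indices
    dst_sim = torch.arange(n).repeat_interleave(topk.size(1))
    src_sim = topk.reshape(-1)                 # edge e_{ji}: neighbor j -> target i
    similarity_edges = torch.stack([src_sim, dst_sim])

    return torch.cat([backbone, similarity_edges], dim=1)
```

In this sketch, each utterance yields one edge_index tensor that is later passed to the graph convolution layers.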

Fig. 2

Similarity weighting heat map

3.2 Graph-LSTM neural network

The structure of the Graph-LSTM neural network (GLNN) is shown in Fig. 3. The architecture, built on the speech graph, consists of three graph convolution layers, a pooling layer, and a classifier. In Fig. 3, A is the overall structure of GLNN; B is the structure of the graph convolution, consisting of the LSTM aggregator and the linear updater; C is the structure of the weighted pooling layer.

Fig. 3

The structure of GLNN. A is the overall structure of GLNN; B is the structure of the graph convolution, consisting of the LSTM aggregator and the linear updater; C is the structure of the weighted pooling layer. In addition, the solid lines represent the backbone, and the dashed lines represent the possible edges constructed by similarity in the graph of A

The model is built on the message passing network, whose forward pass has two phases: message aggregation and readout [35]. The convolution layers of the Graph-LSTM model consist of an aggregator and an updater.

$$\begin{aligned} x_{i}^{'} = \varphi _{\alpha }\left( x_{i} \right) \end{aligned}$$
(3)
$$\begin{aligned} x_{aggr} = \underset{j \in N(i)}{Aggregator_{LSTM}}\left( \oplus x_{j} \right) \end{aligned}$$
(4)
$$\begin{aligned} x_{up} = \varphi _{\beta }\left( x_{i}^{'} \oplus x_{aggr} \right) + \gamma \end{aligned}$$
(5)

where \(\varphi _{\alpha }\) and \(\varphi _{\beta }\) are linear transformations; N(i) is the neighborhood of the target node; \(x_{aggr}\) is the neighborhood feature obtained by aggregation; \(x_{i}\) is the feature vector of the i-th node; and \(x_{j}\) denotes the feature vectors of adjacent nodes.
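A minimal, unoptimized sketch of Eqs. (3)-(5) is given below, again assuming PyTorch; the class name, dimensions, and the per-node Python loop are illustrative choices, and since Eq. (4) does not fix a neighbor ordering, neighbors are fed to the LSTM in index order here:

```python
import torch
import torch.nn as nn

class GLNNConv(nn.Module):
    """One Graph-LSTM convolution: linear transform (Eq. 3), LSTM aggregation
    over in-neighbors (Eq. 4), and a linear update on the concatenation (Eq. 5)."""
    def __init__(self, in_dim: int, hid_dim: int, out_dim: int):
        super().__init__()
        self.phi_alpha = nn.Linear(in_dim, hid_dim)                    # Eq. 3
        self.aggr_lstm = nn.LSTM(in_dim, hid_dim, batch_first=True)   # Eq. 4
        self.phi_beta = nn.Linear(2 * hid_dim, out_dim)                # Eq. 5 (bias plays the role of gamma)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x: (n, in_dim); edge_index: (2, E) with edges src -> dst
        n = x.size(0)
        x_prime = self.phi_alpha(x)
        aggregated = x_prime.new_zeros(n, self.aggr_lstm.hidden_size)
        src, dst = edge_index
        for i in range(n):                     # per-node aggregation: clear, not fast
            neighbours = src[dst == i]
            if neighbours.numel() == 0:
                continue
            seq = x[neighbours].unsqueeze(0)   # (1, deg(i), in_dim)
            _, (h_n, _) = self.aggr_lstm(seq)
            aggregated[i] = h_n[-1, 0]         # last hidden state as x_aggr
        return self.phi_beta(torch.cat([x_prime, aggregated], dim=1))
```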

Based on the graph structure of Section 3.1 and considering the continuity and complexity of speech features, a simple aggregation operation cannot adequately capture the dependencies among neighboring frames; the LSTM aggregator is therefore adopted in Eq. (4) to aggregate neighborhood features. For the graph-level classification task, the global feature is then obtained by a weighted pooling readout:

$$\begin{aligned} x_{pooling} = \alpha \cdot \max _{i = 1}^{n}\left( x_{i} \right) + \beta \cdot mean_{i = 1}^{n}\left( x_{i} \right) + \lambda \cdot sum_{i = 1}^{n}\left( x_{i} \right) \end{aligned}$$
(6)

where max, mean, and sum denote the three types of global pooling operations; \(x_{i}\) is the feature vector of the i-th node; \(x_{pooling}\) is the global feature vector; and \(\alpha\), \(\beta\), and \(\lambda\) are the weights of the three pooling operations, set to {0.3, 0.3, 0.3} in the experiments. Through the weighted pooling, feature integrity is retained while redundant information is removed.
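A short sketch of Eq. (6) for a single graph might look as follows (the function name and the per-graph formulation are assumptions; in a batched setting the same reductions would be applied separately to each graph):

```python
import torch

def weighted_readout(x: torch.Tensor,
                     alpha: float = 0.3, beta: float = 0.3, lam: float = 0.3) -> torch.Tensor:
    """Weighted global pooling of Eq. (6).
    x: (n, D) node features of one graph; returns a (D,) graph-level vector."""
    return alpha * x.max(dim=0).values + beta * x.mean(dim=0) + lam * x.sum(dim=0)
```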

4 Experiments

4.1 Dataset and features

The dataset used in this study is the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database [38], containing 12 hours of audiovisual data. The data were collected from dyadic situational dialogues performed by actors in a scripted or improvised manner; the actors' facial expressions and hand movements were recorded simultaneously during the communication. The speech emotion recognition task in this study uses only the speech data, drawn from five dyadic sessions segmented into utterances. IEMOCAP uses multiple annotators to label these data with 11 emotion categories. For objective experimental analysis and performance comparison, we use four classes, namely angry, happy, sad, and neutral, totaling 4490 utterances.

The extraction of audio features is done with the open-source tool openSMILE 3.0 [39]. openSMILE is a large-scale audio feature extractor widely used for affective computing tasks; feature extraction can be carried out through the command line and configuration files. The experiment uses the INTERSPEECH 2010 Paralinguistic Challenge feature set to extract a set of low-level descriptors (LLDs) consisting of MFCC, maxPos, amean, skewness, and smoothing, along with the corresponding first-order delta coefficients. The speech is framed by a fixed-size sliding window, with the frame length set to 25 ms and the shift set to 10 ms. In addition, a spontaneous binary feature is added to each frame, inspired by spontaneous learning [

For the time complexity of one graph convolution layer, \(D_{1}'\) denotes the mapping dimension and \(D_{2}'\) denotes the feature dimension of the outputs. Firstly, the features of all nodes are mapped with a time complexity of \(O(n*D*D_{1}')\). Then, feature aggregation is performed by the LSTM aggregator with a complexity approximately equal to \(O(n*D_{1}'^{2})\). Finally, feature updating is completed by the linear layer with a time complexity of \(O(n*D_{1}'*D_{2}')\). In summary, the time complexity is \(O(n(D_{1}'*D_{2}'+D_{1}'^{2}))\).
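As a practical note on the LLD extraction described above, a minimal sketch using the opensmile Python package is shown below. It is an assumption that this wrapper is acceptable in place of the command-line tool used in the paper, and because the IS10 Paralinguistic configuration may only ship as a command-line config file, the ComParE_2016 LLD set is used here purely as a stand-in:

```python
import opensmile

# Frame-level low-level descriptors; the wrapper handles windowing internally.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)
lld = smile.process_file("utterance.wav")   # DataFrame: one row of LLDs per frame
print(lld.shape)                            # (num_frames, num_LLDs)
```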

Table 2 Comparison between different layers

Table 3 analyzes the effect of the k value used when constructing edges with the TopK algorithm, i.e., the effect of the number of edges. The results show that k = 10 yields a large improvement over k = 5, with a gain of 9.6% in WA. However, the performance gain is very small when k is increased to 15, indicating that the information obtained from adjacent nodes is saturated; increasing the number of edges brings no extra information gain.

Table 3 Comparison of number of K

Table 4 compares the effects of different pooling methods on accuracy. In addition to the three simple readout operations (max, mean, and sum), we also use TopK pooling to filter out 50% of the nodes before applying mean pooling. From Table 4, mean-pooling performs better than max-pooling and sum-pooling, but worse than topk-pooling, indicating that filtering nodes to remove redundant features helps improve performance. The weighted pooling maximally preserves the integrity of node features while effectively selecting representative ones, and achieves the best performance among the compared pooling methods. Figure 4 shows the test curves of the different pooling methods; as shown in Fig. 4, the weighted pooling effectively mitigates over-smoothing and converges more stably.

Table 4 Comparison between different pooling methods
Fig. 4

The convergence curves of five pooling methods. The blue, orange, green, red, and purple curves represent max-pooling, mean-pooling, sum-pooling, topk-pooling, and weighted-pooling, respectively. Two types of curves, the WA curve and the UA curve, are drawn separately

5 Conclusion

In this paper, we explore a graph neural network based on an LSTM aggregator and weighted pooling for the speech emotion recognition task. The procedure is as follows. First, speech features are extracted with openSMILE. Then, connections are selected for speech graph construction based on feature similarity and the TopK algorithm. Finally, a classification model is designed on the message passing architecture, converting speech classification into a graph classification task. Our evaluation on the IEMOCAP dataset demonstrates superior performance compared to the baseline models. However, there are some shortcomings at the current stage: 1) complex connections and a large number of redundant features in the graph; 2) unstable processing and analysis of small datasets; and 3) neglect of speaker information. In addition, the research focuses on adult speech and lacks exploration of children's speech emotion recognition [47, 48].

To address the aforementioned challenges, we will adopt the following strategies in the next stage. 1) For redundant features, we will consider more versatile approaches to graph construction to further reduce the required data size and optimize the model framework. 2) For data scarcity, a transfer learning strategy [49] will be adopted to design a multi-task framework for speech recognition and emotion recognition, improving adaptability to small-sample data through feature sharing. 3) For differences in speakers' acoustic and linguistic features, a speaker converter will be introduced to learn adaptive transformations, enabling the model to eliminate feature differences.