Introduction

Parkinson’s disease (PD) was first characterized by the British physician James Parkinson in 1817. It is a degenerative disorder of the nervous system marked by a range of motor and non-motor symptoms [1]. The main cause of PD is the loss of dopaminergic neurons in the substantia nigra. However, early-stage PD patients may not exhibit noticeable clinical symptoms; these become evident only after substantial neuronal loss, typically exceeding 60%. Since there is currently no cure for PD, early detection that enables timely intervention is critical.

The disease affects patients’ motor and non-motor systems, significantly impacting their daily lives. Resting tremor, bradykinesia, rigidity, and postural and gait instability are the four cardinal motor symptoms of PD [2]. Postural and gait problems can include freezing of gait, balance instability, falls, shuffling steps, and difficulty turning. Bradykinesia and freezing of gait typically occur in the middle and late stages of PD, while small steps, unsteadiness, and falls are more prevalent in the later stages [2].

Currently, early PD detection depends primarily on the observation of clinical symptoms. Treatment varies according to symptoms, and doctors often base their approach on the Unified Parkinson’s Disease Rating Scale (UPDRS) and the Hoehn and Yahr scale, which evaluate movement, activities of daily living, disease progression, post-treatment status, and treatment side effects and complications. However, such diagnosis depends largely on the clinician’s experience and can therefore be subjective and inaccurate.

To summarize, clinical practice using these scales faces two significant problems: 1) patients may miss early intervention opportunities because treatment starts late, and clinical symptoms can be confounded by other diseases that cause abnormal gait; 2) the diagnosis of PD relies too heavily on subjective scale scores and lacks objective indicators. An efficient and convenient way is therefore needed for doctors to diagnose PD and treat it according to its severity.

Parkinson’s disease patients exhibit distinctive gait characteristics; therefore, the automated quantification of gait motor disorders is crucial for the automated assessment of motor functions in PD patients. It is feasible to use sensors to capture patients’ movement information and to utilize sensor signals as an auxiliary means of diagnosing and treating Parkinson’s disease. With the advancement of deep learning, artificial intelligence has offered new opportunities for the early detection of Parkinson’s disease, and researchers have proposed numerous methods for the automatic prediction and evaluation of the disease’s pathological stage using machine learning. We categorize PD detection methods based on whether the sensors are in direct contact with the subject. Contact sensors include plantar pressure sensors, sensor-equipped pens, and wearable motion capture devices, among others. Non-contact methods primarily rely on gait analysis through video.

PhysioNet hosts a widely used dataset that collects gait signals via 16 pressure sensors, eight placed under each foot. Many researchers [3,4,5,6,7,8] have studied this dataset with promising results. For gait feature detection, Refs. [9,10,11,12] employed feature extraction methods: Refs. [9, 12] established a predictive pattern distinguishing normal gait from the gait of Parkinson’s disease patients based on timing signal detection, while Refs. [10, 11] derived a pattern based on signal frequency.

Regarding sensor-equipped pens, there are four Parkinson’s handwriting datasets: the PaHaW dataset [13], the HandPD dataset [14], the NewHandPD dataset [15], and the Parkinson’s Drawing Dataset [16]. These datasets record the process of participants writing patterns or words by hand with sensor-equipped pens, and many researchers [16, 17] have conducted studies on them. A wearable motion capture device-based method is proposed in [18], which captures human motion data using pressure or acceleration sensors and performs statistical analysis to extract motion features. Some researchers [19,20,21] have even developed wearable systems to detect PD symptoms. However, most such sensors must be in contact with the patient’s body and assembled in clinics with the help of trained experts, which inevitably affects the assigned movements of PD patients. Moreover, such rich and accurate data depend heavily on laboratory-based experiments, which complicates routine assessment. The practical application of sensor-based schemes in clinical settings has therefore been limited.

Since PD affects many parts of the body and causes a wide range of motion problems, an accurate assessment of PD severity relies on a comprehensive analysis of human motion features. With the rapid development of computer vision, non-contact methods based on gait video have been proposed to detect PD. The advantage of video-based diagnosis is that patients do not need to wear any sensors.

Reference [22] presents evidence for the relationship between human gait and PD and illustrates the role of different gait analysis systems based on vision or wearable sensors. To the best of our knowledge, Refs. [23,24,25,26,27,28,29] have proposed vision-based methods for the automated assessment of gait motor disorder. Reference [23] develops a PD gait regression model capable of predicting the severity of motor dysfunction from gait image sequences. Reference [24] uses a multivariate ordinal logistic regression method and assesses the relative contribution of each gait feature to the regression of UPDRS-gait and SAS-gait scores. Reference [25] proposes a novel two-stream spatial-temporal attention graph convolutional network for video assessment of PD gait motor disorder. However, these methods merely adopt traditional machine learning or adapt action recognition methods, and they lack a spatiotemporal analysis of gait data. Reference [26] introduced the first Parkinson’s disease gait dataset and proposed an end-to-end method for the early diagnosis of Parkinson’s disease based on a graph convolutional network; this approach takes the patient’s skeleton sequences as input and returns the diagnostic results.

We address the limitations of vision-based methods by establishing a quantification technique for the automated assessment of gait motor disorder in PD patients using video-based computer vision technology. Specifically, we extract joint sequences from videos using a human pose estimation algorithm and convert the skeleton sequence into a multivariate time series by calculating the distances between joints. Traditional skeleton joint recognition models take the original skeleton tensor as input to explore the relationships between joints. In contrast, we use the distances between skeleton joints as input, inspired by the simple mathematical principle that two triangles with three equal sides are congruent. The human skeleton is composed of joints, and the distances between these joints reflect the skeleton’s shape to some extent. This approach allows us to disregard the camera’s position relative to the human skeleton and focus on the changes within the skeleton itself.

However, quantitative assessment based on multivariate time series of distances faces three significant challenges: 1) current classification algorithms for multivariate time series tend to model single-channel data and combine multi-channel results with simple linear weighting, neglecting the interactions between channels; 2) the extraction of channel and step features from multivariate time series is crucial for classification; and 3) traditional fusion methods typically employ linear fusion or set gating, so the integration of step and channel features in multivariate time series remains a critical issue.

Some recent neural network methods can be applied to multivariate time series classification tasks. The Transformer network [30] was originally designed for machine translation and has a strong ability to encode time series data. References [31,32,33] focus on the global existence–uniqueness and input-to-state stability of the mild solution of impulsive reaction–diffusion neural networks with infinite distributed delays, which can be used in time–space sampled-data schemes. These methods offer new ideas for handling multivariate time series tasks.

In this paper, we use skeleton sequences extracted from video to analyze patients’ movement symptoms, thereby detecting Parkinson’s disease. We first introduce the Transformer network into Parkinson’s disease recognition and propose a Transformer network with a tensor fusion layer for the classification of multivariate time series. To explore the spatial and temporal relationships in the series, we leverage the multi-head attention mechanism of the Transformer encoder and construct two separate towers. A tensor fusion layer is constructed to model and fuse the spatial, temporal, and cross-spacetime features of gait. We deploy the model on the dataset collected by Ref. [26] and analyze the step and channel features of gait with attention maps. This algorithm excels at distinguishing between healthy individuals and those with PD and can be used in clinical practice to aid doctors in diagnosing Parkinson’s disease. Consequently, it facilitates early intervention for patients and reduces diagnosis and treatment costs. The method we propose therefore holds substantial practical value.

To summarize, our main contributions are as follows:

  • We focus on utilizing skeleton sequence information to determine whether subjects have Parkinson’s disease. To address the problems mentioned above, we first transform the coordinate sequences of the skeleton into multivariate time series of joint distances. We then apply a multivariate time series classification algorithm to address the issue, providing a novel approach for classification based on skeleton sequences.

  • We developed a Transformer network to fulfill the task of multivariate time series classification. Current classification algorithms for multivariate time series tend to model single-channel data and combine multi-channel results with simple linear weighting, paying little attention to the interaction between channels. To explore the channel and step relationships of the joint-distance series, we make full use of the multi-head attention mechanism of the Transformer encoder and build two towers: the channel-wise tower exploits the attention mechanism to explore the relationships between channels, while the step-wise tower uses an attention mechanism with positional encoding to learn step features of the distances.

  • Unlike conventional methods that rely on linear fusion or set gating, we innovatively employ tensor fusion. This fusion method’s advantage lies in its ability to comprehensively uncover the influence of step features, channel features, and their fusion features.

The rest of this paper is organized as follows: “Related work” section reviews related existing work. “Methods” section introduces the proposed method, and “Experiments and results” section presents the experimental system, results, as well as qualitative and quantitative analyses of the findings. The application of the model is discussed in “Application of the model” section. Finally, the study concludes in “Conclusions” section.

Related work

Skeleton-based action recognition for PD detection

In recent years, Parkinson’s disease recognition methods have received increasing attention owing to the important clinical value of early diagnosis. There are many algorithms for identifying Parkinson’s disease using plantar pressure [13,14,15,16,17,18], but few based on skeletons. With the development of human motion analysis methods, identifying Parkinson’s disease from skeletons has become feasible. High-precision human pose estimation algorithms, such as OpenPose [34] and AlphaPose [36], were used to extract human skeletons from each video clip. The dataset ultimately encompasses the coordinates of 17 skeleton joints obtained by HRNet [36] before and after video cutting. The extracted human skeleton is illustrated in Fig. 2; joint No. 17 is the midpoint between joints No. 5 and No. 6.

Fig. 2 Human skeleton extracted from the video

Preprocessing is a critical step in pattern recognition and machine learning. Following Ref. [26], the joints representing the head were removed to prevent overfitting. After completing these operations, we obtain a skeleton tensor \(M \in {\mathbb {R}}^{T \times V \times C}\) from each video clip, where T represents the number of video frames, V the number of joints, and C the location coordinate dimension. Specifically, we have \(T=72\), \(V=13\), and \(C=2\) (i.e., \(x\) and \(y\)). The visualization of a data sample is shown in Fig. 3.

Fig. 3 A visualization of a sample in the PD-Walk dataset

Table 2 Joint combinations

After completing the data preprocessing, we transform the coordinate sequences of the skeleton into a multivariate time series by calculating the Euclidean distance between every pair of joints. The list of joint combinations is displayed in Table 2. We then obtain \(CO=C_{13}^2=78\) series based on Eq. 1:

$$\begin{aligned}&P_{t,v} = \left( M_{t,v,c_1}, M_{t,v,c_2} \right) , \quad t \in [1,T],\ v \in [1,V],\ M_{t,v,c} \in M \\&co = Combination(v_i, v_j), \quad v_i \in [1,V],\ v_j \in [1,V] \setminus v_i,\ v_i < v_j,\ co \in [1,CO] \\&L_{t,co} = \left| P_{t,v_i} P_{t,v_j} \right| = \sqrt{ \left( M_{t,v_i,c_1} - M_{t,v_j,c_1} \right) ^2 + \left( M_{t,v_i,c_2} - M_{t,v_j,c_2} \right) ^2 } \end{aligned}$$
(1)

\(P_{t,v}\) represents the coordinates of skeleton joint v at frame t, and co indexes the joint combinations. According to Eq. 1, we can derive the multivariate time series matrix \(L \in {\mathbb {R}}^{T \times CO}\), which serves as the input for our network.
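For illustration, the following minimal NumPy sketch implements Eq. 1 under the dimensions above (function and variable names are ours, not part of the original pipeline); the final line applies the per-channel standardization mentioned below.

```python
# Convert a skeleton tensor M of shape (T, V, C) = (72, 13, 2) into the
# multivariate time series L of shape (T, CO) with CO = C(13, 2) = 78.
import numpy as np
from itertools import combinations

def joint_distance_series(M: np.ndarray) -> np.ndarray:
    T, V, C = M.shape
    pairs = list(combinations(range(V), 2))            # all joint pairs (v_i < v_j)
    L = np.empty((T, len(pairs)))
    for co, (vi, vj) in enumerate(pairs):
        # Euclidean distance between joints vi and vj at every frame t
        L[:, co] = np.linalg.norm(M[:, vi, :] - M[:, vj, :], axis=-1)
    return L

M = np.random.rand(72, 13, 2)                          # dummy skeleton clip
L = joint_distance_series(M)                           # (72, 78)
L = (L - L.mean(axis=0)) / (L.std(axis=0) + 1e-8)      # per-channel standardization
```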

The visualization of the standardized multivariate time series is shown in Fig. 4. From the figure, it is evident that the multivariate time series exhibits a certain periodicity. Our twin-tower Transformer network is specifically designed for such multivariate time series. We standardize the data before feeding it into the network, whose structure is described in the “Twin-tower transformer network” section.

Fig. 4 Multivariate time series of a sample in the PD-Walk dataset

Fig. 5 Model architecture of the twin-tower transformer network

Twin-tower transformer network

Our twin-tower Transformer network design builds upon improvements from various domains and tasks. We have developed a framework that leverages the inherent sequential invariance of Transformers and their ability to learn features through attention mechanisms. The traditional Transformer was originally designed for machine translation tasks and features an encoder–decoder structure. Reference [49] investigated a simple extension of Transformer networks with gating, named Gated Transformer Networks (GTN), for the multivariate time series classification problem, introducing three extensions: embedding, two towers, and gating. Following their research, we construct two towers to learn the relationships within and between channels. Unlike GTN, we add a linear layer after each tower to capture each tower’s features separately, and we introduce a tensor fusion layer that enables the model to fully capture the relationships between the towers. Our model architecture is depicted in Fig. 5.

As shown in Fig. 5, the overall architecture of our model consists of a step-wise tower, a channel-wise tower, a tensor fusion layer, and two cascaded feedforward neural networks, LR (combining linear and ReLU layers) and LS (combining linear and Softmax layers). Each tower comprises an input embedding module, stacked encoders, and an LS layer; in addition, positional encoding in the step-wise tower indicates temporal order. Following the overall architecture of [30], each encoder has a multi-head self-attention mechanism and a simple, position-wise fully connected feed-forward network, with a residual connection around each of the two sub-layers, followed by layer normalization.
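The following condensed PyTorch sketch illustrates this overall architecture. It is our illustrative reconstruction from Fig. 5, not the authors’ released code: layer sizes are assumptions, the step-wise attention mask is omitted, and a learned positional parameter stands in for the positional encoding of [30].

```python
import torch
import torch.nn as nn

class Tower(nn.Module):
    def __init__(self, in_dim, seq_len, d_model=64, nhead=4,
                 num_layers=2, d_fusion=32, use_pos_enc=True):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)            # input embedding
        self.pos = nn.Parameter(torch.zeros(seq_len, d_model)) if use_pos_enc else None
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.ls = nn.Sequential(nn.Linear(seq_len * d_model, d_fusion),
                                nn.Softmax(dim=-1))        # LS layer

    def forward(self, x):                                  # x: (B, seq_len, in_dim)
        h = self.embed(x)
        if self.pos is not None:                           # step-wise tower only
            h = h + self.pos
        return self.ls(self.encoder(h).flatten(1))         # (B, d_fusion)

class TwinTowerTransformer(nn.Module):
    def __init__(self, T=72, CO=78, d_fusion=32, d_hidden2=64, n_classes=2):
        super().__init__()
        self.step_tower = Tower(in_dim=CO, seq_len=T, d_fusion=d_fusion)
        self.chan_tower = Tower(in_dim=T, seq_len=CO, d_fusion=d_fusion,
                                use_pos_enc=False)         # no positional encoding
        self.decoder = nn.Sequential(                      # LR then LS (Fig. 5)
            nn.Linear((d_fusion + 1) ** 2, d_hidden2), nn.ReLU(),
            nn.Linear(d_hidden2, n_classes), nn.Softmax(dim=-1))

    def forward(self, L):                                  # L: (B, T, CO)
        f_so = self.step_tower(L)
        f_co = self.chan_tower(L.transpose(1, 2))          # swap time/channel axes
        one = torch.ones(L.size(0), 1, device=L.device)
        F_g = torch.einsum('bi,bj->bij',                   # outer product, Eq. 8
                           torch.cat([f_so, one], dim=1),
                           torch.cat([f_co, one], dim=1))
        return self.decoder(F_g.flatten(1))

scores = TwinTowerTransformer()(torch.randn(4, 72, 78))    # (4, 2) class scores
```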

Step-wise tower

As shown in Fig. 5, our step-wise tower consists of an input embedding module, positional encoding, stacked encoders, and an LS (combining linear and Softmax layers) layer.

The input embedding module aims to place step-wise inputs closer together in the embedding space. Specifically, a neural network consisting of a fully connected layer embeds the step-wise input \(L \in {\mathbb {R}}^{T \times CO}\) into a \(d_{model}\)-dimensional space, producing \(F_{sem} \in {\mathbb {R}}^{T \times d_{model}}\).

The traditional Transformer uses a positional encoding module to represent word order in natural language, distinguishing occurrences of the same word at different positions and reflecting the positional relationships between words. In our framework, we use a positional encoding module to distinguish the steps of the time series.
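Assuming the module follows the standard sinusoidal formulation of [30] (the paper does not spell out the variant), a sketch looks as follows; the sizes T=72 and d_model=64 are illustrative.

```python
import torch

def positional_encoding(T: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(T, dtype=torch.float32).unsqueeze(1)      # (T, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)         # even indices
    angle = pos / torch.pow(10000.0, i / d_model)                # (T, d_model/2)
    pe = torch.zeros(T, d_model)
    pe[:, 0::2] = torch.sin(angle)                               # even dims: sine
    pe[:, 1::2] = torch.cos(angle)                               # odd dims: cosine
    return pe

pe = positional_encoding(72, 64)   # added to the embedded step-wise input F_sem
```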

The encoders in the step-wise tower explicitly capture the step-wise correlation of the distance sequences through attention and masking, as shown in Fig. 5. To encode the temporal features of the distance sequences, we use self-attention with masking to attend to each point across all distance channels, calculating pair-wise attention weights among all time steps. In the multi-head self-attention layers, scaled dot-product attention forms the attention matrix over all time steps, so temporal dependencies are first retrieved by the encoders in the step-wise tower. The LS layer after the stacked encoders performs dimension reduction; from it we obtain the output feature of the step-wise tower, which serves as one of the inputs to the tensor fusion layer.

Our encoder follows the overall architecture of the encoder in the traditional Transformer. As shown in Fig. 6, we use a self-attention mechanism. Self-attention, described in the original Transformer [30], calculates semantic correlations between different items in a data series. Following the notation of [30], let Q, K, and V be the query, key, and value matrices generated by linear transformations of the input features \(F_{sem} \in {\mathbb {R}}^{T \times d_{model}}\), as follows:

$$\begin{aligned}&\left( Q, K, V \right) = F_{sem} \cdot \left( W^Q, W^K, W^V \right) \\&Q, K \in {\mathbb {R}}^{T \times d_k}, \quad V \in {\mathbb {R}}^{T \times d_v} \\&W^Q, W^K \in {\mathbb {R}}^{d_{model} \times d_k}, \quad W^V \in {\mathbb {R}}^{d_{model} \times d_v} \end{aligned}$$
(2)

where \(W^Q, W^K, W^V\) are learnable linear transformation matrices, \(d_k\) is the column dimension of Q and K, and \(d_v\) is the column dimension of V.

Fig. 6 Model architecture of multi-head attention

First, we obtain the attention weights by calculating the dot product between the query matrix Q and the key matrix K; the result is scaled by \(\sqrt{d_k}\) and normalized with softmax. In the step-wise tower, we apply masking before the input is sent to the softmax layer in the attention:

$$\begin{aligned} Attention\left( Q, K, V \right) = softmax\left( \frac{Q K^T}{\sqrt{d_k}} \right) V \end{aligned}$$
(3)

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

$$\begin{aligned}&MultiHead\left( Q, K, V \right) = Concat\left( head_1, \ldots , head_h \right) W^O \\&head_i = Attention\left( Q W_i^Q, K W_i^K, V W_i^V \right) \end{aligned}$$
(4)

where the projections are parameter matrices \(W_i^Q \in {{\mathbb {R}}^{{d_{model}} \times {d_k}}}\), \(W_i^K \in {{\mathbb {R}}^{{d_{model}} \times {d_k}}}\), \(W_i^V \in {{\mathbb {R}}^{{d_{model}} \times {d_v}}}\), and \({W^O} \in {{\mathbb {R}}^{h{d_v} \times {d_{model}}}}\). In this work, we employ h parallel attention heads.
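A minimal PyTorch illustration of Eqs. 3 and 4 for a single head follows; the random matrices stand in for the learned parameters \(W^Q\), \(W^K\), and \(W^V\), and the sizes are illustrative.

```python
# Scaled dot-product attention (Eq. 3) with the optional mask used in
# the step-wise tower.
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)    # (T, T) attention logits
    if mask is not None:
        scores = scores.masked_fill(mask, float('-inf')) # masking before softmax
    return torch.softmax(scores, dim=-1) @ V

T, d_model, d_k = 72, 64, 16
F_sem = torch.randn(T, d_model)                          # embedded input
W_Q, W_K, W_V = (torch.randn(d_model, d_k) for _ in range(3))
head = scaled_dot_product_attention(F_sem @ W_Q, F_sem @ W_K, F_sem @ W_V)
# Eq. 4 concatenates h such heads and projects the result with W^O.
```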

In addition to the attention sub-layer, each encoder layer contains a fully connected feed-forward network, which is applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between:

$$\begin{aligned} FFN\left( x \right) = \max \left( 0, x W_1 + b_1 \right) W_2 + b_2 \end{aligned}$$
(5)

A residual connection is employed around the multi-head attention and the feed-forward network, each followed by layer normalization. That is, the output of each sub-layer is \(LayerNorm \left( {x + Sublayer\left( x \right) } \right) \), where \(Sublayer\left( x \right) \) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the tower, as well as the embedding layers, produce outputs of dimension \(d_{model}\).

The input and output of the FFN have dimensionality \(d_{model}\), and its hidden layer has dimensionality \(d_{hidden1}\).
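Putting these pieces together, one encoder layer can be sketched as below; nn.MultiheadAttention stands in for the multi-head mechanism of Eq. 4, and the dimensions are illustrative assumptions.

```python
# One encoder layer: self-attention and a position-wise FFN (Eq. 5),
# each wrapped as LayerNorm(x + Sublayer(x)).
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=64, nhead=4, d_hidden1=128):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_hidden1), nn.ReLU(),
                                 nn.Linear(d_hidden1, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                       # x: (B, T, d_model)
        a, _ = self.attn(x, x, x)               # multi-head self-attention
        x = self.norm1(x + a)                   # residual + layer norm
        return self.norm2(x + self.ffn(x))      # residual + layer norm

out = EncoderLayer()(torch.randn(2, 72, 64))    # (2, 72, 64)
```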

After the stacked step-wise encoders, we obtain the output feature \(F_{se} \in {\mathbb {R}}^{T \times d_{model}}\).

This feature is passed to the LS (combining linear and softmax layers) network as follows:

$$\begin{aligned} F_{so} = LS\left( F_{se} \right) , \quad F_{so} \in {\mathbb {R}}^{d_{fusion}} \end{aligned}$$
(6)

We obtain the \(d_{fusion}\)-dimensional output feature \(F_{so} \in {\mathbb {R}}^{d_{fusion}}\) from the step-wise tower, which is one of the inputs to the tensor fusion layer.

Channel-wise tower

The channel-wise tower’s role is to discover dependencies among the different distance channels. It is implemented simply by transposing the channel and time axes of the time series before feeding it to the encoder, as shown below. Notably, since the channels of a multivariate time series have no relative or absolute positional order, we did not include positional encoding in the channel-wise tower.
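Concretely, the transpose amounts to a single operation on the input tensor (shapes follow the PD-Walk dimensions; the variable names are illustrative):

```python
import torch

L = torch.randn(8, 72, 78)       # batch of multivariate series: (B, T, CO)
step_input = L                   # step-wise tower attends over the 72 steps
chan_input = L.transpose(1, 2)   # channel-wise tower attends over the 78 channels
```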

Similar to the step-wise tower, we obtain the \(d_{fusion}\)-dimensional output feature \(F_{co} \in {\mathbb {R}}^{d_{fusion}}\) from the channel-wise tower, which is then used as the other input to the tensor fusion layer.

Tensor fusion layer

A multivariate time series has multiple channels, each of which is a univariate time series. We aim to build a fusion layer that disentangles the dynamics of the step-wise tower, the channel-wise tower, and their interaction by modeling each explicitly. Reference [52] introduces a novel model, termed the Tensor Fusion Network, which learns such dynamics end-to-end. We build a 2-D tensor fusion layer to capture the hidden correlation between the step-wise and channel-wise towers. The tensor fusion layer is defined as the following vector field using the 2-fold Cartesian product:

$$\begin{aligned} \left\{ \left( f_{co}, f_{so} \right) \ \Big | \ f_{co} \in \begin{bmatrix} F_{co} \\ 1 \end{bmatrix} ,\ f_{so} \in \begin{bmatrix} F_{so} \\ 1 \end{bmatrix} \right\} \end{aligned}$$
(7)

The step-wise tower produces the step-wise embedding \(F_{so} \in {\mathbb {R}}^{d_{fusion}}\), where \(d_{fusion}\) is the tower’s output dimension; similarly, we obtain the channel-wise embedding \(F_{co} \in {\mathbb {R}}^{d_{fusion}}\). The extra constant dimension with value 1 generates the unimodal and bimodal dynamics. Each neural coordinate \((f_{co}, f_{so})\) can be seen as a 2-D point in the 2-fold Cartesian space defined by the embedding dimensions \([F_{co}, 1]^T\) and \([F_{so}, 1]^T\). \(F_g\) is mathematically equivalent to a differentiable outer product between \([F_{so}, 1]^T\) and \([F_{co}, 1]^T\):

$$\begin{aligned} F_g = \begin{bmatrix} F_{so} \\ 1 \end{bmatrix} \otimes \begin{bmatrix} F_{co} \\ 1 \end{bmatrix} ^T \end{aligned}$$
(8)

In Eq. 8, \(\otimes \) denotes the outer product between vectors. \(F_g \in {\mathbb {R}}^{(d_{fusion}+1)^2}\) represents the 2-D feature matrix containing all possible combinations of features from the two towers. The two subregions \(F_{co}\) and \(F_{so}\) are the embeddings from the two towers, and the subregion \(F_{so} \otimes F_{co}^T\) captures the bimodal interactions in our tensor fusion layer. The tensor fusion layer of our model is illustrated in Fig. 7.
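A short sketch of Eqs. 7 and 8, showing the padding-with-1 trick and the outer product; the batch handling and function name are our own illustration.

```python
# Tensor fusion (Eq. 8): pad each tower embedding with a constant 1 and
# take their outer product, so unimodal features survive alongside the
# bimodal interaction terms.
import torch

def tensor_fusion(F_so: torch.Tensor, F_co: torch.Tensor) -> torch.Tensor:
    one = torch.ones(F_so.size(0), 1)
    f_so = torch.cat([F_so, one], dim=1)            # [F_so; 1]
    f_co = torch.cat([F_co, one], dim=1)            # [F_co; 1]
    return torch.einsum('bi,bj->bij', f_so, f_co)   # (B, d+1, d+1)

F_g = tensor_fusion(torch.rand(4, 32), torch.rand(4, 32))   # (4, 33, 33)
# The last row and column carry F_co and F_so unchanged; the d x d block
# holds their pairwise (bimodal) interactions F_so ⊗ F_co^T.
```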

Fig. 7 Tensor fusion layer

Ultimately, as depicted in Fig. 5, we feed the global features into the classification decoder, which consists of two cascaded feedforward neural networks, LR (combining linear and ReLU layers) and LS (combining linear and Softmax layers), to predict the final classification score. The output of each LR layer has dimensionality \(d_{hidden2}\). The class with the highest score is assigned as the class label.

Table 3 The partition of PD-Walk dataset
Table 4 5-fold cross validation

Experiments and results

Implementation details

We evaluated the performance of our model on the PD-Walk dataset collected by Ref. [26]. The proposed network was implemented in the PyTorch deep learning framework, trained with Adagrad, and used categorical cross-entropy as the loss function.
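A hedged sketch of this training setup follows; the stand-in model, dummy data, batch size, and learning rate are illustrative assumptions rather than the reported configuration.

```python
# Minimal Adagrad + cross-entropy training loop; the simple linear model
# and random tensors below are placeholders for the twin-tower network
# and the PD-Walk distance series.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(40, 72, 78)                       # dummy (B, T, CO) distance series
y = torch.randint(0, 2, (40,))                    # dummy PD / healthy labels
loader = DataLoader(TensorDataset(X, y), batch_size=8, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(72 * 78, 2))  # stand-in classifier
optimizer = torch.optim.Adagrad(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()                 # categorical cross-entropy

for epoch in range(3):                            # a few illustrative epochs
    for batch, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(batch), labels)    # logits vs. true labels
        loss.backward()
        optimizer.step()
```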

Cross-validation and evaluation metrics

Cross-validation is a statistical method for assessing the performance of a predictive model on unseen data. We conducted 5-fold cross-validation on the PD-Walk dataset, ensuring that each sample was used in both training and testing. Following Ref. [26], we divided the data into 5 folds, as shown in Table 3, training the model on 4 subsets (training set) and testing it on the remaining subset (test set). The results of the 5-fold cross-validation are presented in Table 4, with an average identification rate of \(86.8\% \pm 5.0\%\).

In classification problems, the primary performance index is accuracy: the proportion of correctly classified samples out of the total number of samples. For binary classification problems, the commonly used metrics are precision and recall. As these two measures are often at odds, disease diagnosis places more emphasis on recall, the proportion of actual positive cases correctly identified as such. The F1-score, the harmonic mean of precision and recall, measures the overall efficacy of the classifier.

These metrics are defined as follows: Accuracy \(= (TP + TN)/(TP + FN + TN + FP)\), Precision \(= TP/(TP + FP)\), Recall \(= TP/(TP + FN)\), and F1 \(= 2 \times (Precision \times Recall)/(Precision + Recall)\), where TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative samples, respectively. Accuracy, recall, and F1-score were therefore used as evaluation metrics for the classification results.
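These definitions translate directly into code; the confusion-matrix counts in the example are illustrative only, not results from our experiments.

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: 45 PD clips flagged, 6 missed, 40 controls correct,
# 5 false alarms.
print(binary_metrics(tp=45, tn=40, fp=5, fn=6))
```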

Sensitivity analysis

To further verify the effect of parameters on performance, we conducted a sensitivity analysis of the model parameters and hyper-parameters. The results of variations in the model parameters are presented in Table 5.

Table 5 Variations on model parameters

Overall, the sensitivity analysis of the model parameters shows that the proposed model is insensitive to parameter combinations around the benchmark setting, which further demonstrates the robustness of our model. Relatively speaking, the parameters (\(d_{model}\), \(d_{hidden2}\)) have the greatest influence on the results.

Table 6 Variations on hyper-parameters

We present the results of variations in hyper-parameters in Table 6; the learning rate (LR) has a great influence on the results. This is because a suitable learning rate is usually related to the scale of the model, and it is difficult to find the optimal parameter combination if the learning rate is too small or too large. Overall, the sensitivity analysis shows that our model is reasonably stable.

Comparison with state-of-the-art methods

We compare the performance of different methods. From the results in Table 7, it is evident that our model outperforms the other approaches. Reference [26] evaluates three methods on the PD-Walk dataset; as presented in Table 7, the SVM using hand-crafted motion features serves as the baseline, and ST-GCN achieves a better result than the baseline. ADGCN, proposed in Ref. [26], which considers motion information and global connections, reaches 84.1% accuracy. With almost the same configuration as our proposed network, the GTN [49] method achieves 84.5% accuracy. Our model outperforms them all with 86.8% accuracy, 88.4% recall, and an 87.8% F1 score.

Table 7 Overall performance in accuracy, recall, and F1 score

Visualization of the extracted feature

We visualize the result of the embedding layer by outputting the feature vectors from the tensor fusion layer and the linear layer that follows it. We applied t-SNE to reduce the dimension for visualization, as shown in Figs. 8, 9 and 10. Each point in the graph is labeled with a corresponding color code. The labels on the left side of each figure correspond to the labels of the original data, while the labels on the right side correspond to the predicted results. It can be seen from Fig. 8 that before tensor fusion, all the points are clustered together without a clear division. After tensor fusion, the overall distribution of the data follows certain rules, and a clear division emerges between the labels of the predicted results, which is especially evident after the linear layer. These figures demonstrate that our model can project the data into a space where it is easily separable, for improved classification results.

Fig. 8 Visualization of the t-SNE result of the embedding layer output

Fig. 9 Visualization of the features extracted after the tensor fusion layer

Fig. 10 Visualization of the features extracted after the linear layer

Visualization and analysis of the attention map

The attention matrices indicate the correlations between channels and between time steps, respectively. We selected one sample to visualize the channel-wise attention map and the step-wise attention map.

Fig. 11 Visualization of the channel-wise attention map

Fig. 12 Visualization of the step-wise attention map

Fig. 13 Drawing of the raw 78 time series

Fig. 14 (1) Channel-wise attention map (left); (2) channel-wise DTW (right)

Fig. 15 Drawing of the raw 9 time series

The visualization of the channel-wise attention map for a sample is presented in Fig. 11. The intensity of the colors for channels 69 and 72 at epoch 300 is greater than at epoch 25, while the colors for channels 60 and 65 at epoch 300 are less intense than at epoch 25. As the number of epochs increases, the selectivity for salient features is enhanced, and the response values of the attention coefficients in salient areas become larger.

The visualization of the step-wise attention map is depicted in Fig. 12. Similar to the channel-wise attention map, the selectivity for salient features is enhanced, and the response values of the attention coefficients in salient areas become larger and sparser as the number of epochs increases. Additionally, we plot the raw time series of all channels in Fig. 13. We observed that the darker areas correspond to phases of the gait cycle, as shown in Fig. 13.

Furthermore, for the channel-wise attention map, we calculated the dynamic time warping (DTW) distance between the time series of different channels, as illustrated in Fig. 14. The attention map appears broadly consistent with dynamic time warping. Our analysis concentrates on the first row of the channel-wise attention map, which represents the attention scores obtained when treating channel 0 as a query. The attention scores for channels 60, 65, 69, and 72 are larger than those for the others. We plotted the raw time series of 9 channels in Fig. 15 and found that channels 60, 65, 69, and 72, which have higher attention scores, fluctuated more than channels 1, 2, 3, and 4, which have lower attention scores, when channel 0 was used as a query.

For each time step, we also calculated the Euclidean distance between channels, as shown in Fig. 16; since a single time step involves no time axis, DTW is not required. Similarly, our analysis focuses on the first row of the step-wise attention map, which represents the attention scores obtained when treating step 0 as a query. In Fig. 15, channel 72 shows a downward trend at step 0, consistent with the trend observed at steps 40–48, where the time steps have higher attention scores. Later steps, when used as queries, show some correlation with the preceding steps and less with subsequent steps, which aligns with the actual situation: the gait sequence exhibits a certain periodicity, and the initial state reflects the stage of the gait.

Analysis on different channel pairs

We conducted experiments to explore the role of different channel pairs in recognizing Parkinson’s disease. Joints 11–17, which relate to lower limb gait, form the distance channels numbered 58–78, yielding 210 channel pairs. We studied each channel pair separately to determine the best pair for our network’s predictions. The results are displayed in Table 8, where only the first ten and last ten pairs are retained. The top channel pair achieved an accuracy of 86.11%; its visualization is shown in Fig. 17, and it involves a combination of right lower limb joints. Similarly, the visualization of the least effective pair is shown in Fig. 18; the combination of channels 63 and 68 corresponds to the upper body structure, which is more stable.

Additionally, we recorded and visualized the frequency of different joint combinations in the top 10 channel pairs, as shown in Table 9 and Fig. 19. The combination of joints 11 and 15 occurred most frequently, aligning with the findings in Fig. 17. We also documented the frequency of combinations in the bottom 10 channel pairs, as shown in Table 10 and Fig. 20; the combination of joints 12 and 17, part of the upper body structure, had the highest frequency, corroborating the results in Fig. 18. These findings suggest that lower limb movement is crucial for Parkinson’s disease identification.

Fig. 16 (1) Step-wise attention map (left); (2) step-wise L2 distance (right)

Table 8 Overall performance on different channel pairs

Furthermore, the relevant medical literature [50] reports that the limb swing stride of Parkinson’s disease patients is asymmetrical. To study the asymmetry of lower limb swing between healthy individuals and patients, and to validate our preliminary observations of abnormal lower limb movement in Parkinson’s disease, we examined the significance of joint combinations related to limb structure in both groups. The significance test results are presented in Table 11. The distance distribution of the left and right upper arms in Parkinson’s patients showed no significant difference (\(p=0.661>0.05\)), consistent with healthy individuals. In contrast, the distance distribution of the left and right lower arms exhibited a significant difference (\(p=0.03<0.05\)), signifying asymmetrical movement in Parkinson’s patients compared with healthy individuals. The corresponding distance distributions of the lower limbs in Parkinson’s disease were not significantly different (\(p>0.05\)); however, comparing p-values between healthy individuals and patients, we noted the largest discrepancy in the lower limbs, supporting our conclusion that movement differences in the lower extremities of Parkinson’s patients are more pronounced than in healthy individuals.

Therefore, based on these facts, we can conclude that the gait symptoms of Parkinson’s disease are primarily manifested through lower limb movement, particularly in the right leg.

Application of the model

To apply the model in practice, the skeleton joints obtained from any camera video only need an affine transformation to serve as input to our model. An affine transformation combines a linear transformation (multiplication by a matrix) with a translation (addition of a vector), mapping one vector space to another; it represents the mapping between two images [51]. The affine transformation matrix M in Eq. 9 is a \(2\times 3\) matrix: matrix B represents the translation, while the diagonal elements of matrix A determine the scaling and the anti-diagonal elements determine the rotation.

$$\begin{aligned} A = \begin{bmatrix} a_{00} & a_{01} \\ a_{10} & a_{11} \end{bmatrix} , \quad B = \begin{bmatrix} b_{00} \\ b_{10} \end{bmatrix} , \quad M = \left[ A, B \right] = \begin{bmatrix} a_{00} & a_{01} & b_{00} \\ a_{10} & a_{11} & b_{10} \end{bmatrix} \end{aligned}$$
(9)

An original pixel coordinate \((x, y)\) becomes \((u, v)\) after the affine transformation. The transformation formula is as follows:

$$\begin{aligned} \begin{bmatrix} u \\ v \end{bmatrix} = A \cdot \begin{bmatrix} x \\ y \end{bmatrix} + B \end{aligned}$$
(10)

The relationship between the pixels of different images can thus be obtained by affine transformation. In addition, the skeletons used in our model come from videos of subjects walking back and forth facing the camera, so the anti-diagonal elements of A equal 0 whenever the subject walks back and forth towards any camera; moreover, since the distances between skeleton joints are used as the model inputs, the elements of matrix B do not affect the result. Only the diagonal elements therefore matter, and they represent the pixel-scale relationship between the two images. Besides, we can divide the skeleton tensor into several fragments and send each to the network; each fragment is examined, and the fragment results are evaluated comprehensively. This shows that our model can be widely applied.
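A small numerical check of this argument, with illustrative values: under a pure scaling A (zero anti-diagonal, as for a subject walking toward the camera), the translation B cancels out of every pairwise joint distance, and the residual global scale is removed by the per-channel standardization applied to the distance series.

```python
# Illustrative check that joint distances are unaffected by translation B
# and only rescaled by a diagonal (pure-scaling) matrix A; the numbers
# here are arbitrary stand-ins, not values from the paper.
import numpy as np

joints = np.random.rand(13, 2)              # one frame: 13 joints, (x, y)
A = np.diag([1.3, 1.3])                     # pure scaling, zero anti-diagonal
B = np.array([5.0, -2.0])                   # translation
warped = joints @ A.T + B                   # Eq. 10 applied to every joint

d = np.linalg.norm(joints[0] - joints[1])   # joint distance before transform
d_w = np.linalg.norm(warped[0] - warped[1]) # joint distance after transform
print(np.isclose(d_w, 1.3 * d))             # True: B cancels, only scale remains
```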

Fig. 17 Visualization of the channel pair with the highest accuracy

Fig. 18 Visualization of the channel pair with the lowest accuracy

Table 9 Frequency of different combinations in the top 10 channel pairs
Table 10 Frequency of different combinations in the bottom 10 channel pairs
Fig. 19 Visualization of the frequency of different combinations in the top 10 channel pairs

Fig. 20 Visualization of the frequency of different combinations in the bottom 10 channel pairs

Table 11 The results of significance test

Conclusions

In this paper, to realize the automated quantitative assessment of gait motor disorder in PD patients from gait videos, we developed a twin-tower Transformer network with tensor fusion that detects early Parkinson’s disease from the skeleton sequences of patients and healthy controls. The task was transformed into a multivariate time series classification problem by computing joint distances. Specifically, the spatial distances and temporal dynamics of the joints were modeled by the Transformer encoders, and the tensor fusion layer was used to uncover the influence of step features, channel features, and their fusion features. We conducted comprehensive experiments on the PD-Walk dataset, and the preliminary results indicate that our network achieves state-of-the-art performance with 86.8% accuracy. Additionally, we performed visual analyses to enhance the interpretability of our model. Our experiments on different channel pairs and the significance tests lead us to conclude that the gait symptoms of Parkinson’s disease are predominantly characterized by lower limb movement, especially in the right leg. Furthermore, our research extends the modeling of skeletons and the detection of Parkinson’s disease with Transformer networks.