Introduction

Parkinson’s disease (PD) was first characterized by the British physician James Parkinson in 1817. It is a degenerative disorder of the nervous system marked by a range of motor and non-motor symptoms [1]. The main cause of PD is the loss of dopaminergic neurons in the substantia nigra. However, early-stage PD patients may not exhibit noticeable clinical symptoms; these become evident only after substantial neuronal loss, typically exceeding 60%. Since there is currently no cure for PD, early detection that enables timely intervention is critical.

The disease affects patients’ motor and non-motor systems, significantly impacting their daily lives. Resting tremor, bradykinesia, rigidity, and postural and gait instability are the four cardinal motor symptoms of PD [2]. Postural and gait problems can include freezing of gait, balance instability, falls, shuffling steps, and difficulty turning. Bradykinesia and freezing of gait typically occur in the middle and late stages of PD, while small steps, unsteadiness, and falls are more prevalent in the later stages [2].

Currently, early PD detection depends primarily on the observation of clinical symptoms. Treatment varies according to symptoms, and doctors often base their approach on the Unified Parkinson’s Disease Rating Scale (UPDRS) and the Hoehn and Yahr scale, which evaluate movement, activities of daily living, disease progression, post-treatment status, and treatment side effects and complications. However, such diagnosis depends largely on the clinician’s experience and can therefore be subjective and inaccurate.

To summarize, clinical practice using these scales faces two significant problems: 1) patients may miss early intervention opportunities because treatment starts late, and clinical symptoms can be confounded by other diseases that cause abnormal gait; 2) the diagnosis of PD relies too heavily on subjective scale scores and lacks objective indicators. An efficient and convenient way is therefore needed for doctors to diagnose PD and treat it according to its severity.

Parkinson’s disease patients exhibit distinctive gait characteristics; therefore, the automated quantification of gait motor disorders is crucial for the automated assessment of motor functions in PD patients. It is feasible to use sensors to capture patients’ movement information and to utilize sensor signals as an auxiliary means of diagnosing and treating Parkinson’s disease. With the advancement of deep learning, artificial intelligence has offered new opportunities for the early detection of Parkinson’s disease, and researchers have proposed numerous methods for the automatic prediction and evaluation of the disease’s pathological stage using machine learning. We categorize PD detection methods based on whether the sensors are in direct contact with the subject. Contact sensors include plantar pressure sensors, sensor-equipped pens, and wearable motion capture devices, among others. Non-contact methods primarily rely on gait analysis through video.

PhysioNet hosts a widely used dataset that collects gait signals via 16 pressure sensors, eight placed under each foot. Many researchers [3,4,5,6,7,8] have studied this dataset with promising results. For gait feature detection, Refs. [9,10,11,12] employed feature extraction methods: Refs. [9, 12] established a predictive pattern distinguishing normal gait from the gait of Parkinson’s disease patients based on timing signal detection, while Refs. [10, 11] derived a pattern based on signal frequency.

Regarding sensor-equipped pens, there are four Parkinson’s handwriting datasets: the PaHaW dataset [13], the HandPD dataset [14], the NewHandPD dataset [15], and the Parkinson’s Drawing Dataset [16]. These datasets record the process of participants writing patterns or words by hand with sensor-equipped pens, and many researchers [16, 17] have conducted studies on them. A wearable motion capture device-based method is proposed in [18], which captures human motion data using pressure or acceleration sensors and performs statistical analysis to extract motion features. Some researchers [19,20,21] have even developed wearable systems to detect PD symptoms. However, most such sensors must be in contact with the patient’s body and assembled in clinics with the help of trained experts, which inevitably affects the assigned movements of PD patients. Moreover, such rich and accurate data depend heavily on laboratory-based experiments, which complicates routine assessment. The practical application of sensor-based schemes in clinical settings has therefore been limited.

Since PD affects many parts of the body and causes a wide range of motion problems, an accurate assessment of PD severity relies on a comprehensive analysis of human motion features. With the rapid development of computer vision, non-contact methods based on gait video have been proposed to detect PD. The advantage of video-based diagnosis is that patients do not need to wear any sensors.

Reference [22] presents evidence for the relationship between human gait and PD and illustrates the role of different gait analysis systems based on vision or wearable sensors. To the best of our knowledge, Refs. [23,24,25,26,27,28,29] have proposed vision-based methods for the automated assessment of gait motor disorder. Reference [23] develops a PD gait regression model capable of predicting the severity of motor dysfunction from gait image sequences. Reference [24] uses a multivariate ordinal logistic regression method and assesses the relative contribution of each gait feature to the regression of UPDRS-gait and SAS-gait scores. Reference [25] proposes a novel two-stream spatial-temporal attention graph convolutional network for video assessment of PD gait motor disorder. However, these methods merely adopt traditional machine learning or adapt action recognition methods, and they lack a spatiotemporal analysis of gait data. Reference [26] introduced the first Parkinson’s disease gait dataset and proposed an end-to-end method for the early diagnosis of Parkinson’s disease based on a graph convolutional network; this approach takes the patient’s skeleton sequences as input and returns the diagnostic results.

We address the limitations of vision-based methods by establishing a quantification technique for the automated assessment of gait motor disorder in PD patients using video-based computer vision technology. Specifically, we extract joint sequences from videos using a human pose estimation algorithm and convert the skeleton sequence into a multivariate time series by calculating the distances between joints. Traditional skeleton joint recognition models take the original skeleton tensor as input to explore the relationships between joints. In contrast, we use the distances between skeleton joints as input, inspired by the simple mathematical principle that two triangles with three equal sides are congruent. The human skeleton is composed of joints, and the distances between these joints reflect the skeleton’s shape to some extent. This approach allows us to disregard the camera’s position relative to the human skeleton and focus on the changes within the skeleton itself.

However, quantitative assessment based on multivariate time series of distances faces three significant challenges: 1) current classification algorithms for multivariate time series tend to model single-channel data and combine multi-channel results with simple linear weighting, neglecting the interactions between channels; 2) the extraction of channel and step features from multivariate time series is crucial for classification; and 3) traditional fusion methods typically employ linear fusion or set gating, so the integration of step and channel features in multivariate time series remains a critical issue.

Some recent neural network methods can be applied to multivariate time series classification tasks. The Transformer network [30] was originally designed for machine translation and has a strong ability to encode time series data. References [31,32,33] focus on the global existence–uniqueness and input-to-state stability of the mild solution of impulsive reaction–diffusion neural networks with infinite distributed delays, which can be used in time–space sampled-data schemes. These methods offer new ideas for handling multivariate time series tasks.

In this paper, we use skeleton sequences extracted from video to analyze patients’ movement symptoms, thereby detecting Parkinson’s disease. We first introduce the Transformer network into Parkinson’s disease recognition and propose a Transformer network with a tensor fusion layer for the classification of multivariate time series. To explore the spatial and temporal relationships in the series, we leverage the multi-head attention mechanism of the Transformer encoder and construct two separate towers. A tensor fusion layer is constructed to model and fuse the spatial, temporal, and cross-spacetime features of gait. We deploy the model on the dataset collected by Ref. [26] and analyze the step and channel features of gait with attention maps. This algorithm excels at distinguishing between healthy individuals and those with PD and can be used in clinical practice to aid doctors in diagnosing Parkinson’s disease. Consequently, it facilitates early intervention for patients and reduces diagnosis and treatment costs. The method we propose therefore holds substantial practical value.

To summarize, our main contributions are as follows:

  • We focus on utilizing skeleton sequence information to determine whether subjects have Parkinson’s disease. To address the problems mentioned above, we first transform the coordinate sequences of the skeleton into multivariate time series of joint distances. We then apply a multivariate time series classification algorithm to address the issue, providing a novel approach for classification based on skeleton sequences.

  • We developed a Transformer network to fulfill the task of multivariate time series classification. Current classification algorithms for multivariate time series tend to model single-channel data and combine multi-channel results with simple linear weighting, paying little attention to the interaction between channels. To explore the channel and step relationships of the joint-distance series, we make full use of the multi-head attention mechanism of the Transformer encoder and build two towers: the channel-wise tower exploits the attention mechanism to explore the relationships between channels, while the step-wise tower uses an attention mechanism with positional encoding to learn step features of the distances.

  • Unlike conventional methods that rely on linear fusion or set gating, we innovatively employ tensor fusion. This fusion method’s advantage lies in its ability to comprehensively uncover the influence of step features, channel features, and their fusion features.

The rest of this paper is organized as follows: “Related work” section reviews related existing work. “Methods” section introduces the proposed method, and “Experiments and results” section presents the experimental system, results, as well as qualitative and quantitative analyses of the findings. The application of the model is discussed in “Application of the model” section. Finally, the study concludes in “Conclusions” section.

Related work

Skeleton-based action recognition for PD detection

In recent years, Parkinson’s disease recognition methods have received increasing attention owing to the important clinical value of early diagnosis. There are many algorithms for identifying Parkinson’s disease using plantar pressure [13,14,15,16,17,18], but few based on skeletons. With the development of human motion analysis methods, identifying Parkinson’s disease from skeletons has become feasible. High-precision human pose estimation algorithms, such as OpenPose [34] and AlphaPose [36], were used to extract human skeletons from each video clip. The dataset ultimately encompasses the coordinates of 17 skeleton joints obtained by HRNet [36] before and after video cutting. The extracted human skeleton is illustrated in Fig. 2; joint No. 17 is the midpoint between joints No. 5 and No. 6.

Fig. 2 Human skeleton extracted from the video

Preprocessing is a critical step in pattern recognition and machine learning. Following Ref. [26], the joints representing the head were removed to prevent overfitting. After completing these operations, we obtain a skeleton tensor \(M \in {\mathbb {R}}^{T \times V \times C}\) from each video clip, where T represents the number of video frames, V the number of joints, and C the location coordinate dimension. Specifically, we have \(T=72\), \(V=13\), and \(C=2\) (i.e., \(x\) and \(y\)). The visualization of a data sample is shown in Fig. 3.

Fig. 3 A visualization of a sample in the PD-Walk dataset

Table 2 Joint combinations

After completing the data preprocessing, we transform the coordinate sequences of the skeleton into a multivariate time series by calculating the Euclidean distance between every pair of joints. The list of joint combinations is displayed in Table 2. We then obtain \(CO=C_{13}^2=78\) series based on Eq. 1:

$$\begin{aligned}&P_{t,v} = \left( M_{t,v,c_1}, M_{t,v,c_2} \right) , \quad t \in [1,T],\ v \in [1,V],\ M_{t,v,c} \in M \\&co = Combination(v_i, v_j), \quad v_i \in [1,V],\ v_j \in [1,V] \setminus v_i,\ v_i < v_j,\ co \in [1,CO] \\&L_{t,co} = \left| P_{t,v_i} P_{t,v_j} \right| = \sqrt{ \left( M_{t,v_i,c_1} - M_{t,v_j,c_1} \right) ^2 + \left( M_{t,v_i,c_2} - M_{t,v_j,c_2} \right) ^2 } \end{aligned}$$
(1)

\(P_{t,v}\) represents the coordinates of skeleton joint v at frame t, and co indexes the joint combinations. According to Eq. 1, we can derive the multivariate time series matrix \(L \in {\mathbb {R}}^{T \times CO}\), which serves as the input for our network.
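For illustration, the following minimal NumPy sketch implements Eq. 1 under the dimensions above (function and variable names are ours, not part of the original pipeline); the final line applies the per-channel standardization mentioned below.

```python
# Convert a skeleton tensor M of shape (T, V, C) = (72, 13, 2) into the
# multivariate time series L of shape (T, CO) with CO = C(13, 2) = 78.
import numpy as np
from itertools import combinations

def joint_distance_series(M: np.ndarray) -> np.ndarray:
    T, V, C = M.shape
    pairs = list(combinations(range(V), 2))            # all joint pairs (v_i < v_j)
    L = np.empty((T, len(pairs)))
    for co, (vi, vj) in enumerate(pairs):
        # Euclidean distance between joints vi and vj at every frame t
        L[:, co] = np.linalg.norm(M[:, vi, :] - M[:, vj, :], axis=-1)
    return L

M = np.random.rand(72, 13, 2)                          # dummy skeleton clip
L = joint_distance_series(M)                           # (72, 78)
L = (L - L.mean(axis=0)) / (L.std(axis=0) + 1e-8)      # per-channel standardization
```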

The visualization of the standardized multivariate time series is shown in Fig. 4. From the figure, it is evident that the multivariate time series exhibits a certain periodicity. Our twin-tower Transformer network is specifically designed for such multivariate time series. We standardize the data before feeding it into the network, whose structure is described in the “Twin-tower transformer network” section.

Fig. 4 Multivariate time series of a sample in the PD-Walk dataset

Fig. 5 Model architecture of the twin-tower transformer network

Twin-tower transformer network

Our twin-tower Transformer network design builds upon improvements from various domains and tasks. We have developed a framework that leverages the inherent sequential invariance of Transformers and their ability to learn features through attention mechanisms. The traditional Transformer was originally designed for machine translation tasks and features an encoder–decoder structure. Reference [49] investigated a simple extension of Transformer networks with gating, named Gated Transformer Networks (GTN), for the multivariate time series classification problem, introducing three extensions: embedding, two towers, and gating. Following their research, we construct two towers to learn the relationships within and between channels. Unlike GTN, we add a linear layer after each tower to capture each tower’s features separately, and we introduce a tensor fusion layer that enables the model to fully capture the relationships between the towers. Our model architecture is depicted in Fig. 5.

As shown in Fig. 5, the overall architecture of our model consists of a step-wise tower, a channel-wise tower, a tensor fusion layer, and two cascaded feedforward neural networks, LR (combining linear and ReLU layers) and LS (combining linear and Softmax layers). Each tower comprises an input embedding module, stacked encoders, and an LS layer; in addition, positional encoding in the step-wise tower indicates temporal order. Following the overall architecture of [30], each encoder has a multi-head self-attention mechanism and a simple, position-wise fully connected feed-forward network, with a residual connection around each of the two sub-layers, followed by layer normalization.
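The following condensed PyTorch sketch illustrates this overall architecture. It is our illustrative reconstruction from Fig. 5, not the authors’ released code: layer sizes are assumptions, the step-wise attention mask is omitted, and a learned positional parameter stands in for the positional encoding of [30].

```python
import torch
import torch.nn as nn

class Tower(nn.Module):
    def __init__(self, in_dim, seq_len, d_model=64, nhead=4,
                 num_layers=2, d_fusion=32, use_pos_enc=True):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)            # input embedding
        self.pos = nn.Parameter(torch.zeros(seq_len, d_model)) if use_pos_enc else None
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.ls = nn.Sequential(nn.Linear(seq_len * d_model, d_fusion),
                                nn.Softmax(dim=-1))        # LS layer

    def forward(self, x):                                  # x: (B, seq_len, in_dim)
        h = self.embed(x)
        if self.pos is not None:                           # step-wise tower only
            h = h + self.pos
        return self.ls(self.encoder(h).flatten(1))         # (B, d_fusion)

class TwinTowerTransformer(nn.Module):
    def __init__(self, T=72, CO=78, d_fusion=32, d_hidden2=64, n_classes=2):
        super().__init__()
        self.step_tower = Tower(in_dim=CO, seq_len=T, d_fusion=d_fusion)
        self.chan_tower = Tower(in_dim=T, seq_len=CO, d_fusion=d_fusion,
                                use_pos_enc=False)         # no positional encoding
        self.decoder = nn.Sequential(                      # LR then LS (Fig. 5)
            nn.Linear((d_fusion + 1) ** 2, d_hidden2), nn.ReLU(),
            nn.Linear(d_hidden2, n_classes), nn.Softmax(dim=-1))

    def forward(self, L):                                  # L: (B, T, CO)
        f_so = self.step_tower(L)
        f_co = self.chan_tower(L.transpose(1, 2))          # swap time/channel axes
        one = torch.ones(L.size(0), 1, device=L.device)
        F_g = torch.einsum('bi,bj->bij',                   # outer product, Eq. 8
                           torch.cat([f_so, one], dim=1),
                           torch.cat([f_co, one], dim=1))
        return self.decoder(F_g.flatten(1))

scores = TwinTowerTransformer()(torch.randn(4, 72, 78))    # (4, 2) class scores
```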

Step-wise tower

As shown in Fig. 5, our step-wise tower consists of an input embedding module, positional encoding, stacked encoders, and an LS (combining linear and Softmax layers) layer.

The input embedding module aims to place step-wise inputs closer together in the embedding space. Specifically, a neural network consisting of a fully connected layer embeds the step-wise input \(L \in {\mathbb {R}}^{T \times CO}\) into a \(d_{model}\)-dimensional space, producing \(F_{sem} \in {\mathbb {R}}^{T \times d_{model}}\).

The traditional Transformer uses a positional encoding module to represent word order in natural language, distinguishing occurrences of the same word at different positions and reflecting the positional relationships between words. In our framework, we use a positional encoding module to distinguish the steps of the time series.
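Assuming the module follows the standard sinusoidal formulation of [30] (the paper does not spell out the variant), a sketch looks as follows; the sizes T=72 and d_model=64 are illustrative.

```python
import torch

def positional_encoding(T: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(T, dtype=torch.float32).unsqueeze(1)      # (T, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)         # even indices
    angle = pos / torch.pow(10000.0, i / d_model)                # (T, d_model/2)
    pe = torch.zeros(T, d_model)
    pe[:, 0::2] = torch.sin(angle)                               # even dims: sine
    pe[:, 1::2] = torch.cos(angle)                               # odd dims: cosine
    return pe

pe = positional_encoding(72, 64)   # added to the embedded step-wise input F_sem
```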

The encoders in the step-wise tower explicitly capture the step-wise correlation of the distance sequences through attention and masking, as shown in Fig. 5. To encode the temporal features of the distance sequences, we use self-attention with masking to attend to each point across all distance channels, calculating pair-wise attention weights among all time steps. In the multi-head self-attention layers, scaled dot-product attention forms the attention matrix over all time steps, so temporal dependencies are first retrieved by the encoders in the step-wise tower. The LS layer after the stacked encoders performs dimension reduction; from it we obtain the output feature of the step-wise tower, which serves as one of the inputs to the tensor fusion layer.

Our encoder follows the overall architecture of the encoder in the traditional Transformer. As shown in Fig. 6, we use a self-attention mechanism. Self-attention, described in the original Transformer [30], calculates semantic correlations between different items in a data series. Following the notation of [30], let Q, K, and V be the query, key, and value matrices generated by linear transformations of the input features \(F_{sem} \in {\mathbb {R}}^{T \times d_{model}}\), as follows:

$$\begin{aligned}&\left( Q, K, V \right) = F_{sem} \cdot \left( W^Q, W^K, W^V \right) \\&Q, K \in {\mathbb {R}}^{T \times d_k}, \quad V \in {\mathbb {R}}^{T \times d_v} \\&W^Q, W^K \in {\mathbb {R}}^{d_{model} \times d_k}, \quad W^V \in {\mathbb {R}}^{d_{model} \times d_v} \end{aligned}$$
(2)

where \(W^Q, W^K, W^V\) are learnable linear transformation matrices, \(d_k\) is the column dimension of Q and K, and \(d_v\) is the column dimension of V.

Fig. 6 Model architecture of multi-head attention

First, we obtain the attention weights by calculating the dot product between the query matrix Q and the key matrix K; the result is scaled by \(\sqrt{d_k}\) and normalized with softmax. In the step-wise tower, we apply masking before the input is sent to the softmax layer in the attention:

$$\begin{aligned} Attention\left( Q, K, V \right) = softmax\left( \frac{Q K^T}{\sqrt{d_k}} \right) V \end{aligned}$$
(3)

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

$$\begin{aligned}&MultiHead\left( Q, K, V \right) = Concat\left( head_1, \ldots , head_h \right) W^O \\&head_i = Attention\left( Q W_i^Q, K W_i^K, V W_i^V \right) \end{aligned}$$
(4)

where the projections are parameter matrices \(W_i^Q \in {{\mathbb {R}}^{{d_{model}} \times {d_k}}}\), \(W_i^K \in {{\mathbb {R}}^{{d_{model}} \times {d_k}}}\), \(W_i^V \in {{\mathbb {R}}^{{d_{model}} \times {d_v}}}\), and \({W^O} \in {{\mathbb {R}}^{h{d_v} \times {d_{model}}}}\). In this work, we employ h parallel attention heads.
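A minimal PyTorch illustration of Eqs. 3 and 4 for a single head follows; the random matrices stand in for the learned parameters \(W^Q\), \(W^K\), and \(W^V\), and the sizes are illustrative.

```python
# Scaled dot-product attention (Eq. 3) with the optional mask used in
# the step-wise tower.
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)    # (T, T) attention logits
    if mask is not None:
        scores = scores.masked_fill(mask, float('-inf')) # masking before softmax
    return torch.softmax(scores, dim=-1) @ V

T, d_model, d_k = 72, 64, 16
F_sem = torch.randn(T, d_model)                          # embedded input
W_Q, W_K, W_V = (torch.randn(d_model, d_k) for _ in range(3))
head = scaled_dot_product_attention(F_sem @ W_Q, F_sem @ W_K, F_sem @ W_V)
# Eq. 4 concatenates h such heads and projects the result with W^O.
```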

In addition to the attention sub-layer, each encoder layer contains a fully connected feed-forward network, which is applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between:

$$\begin{aligned} FFN\left( x \right) = \max \left( 0, x W_1 + b_1 \right) W_2 + b_2 \end{aligned}$$
(5)

A residual connection is employed around the multi-head attention and the feed-forward network, each followed by layer normalization. That is, the output of each sub-layer is \(LayerNorm \left( {x + Sublayer\left( x \right) } \right) \), where \(Sublayer\left( x \right) \) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the tower, as well as the embedding layers, produce outputs of dimension \(d_{model}\).

The input and output of the FFN have dimensionality \(d_{model}\), and its hidden layer has dimensionality \(d_{hidden1}\).
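Putting these pieces together, one encoder layer can be sketched as below; nn.MultiheadAttention stands in for the multi-head mechanism of Eq. 4, and the dimensions are illustrative assumptions.

```python
# One encoder layer: self-attention and a position-wise FFN (Eq. 5),
# each wrapped as LayerNorm(x + Sublayer(x)).
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=64, nhead=4, d_hidden1=128):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_hidden1), nn.ReLU(),
                                 nn.Linear(d_hidden1, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                       # x: (B, T, d_model)
        a, _ = self.attn(x, x, x)               # multi-head self-attention
        x = self.norm1(x + a)                   # residual + layer norm
        return self.norm2(x + self.ffn(x))      # residual + layer norm

out = EncoderLayer()(torch.randn(2, 72, 64))    # (2, 72, 64)
```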

After the stacked step-wise encoders, we obtain the output feature \(F_{se} \in {\mathbb {R}}^{T \times d_{model}}\).

This feature is passed to the LS (combining linear and softmax layers) network as follows:

$$\begin{aligned} F_{so} = LS\left( F_{se} \right) , \quad F_{so} \in {\mathbb {R}}^{d_{fusion}} \end{aligned}$$
(6)

We obtain the \(d_{fusion}\)-dimensional output feature \(F_{so} \in {\mathbb {R}}^{d_{fusion}}\) from the step-wise tower, which is one of the inputs to the tensor fusion layer.

Channel-wise tower

The channel-wise tower’s role is to discover dependencies among the different distance channels. It is implemented simply by transposing the channel and time axes of the time series before feeding it to the encoder, as shown below. Notably, since the channels of a multivariate time series have no relative or absolute positional order, we did not include positional encoding in the channel-wise tower.
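Concretely, the transpose amounts to a single operation on the input tensor (shapes follow the PD-Walk dimensions; the variable names are illustrative):

```python
import torch

L = torch.randn(8, 72, 78)       # batch of multivariate series: (B, T, CO)
step_input = L                   # step-wise tower attends over the 72 steps
chan_input = L.transpose(1, 2)   # channel-wise tower attends over the 78 channels
```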

Similar to the step-wise tower, we obtain the \(d_{fusion}\)-dimensional output feature \(F_{co} \in {\mathbb {R}}^{d_{fusion}}\) from the channel-wise tower, which is then used as the other input to the tensor fusion layer.

Tensor fusion layer

A multivariate time series has multiple channels, each of which is a univariate time series. We aim to build a fusion layer that disentangles the dynamics of the step-wise tower, the channel-wise tower, and their interaction by modeling each explicitly. Reference [52] introduces a novel model, termed the Tensor Fusion Network, which learns such dynamics end-to-end. We build a 2-D tensor fusion layer to capture the hidden correlation between the step-wise and channel-wise towers. The tensor fusion layer is defined as the following vector field using the 2-fold Cartesian product:

$$\begin{aligned} \left\{ \left( f_{co}, f_{so} \right) \ \Big | \ f_{co} \in \begin{bmatrix} F_{co} \\ 1 \end{bmatrix} ,\ f_{so} \in \begin{bmatrix} F_{so} \\ 1 \end{bmatrix} \right\} \end{aligned}$$
(7)

The step-wise tower produces the step-wise embedding \(F_{so} \in {\mathbb {R}}^{d_{fusion}}\), where \(d_{fusion}\) is the tower’s output dimension; similarly, we obtain the channel-wise embedding \(F_{co} \in {\mathbb {R}}^{d_{fusion}}\). The extra constant dimension with value 1 generates the unimodal and bimodal dynamics. Each neural coordinate \((f_{co}, f_{so})\) can be seen as a 2-D point in the 2-fold Cartesian space defined by the embedding dimensions \([F_{co}, 1]^T\) and \([F_{so}, 1]^T\). \(F_g\) is mathematically equivalent to a differentiable outer product between \([F_{so}, 1]^T\) and \([F_{co}, 1]^T\):

$$\begin{aligned} F_g = \begin{bmatrix} F_{so} \\ 1 \end{bmatrix} \otimes \begin{bmatrix} F_{co} \\ 1 \end{bmatrix} ^T \end{aligned}$$
(8)

In Eq. 8, \(\otimes \) denotes the outer product between vectors. \(F_g \in {\mathbb {R}}^{(d_{fusion}+1)^2}\) represents the 2-D feature matrix containing all possible combinations of features from the two towers. The two subregions \(F_{co}\) and \(F_{so}\) are the embeddings from the two towers, and the subregion \(F_{so} \otimes F_{co}^T\) captures the bimodal interactions in our tensor fusion layer. The tensor fusion layer of our model is illustrated in Fig. 7.
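A short sketch of Eqs. 7 and 8, showing the padding-with-1 trick and the outer product; the batch handling and function name are our own illustration.

```python
# Tensor fusion (Eq. 8): pad each tower embedding with a constant 1 and
# take their outer product, so unimodal features survive alongside the
# bimodal interaction terms.
import torch

def tensor_fusion(F_so: torch.Tensor, F_co: torch.Tensor) -> torch.Tensor:
    one = torch.ones(F_so.size(0), 1)
    f_so = torch.cat([F_so, one], dim=1)            # [F_so; 1]
    f_co = torch.cat([F_co, one], dim=1)            # [F_co; 1]
    return torch.einsum('bi,bj->bij', f_so, f_co)   # (B, d+1, d+1)

F_g = tensor_fusion(torch.rand(4, 32), torch.rand(4, 32))   # (4, 33, 33)
# The last row and column carry F_co and F_so unchanged; the d x d block
# holds their pairwise (bimodal) interactions F_so ⊗ F_co^T.
```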

Fig. 7 Tensor fusion layer

Ultimately, as depicted in Fig. 5, we feed the global features into the classification decoder, which consists of two cascaded feedforward neural networks, LR (combining linear and ReLU layers) and LS (combining linear and Softmax layers), to predict the final classification score. The output of each LR layer has dimensionality \(d_{hidden2}\). The class with the highest score is assigned as the class label.

Table 3 The partition of PD-Walk dataset
Table 4 5-fold cross validation

Experiments and results

Implementation details

We evaluated the performance of our model on the PD-Walk dataset collected by Ref. [26]. The proposed network was implemented in the PyTorch deep learning framework, trained with Adagrad, and used categorical cross-entropy as the loss function.
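A hedged sketch of this training setup follows; the stand-in model, dummy data, batch size, and learning rate are illustrative assumptions rather than the reported configuration.

```python
# Minimal Adagrad + cross-entropy training loop; the simple linear model
# and random tensors below are placeholders for the twin-tower network
# and the PD-Walk distance series.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(40, 72, 78)                       # dummy (B, T, CO) distance series
y = torch.randint(0, 2, (40,))                    # dummy PD / healthy labels
loader = DataLoader(TensorDataset(X, y), batch_size=8, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(72 * 78, 2))  # stand-in classifier
optimizer = torch.optim.Adagrad(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()                 # categorical cross-entropy

for epoch in range(3):                            # a few illustrative epochs
    for batch, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(batch), labels)    # logits vs. true labels
        loss.backward()
        optimizer.step()
```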

Cross-validation and evaluation metrics

Cross-validation is a statistical method for assessing the performance of a predictive model on unseen data. We conducted 5-fold cross-validation on the PD-Walk dataset, ensuring that each sample was used in both training and testing. Following Ref. [26], we divided the data into 5 folds, as shown in Table 3, training the model on 4 subsets (training set) and testing it on the remaining subset (test set). The results of the 5-fold cross-validation are presented in Table 4, with an average identification rate of \(86.8\% \pm 5.0\%\).

In classification problems, the primary performance index is accuracy: the proportion of correctly classified samples out of the total number of samples. For binary classification problems, the commonly used metrics are precision and recall. As these two measures are often at odds, disease diagnosis places more emphasis on recall, the proportion of actual positive cases correctly identified as such. The F1-score, the harmonic mean of precision and recall, measures the overall efficacy of the classifier.

These metrics are defined as follows: Accuracy \(= (TP + TN)/(TP + FN + TN + FP)\), Precision \(= TP/(TP + FP)\), Recall \(= TP/(TP + FN)\), and F1 \(= 2 \times (Precision \times Recall)/(Precision + Recall)\), where TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative samples, respectively. Accuracy, recall, and F1-score were therefore used as evaluation metrics for the classification results.
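These definitions translate directly into code; the confusion-matrix counts in the example are illustrative only, not results from our experiments.

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: 45 PD clips flagged, 6 missed, 40 controls correct,
# 5 false alarms.
print(binary_metrics(tp=45, tn=40, fp=5, fn=6))
```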

Sensitivity analysis

To further verify the effect of parameters on performance, we conducted a sensitivity analysis of the model parameters and hyper-parameters. The results of variations in the model parameters are presented in Table 5.

Table 5 Variations on model parameters

Overall, the sensitivity analysis of the model parameters shows that the proposed model is insensitive to parameter combinations around the benchmark setting, which further demonstrates the robustness of our model. Relatively speaking, the parameters (\(d_{model}\), \(d_{hidden2}\)) have the greatest influence on the results.

Table 6 Variations on hyper-parameters

We present the results of variations in hyper-parameters in Table 6; the learning rate (LR) has a great influence on the results. This is because a suitable learning rate is usually related to the scale of the model, and it is difficult to find the optimal parameter combination if the learning rate is too small or too large. Overall, the sensitivity analysis shows that our model is reasonably stable.

Comparison with state-of-the-art methods

We compare the performance of different methods. From the results in Table 7, it is evident that our model outperforms the other approaches. Reference [26] evaluates three methods on the PD-Walk dataset; as presented in Table 7, the SVM using hand-crafted motion features serves as the baseline, and ST-GCN achieves a better result than the baseline. ADGCN, proposed in Ref. [26], which considers motion information and global connections, reaches 84.1% accuracy. With almost the same configuration as our proposed network, the GTN [49] method achieves 84.5% accuracy. Our model outperforms them all with 86.8% accuracy, 88.4% recall, and an 87.8% F1 score.

Table 7 Overall performance in accuracy, recall, and F1 score

Visualization of the extracted feature

We visualize the result of the embedding layer by outputting the feature vectors from the tensor fusion layer and the linear layer that follows it. We applied t-SNE to reduce the dimension for visualization, as shown in Figs. 8, 9 and 10. Each point in the graph is labeled with a corresponding color code. The labels on the left side of each figure correspond to the labels of the original data, while the labels on the right side correspond to the predicted results. It can be seen from Fig. 8 that before tensor fusion, all the points are clustered together without a clear division. After tensor fusion, the overall distribution of the data follows certain rules, and a clear division emerges between the labels of the predicted results, which is especially evident after the linear layer. These figures demonstrate that our model can project the data into a space where it is easily separable, for improved classification results.

Fig. 8 Visualization of the t-SNE result of the embedding layer output

Fig. 9 Visualization of the features extracted after the tensor fusion layer

Fig. 10 Visualization of the features extracted after the linear layer

Visualization and analysis of the attention map

The attention matrices indicate the correlations between channels and between time steps, respectively. We selected one sample to visualize the channel-wise attention map and the step-wise attention map.

Fig. 11 Visualization of the channel-wise attention map

Fig. 12 Visualization of the step-wise attention map

Fig. 13 Drawing of the raw 78 time series

Fig. 14 (1) Channel-wise attention map (left); (2) channel-wise DTW (right)

Fig. 15 Drawing of the raw 9 time series

The visualization of the channel-wise attention map for a sample is presented in Fig. 11. The intensity of the colors for channels 69 and 72 at epoch 300 is greater than at epoch 25, while the colors for channels 60 and 65 at epoch 300 are less intense than at epoch 25. As the number of epochs increases, the selectivity for salient features is enhanced, and the response values of the attention coefficients in salient areas become larger.

The visualization of the step-wise attention map is depicted in Fig. 12. Similar to the channel-wise attention map, the selectivity for salient features is enhanced, and the response values of the attention coefficients in salient areas become larger and sparser as the number of epochs increases. Additionally, we plot the raw time series of all channels in Fig. 13. We observed that the darker areas correspond to phases of the gait cycle, as shown in Fig. 13.

Furthermore, for the channel-wise attention map, we calculated the dynamic time warping (DTW) distance between the time series of different channels, as illustrated in Fig. 14. The attention map appears broadly consistent with dynamic time warping. Our analysis concentrates on the first row of the channel-wise attention map, which represents the attention scores obtained when treating channel 0 as a query. The attention scores for channels 60, 65, 69, and 72 are larger than those for the others. We plotted the raw time series of 9 channels in Fig. 15 and found that channels 60, 65, 69, and 72, which have higher attention scores, fluctuated more than channels 1, 2, 3, and 4, which have lower attention scores, when channel 0 was used as a query.

For each time step, we also calculated the Euclidean distance between channels, as shown in Fig. 16; since a single time step involves no time axis, DTW is not required. Similarly, our analysis focuses on the first row of the step-wise attention map, which represents the attention scores obtained when treating step 0 as a query. In Fig. 15, channel 72 shows a downward trend at step 0, consistent with the trend observed at steps 40–48, where the time steps have higher attention scores. Later steps, when used as queries, show some correlation with the preceding steps and less with subsequent steps, which aligns with the actual situation: the gait sequence exhibits a certain periodicity, and the initial state reflects the stage of the gait.

Analysis on different channel pairs

We conducted experiments to explore the role of different channel pairs in recognizing Parkinson’s disease. Joints 11–17, which relate to lower limb gait, form the distance channels numbered 58–78, yielding 210 channel pairs. We studied each channel pair separately to determine the best pair for our network’s predictions. The results are displayed in Table 8, where only the first ten and last ten pairs are retained. The top channel pair achieved an accuracy of 86.11%; its visualization is shown in Fig. 17, and it involves a combination of right lower limb joints. Similarly, the visualization of the least effective pair is shown in Fig. 18; the combination of channels 63 and 68 corresponds to the upper body structure, which is more stable.

Additionally, we recorded and visualized the frequency of different joint combinations in the top 10 channel pairs, as shown in Table 9 and Fig. 19. The combination of joints 11 and 15 occurred most frequently, aligning with the findings in Fig. 17. We also documented the frequency of combinations in the bottom 10 channel pairs, as shown in Table 10 and Fig. 20; the combination of joints 12 and 17, part of the upper body structure, had the highest frequency, corroborating the results in Fig. 18. These findings suggest that lower limb movement is crucial for Parkinson’s disease identification.

Fig. 16 (1) Step-wise attention map (left); (2) step-wise L2 distance (right)

Table 8 Overall performance on different channel pairs

Furthermore, the relevant medical literature [50] reports that the limb swing stride of Parkinson’s disease patients is asymmetrical. To study the asymmetry of lower limb swing between healthy individuals and patients, and to validate our preliminary observations of abnormal lower limb movement in Parkinson’s disease, we examined the significance of joint combinations related to limb structure in both groups. The significance test results are presented in Table 11. The distance distribution of the left and right upper arms in Parkinson’s patients showed no significant difference (\(p=0.661>0.05\)), consistent with healthy individuals. In contrast, the distance distribution of the left and right lower arms exhibited a significant difference (\(p=0.03<0.05\)), signifying asymmetrical movement in Parkinson’s patients compared with healthy individuals. The corresponding distance distributions of the lower limbs in Parkinson’s disease were not significantly different (\(p>0.05\)); however, comparing p-values between healthy individuals and patients, we noted the largest discrepancy in the lower limbs, supporting our conclusion that movement differences in the lower extremities of Parkinson’s patients are more pronounced than in healthy individuals.

Therefore, based on these facts, we can conclude that the gait symptoms of Parkinson’s disease are primarily manifested through lower limb movement, particularly in the right leg.

Application of the model

To apply the model in practice, the skeleton joints obtained from any camera video only need an affine transformation to serve as input to our model. An affine transformation combines a linear transformation (multiplication by a matrix) with a translation (addition of a vector), mapping one vector space to another; it represents the mapping between two images [51]. The affine transformation matrix M in Eq. 9 is a \(2\times 3\) matrix: matrix B represents the translation, while the diagonal elements of matrix A determine the scaling and the anti-diagonal elements determine the rotation.

$$\begin{aligned} A = \begin{bmatrix} a_{00} & a_{01} \\ a_{10} & a_{11} \end{bmatrix} , \quad B = \begin{bmatrix} b_{00} \\ b_{10} \end{bmatrix} , \quad M = \left[ A, B \right] = \begin{bmatrix} a_{00} & a_{01} & b_{00} \\ a_{10} & a_{11} & b_{10} \end{bmatrix} \end{aligned}$$
(9)

An original pixel coordinate \((x, y)\) becomes \((u, v)\) after the affine transformation. The transformation formula is as follows:

$$\begin{aligned} \begin{bmatrix} u \\ v \end{bmatrix} = A \cdot \begin{bmatrix} x \\ y \end{bmatrix} + B \end{aligned}$$
(10)

The relationship between the pixels of different images can thus be obtained by affine transformation. In addition, the skeletons used in our model come from videos of subjects walking back and forth facing the camera, so the anti-diagonal elements of A equal 0 whenever the subject walks back and forth towards any camera; moreover, since the distances between skeleton joints are used as the model inputs, the elements of matrix B do not affect the result. Only the diagonal elements therefore matter, and they represent the pixel-scale relationship between the two images. Besides, we can divide the skeleton tensor into several fragments and send each to the network; each fragment is examined, and the fragment results are evaluated comprehensively. This shows that our model can be widely applied.
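A small numerical check of this argument, with illustrative values: under a pure scaling A (zero anti-diagonal, as for a subject walking toward the camera), the translation B cancels out of every pairwise joint distance, and the residual global scale is removed by the per-channel standardization applied to the distance series.

```python
# Illustrative check that joint distances are unaffected by translation B
# and only rescaled by a diagonal (pure-scaling) matrix A; the numbers
# here are arbitrary stand-ins, not values from the paper.
import numpy as np

joints = np.random.rand(13, 2)              # one frame: 13 joints, (x, y)
A = np.diag([1.3, 1.3])                     # pure scaling, zero anti-diagonal
B = np.array([5.0, -2.0])                   # translation
warped = joints @ A.T + B                   # Eq. 10 applied to every joint

d = np.linalg.norm(joints[0] - joints[1])   # joint distance before transform
d_w = np.linalg.norm(warped[0] - warped[1]) # joint distance after transform
print(np.isclose(d_w, 1.3 * d))             # True: B cancels, only scale remains
```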

Fig. 17 Visualization of the channel pair with the highest accuracy

Fig. 18 Visualization of the channel pair with the lowest accuracy

Table 9 Frequency of different combinations in the top 10 channel pairs
Table 10 Frequency of different combinations in the bottom 10 channel pairs
Fig. 19 Visualization of the frequency of different combinations in the top 10 channel pairs

Fig. 20 Visualization of the frequency of different combinations in the bottom 10 channel pairs

Table 11 The results of significance test

Conclusions

In this paper, to realize the automated quantitative assessment of gait motor disorder in PD patients from gait videos, we developed a twin-tower Transformer network with tensor fusion that detects early Parkinson’s disease from the skeleton sequences of patients and healthy controls. The task was transformed into a multivariate time series classification problem by computing joint distances. Specifically, the spatial distances and temporal dynamics of the joints were modeled by the Transformer encoders, and the tensor fusion layer was used to uncover the influence of step features, channel features, and their fusion features. We conducted comprehensive experiments on the PD-Walk dataset, and the preliminary results indicate that our network achieves state-of-the-art performance with 86.8% accuracy. Additionally, we performed visual analyses to enhance the interpretability of our model. Our experiments on different channel pairs and the significance tests lead us to conclude that the gait symptoms of Parkinson’s disease are predominantly characterized by lower limb movement, especially in the right leg. Furthermore, our research extends the modeling of skeletons and the detection of Parkinson’s disease with Transformer networks.