Introduction

Human activity recognition is an important research topic in computer vision; its main task is to identify movements of the human body in various visual scenes. The emergence of image sensors and the immense computational capabilities of AI make human activity recognition feasible. In recent years, methods based on deep learning have achieved better recognition accuracy and generalization performance than traditional methods, but some human behaviors are highly similar to one another, which keeps the recognition rate low.

The human body is a complex structure with multiple degrees of freedom, and a single two-dimensional image or video cannot provide a unique or stable solution. In contrast, 3D skeleton data provides body pose and movement information directly and exhibits better invariance to scale, camera viewpoint and background noise, making it a less ambiguous and more accurate representation of human activity, as shown in Fig. 1. Meanwhile, depth sensors (e.g., Kinect), the availability of pose estimation algorithms (TransPose1,2, TokenPose3) and large-scale standard datasets4,5 have made skeleton-based HAR research extensive.

Figure 1

Human skeleton joint location information.

Originally, researchers extracted handcrafted features from skeleton sequences6,7. With the development of deep learning, various network architectures have been used to process this type of data. Recurrent neural networks (RNNs) have been employed to model the temporal information8,9. Convolutional neural networks (CNNs) represent skeleton data as pseudo-images (which belong to Euclidean space)10,11. GCNs have been widely used because of their strong ability to capture intrinsic relationships between nodes (joints) in non-Euclidean space (the skeleton graph). ST-GCN was the first GCN algorithm to work on 3D skeleton data; it uses spatio-temporal graph convolution kernels for down-sampling to produce classification results12. 2s-AGCN13 generates dynamic skeleton graphs using an adaptive attention mechanism. AGCN enhances the dynamic graph structure of the inference process based on ST-GCN14. CA-GCN relies on context information (it captures the feature of each vertex by integrating information from all other vertices and long-range dependencies among joints) to output classification results15. SS-GCN constructs a novel graph network based on both spatio-temporal and spectral-domain information16. CTR-GCN dynamically learns different topologies and effectively aggregates joint features in different channels for skeleton-based action recognition17. HD-GCN decomposes every joint node into several sets to extract major structurally adjacent and distant edges, and uses them to construct an HD-Graph containing those edges in the same semantic spaces of a human skeleton18. LKA-GCN enlarges the receptive field and improves channel adaptability without adding much computational burden19. DeGCN learns deformable sampling locations on both the spatial and temporal graphs, enabling the model to perceive discriminative receptive fields20. DS-GCN encodes the joint and edge types into the skeleton topology in an implicit way, proposing a joint type-aware adaptive topology and an edge type-aware adaptive topology21.

Although GCN-based methods have made significant progress, two challenges remain: (1) how to model remote (long-distance) dependencies between joints more accurately, thereby alleviating the over-smoothing problem caused by stacked graph convolutions; and (2) how to improve robustness and semantic correlation so as to capture large movements across time sequences. The motivation of our work is therefore to offer a feasible and effective approach to these limitations, which can be summarized as follows:

Motivation 1

The ViT model applies the Transformer architecture to image recognition22: it interprets an image as a sequence of patches and processes it with a standard Transformer encoder as used in NLP. This simple yet scalable strategy works surprisingly well when coupled with pre-training on large datasets. ViT matches or exceeds SOTA methods on many image classification datasets while being relatively cheap to pre-train. Both the lower and the higher layers of ViT can have a large field of view, and global feature information can be obtained in the initial layers, so ViT preserves the integrity of both global and local features and can alleviate the over-smoothing caused by stacked graph convolutions. Skip connections have a strong influence on the propagation and representation of features, so ViT can capture semantic correlations of joints across time sequences. ViT also retains location information while transmitting feature information. By leveraging these advantages of ViT and addressing the shortcomings of GCN-based methods, we innovatively apply ViT to this kind of skeleton data with spatial and temporal characteristics.

Motivation 2

The inputs of ViT and its other applications are 2D images, so ViT cannot be applied directly to 3D skeleton data. In 2s-AGCN, an adaptive matrix is integrated into the adjacency matrix of the graph convolution so that the model can adjust the graph topology adaptively and adapt to different input data. This innovative design greatly improves recognition accuracy on skeleton data. However, its ability to capture hidden information is insufficient, and the product operation is prone to gradient vanishing. We enhance the adaptive graph convolutional layer (AGCL) in 2s-AGCN by replacing the two embedding functions of the trainable adjacency Ck with a scaled dot-product attention normalized by a scale factor, yielding eAGCL. The model can thus automatically learn the connection strength between joints in the sample data while avoiding the gradient vanishing problem.

Our contributions

In summary, our work makes three major contributions:

(1) For the first time, ViT is applied to 3D skeleton data, and a human behavior recognition method, HAR-ViT, is put forward.

(2) The position encoder in ViT is redefined to order the non-sequential information (skeleton data) and to reduce idle spatial coding information.

(3) We propose eAGCL based on the AGCL in 2s-AGCN, improving the utilization of spatial features in our network model.

Finally, our HAR-ViT exhibits competitive performance on three public, authoritative recognition datasets (NTU-RGB+D 60, NTU-RGB+D 120 and Kinetics-Skeleton 400) and outperforms some SOTA methods to a certain extent.

The rest of the paper is organized as follows. Section "Related work" introduces the related work. Section "Materials and methods" introduces the components of the new method proposed in this paper. The ablation study and the comparison with SOTA methods are shown in Section "Experiments and results". Section "Conclusions" concludes the paper.

Related work

This work aims to design a more robust solution for skeleton-based action recognition tasks inspired by ViT. This section discusses some works based on ViT and AGCL in 2s-AGCN.

ViT

ViT attains excellent image classification results compared to SOTA convolutional networks while requiring substantially fewer computational resources to train. It shows that the reliance on CNNs is not necessary and that a pure transformer applied directly to sequences of image patches can perform very well, especially when trained on large-scale training sets. In addition, a model pre-trained on large-scale datasets can also achieve better performance than CNNs when transferred to medium-sized or small datasets.

Even the lowest layers of ViT can attend over large windows through the self-attention mechanism: in the shallow layers the model acquires both local and global characteristics, while in the deep layers it already works with a global view. Skip connections play an important role in the propagation and representation of features, and if they are removed, the accuracy of the model decreases by about 4%. The similarity between the input image and the feature map of the last layer in ViT is very high, which indicates that ViT retains location information while propagating feature information.

Pyramid ViT implements a variable self-attention mechanism through spatial-reduction attention and is applied to ViT models to overcome the quadratic complexity of the attention mechanism24. DINO is a self-supervised training framework based on ViT proposed by Meta's AI team; it can be trained on large-scale unlabeled data and obtains robust feature representations, even without a fine-tuned linear layer25. Scaling ViT, proposed by the Google Brain team by scaling the ViT model up to 2 billion parameters, took first place on the ImageNet recognition benchmark26. SegFormer, proposed by Nvidia, focuses on componentization and adopts a simple MLP decoding model without requiring a position encoder27. UNETR (UNet+ViT) uses a transformer as the encoder to learn sequence representations of 3D medical images and effectively captures global multi-scale information; the transformer encoder is connected directly to a decoder via skip connections at different resolutions to compute the final semantic segmentation output, achieving preferable semantic segmentation28. We apply ViT to 3D skeleton-based human behavior recognition on account of its excellent performance on 2D and 3D image processing, and propose a method named HAR-ViT.

AGCL in 2s-AGCN

The AGCL serves as the fundamental neural network layer of 2s-AGCN, enabling an end-to-end learning approach to optimize skeleton data. The network of AGCL is illustrated in Fig. 2. Its topology and network parameters are designed to enhance flexibility by accommodating unique data graphs for different layers and samples. Additionally, it is constructed as a residual branch to ensure the stability of the original model.

Figure 2

The network structure of AGCL13.

The graph convolution rule of AGCL is shown in Eq. (1), where fout and fin denote the output and input of the layer respectively. Kv represents the kernel size of the spatial dimension, which is set to 3, k denotes the kernel position, and Wk denotes a 1 × 1 convolution. The adjacency matrix is divided into three parts: Ak, Bk and Ck. Ak denotes the fixed connectivity pattern among human joints. Bk represents the connection strength between two joints; its elements are parameterized and optimized together with the other parameters during training. Ck is a data-dependent graph that learns a unique graph for each sample. To determine the existence and strength of connections between two joints, AGCL uses the normalized embedded Gaussian function to calculate the similarity between two joints (Eq. 2), where the dot product measures their similarity in the embedding space. Two embedding functions θ(vi) and φ(vj) are applied to the feature map to rearrange it into matrices with dimensions N × T and T × N respectively; these matrices are then multiplied to obtain an N × N similarity matrix Ck, each element of which represents the similarity between the corresponding joints \(\left({v}_{i}, {v}_{j}\right)\).

$$\begin{array}{c}{f}_{out}=\sum_{k=1}^{{K}_{v}}{W}_{k}{f}_{in}\left({A}_{k}+{B}_{k}+{C}_{k}\right)\end{array}$$
(1)
$$\begin{array}{c}f\left({v}_{i}, {v}_{j}\right)=\frac{{e}^{{\theta \left({v}_{i}\right)}^{T}\varphi \left({v}_{j}\right)}}{{\sum }_{j=1}^{N}{e}^{{\theta \left({v}_{i}\right)}^{T}\varphi \left({v}_{j}\right)}}\end{array}$$
(2)
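To make Eq. (1) concrete, the following PyTorch-style sketch illustrates one kernel position k of the adaptive graph convolution, with Ck built from the embedded Gaussian of Eq. (2). It is an illustration under assumed names and shapes, not the authors' released implementation (the paper's experiments use PaddlePaddle); the full layer sums such branches over k = 1, …, Kv.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AGCLSubset(nn.Module):
    """One kernel position k of Eq. (1): W_k f_in (A_k + B_k + C_k).

    Illustrative sketch only; names, shapes and the embedding size are
    assumptions rather than the authors' PaddlePaddle code.
    """
    def __init__(self, in_channels, out_channels, num_joints, embed_channels=16):
        super().__init__()
        self.W_k = nn.Conv2d(in_channels, out_channels, kernel_size=1)   # 1x1 convolution W_k
        self.theta = nn.Conv2d(in_channels, embed_channels, kernel_size=1)  # embedding theta
        self.phi = nn.Conv2d(in_channels, embed_channels, kernel_size=1)    # embedding phi
        self.B_k = nn.Parameter(torch.zeros(num_joints, num_joints))     # learned connection strength

    def forward(self, f_in, A_k):
        # f_in: (N, C, T, V) skeleton features; A_k: (V, V) fixed skeleton adjacency
        N, C, T, V = f_in.shape
        # Embedded Gaussian similarity of Eq. (2): softmax over joint-to-joint dot products
        th = self.theta(f_in).permute(0, 3, 1, 2).reshape(N, V, -1)      # N x V x (C'T)
        ph = self.phi(f_in).reshape(N, -1, V)                            # N x (C'T) x V
        C_k = F.softmax(torch.bmm(th, ph), dim=-1)                       # N x V x V
        adj = A_k.unsqueeze(0) + self.B_k.unsqueeze(0) + C_k             # A_k + B_k + C_k
        # Aggregate joint features with the combined adjacency, then apply W_k
        agg = torch.einsum('nctv,nvw->nctw', f_in, adj)
        return self.W_k(agg)
```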

The residual connections of Bk and Ck in the model enhance its flexibility and stability without sacrificing the original performance. However, the exponential computation of the Gaussian function ignores hidden information between samples and fails to sufficiently capture the similarity between joints. Moreover, its computational cost is high and it is prone to gradient vanishing. We improve the accuracy of the similarity and decrease the computing cost by replacing the two embedding functions (θ and φ) and normalizing with a scale factor.

Materials and methods

In this paper, we propose an end-to-end HAR-ViT algorithm that creatively applies ViT to the field of human activity recognition. The overall framework of the model is illustrated in Fig. 3.

Figure 3

The overall framework of HAR-ViT.

To address the limitation that ViT cannot process 3D skeleton data directly, this study enhances the AGCL in 2s-AGCN. This enhancement enables the extraction of spatial features encompassing the connection relationships and strengths between joints. These features are then ordered along the temporal axis before being transformed into feature vectors through linear projection.

In order to capture temporal features, a learnable embedding vector is appended to the end of the feature vectors, and a position encoder is integrated with them to provide temporal information. Subsequently, these fused feature vectors serve as input to the transformer encoder, where multiple Transformer Layer networks extract temporal features by means of similarity computation. Finally, all temporal features are compressed for classification by the MLP classifier.
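The overall data flow described above can be summarized by the following minimal sketch. Module names, embedding size, depth and class count are our own placeholders rather than the released implementation; `spatial_extractor` stands for the eAGCL stack described in the next subsection, and `sinusoidal_pe` mimics the positional encoder of Eq. (5) discussed later.

```python
import torch
import torch.nn as nn

def sinusoidal_pe(length, dim):
    # Sinusoidal positions in the spirit of Eq. (5); see "Positional encoder" below.
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    two_l = torch.arange(0, dim, 2, dtype=torch.float32)
    angle = pos / (length ** (two_l / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

class HARViTSketch(nn.Module):
    """High-level data flow of the pipeline described above; hyper-parameters
    and module internals are assumptions, not the authors' configuration."""
    def __init__(self, spatial_extractor, d_model=256, depth=12, heads=8,
                 max_frames=300, num_classes=60):
        super().__init__()
        self.spatial = spatial_extractor                  # eAGCL stack: (N, C, T, V) -> (N, C', T, V)
        self.project = nn.LazyLinear(d_model)             # per-frame linear projection
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))  # appended "*" vector
        self.register_buffer("pos", sinusoidal_pe(max_frames + 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(d_model, num_classes)       # MLP classifier

    def forward(self, x):                                 # x: (N, C, T, V) skeleton sequence
        feat = self.spatial(x)                            # spatial features per frame
        n, c, t, v = feat.shape
        tokens = self.project(feat.permute(0, 2, 1, 3).reshape(n, t, c * v))
        tokens = torch.cat([tokens, self.cls_token.expand(n, -1, -1)], dim=1)
        tokens = tokens + self.pos[: t + 1]               # add temporal position information
        out = self.encoder(tokens)
        return self.head(out[:, -1])                      # classify from the appended token
```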

eAGCL

The internal network structure of eAGCL is shown in Fig. 4. We introduce a novel trainable matrix as a replacement for the two Gaussian functions in Ck. The trainable matrix allows the network parameters of Ck to be optimized for each sample through back-propagation, enhancing the efficiency and effectiveness of learning from the sample data. The similarity between joints is computed using a covariance matrix, presented in Eq. (3), where \(X=\left[\overrightarrow{{x}_{0}}\;\overrightarrow{{x}_{1}}\;\cdots\;\overrightarrow{{x}_{n}}\right]\) is a matrix of d-dimensional feature vectors and \({X}^{T}\) is the transpose of X. The covariance matrix can capture the similarity between different dimensions across elements: the stronger the similarity, the higher the covariance value.

Figure 4

The network structure of eAGCL.

The shape of the skeleton data is \(x\in {R}^{C\times V\times T}\), where C represents the x, y, z coordinates, V denotes the number of joints and T is the length of the skeleton sequence. To convert the skeleton data into d-dimensional feature vectors, a trainable matrix conv is introduced to transform the skeleton data into \(\text{x}\in {\text{R}}^{\text{d}\times \text{T}}\); the elements of this trainable matrix can be regarded as initial weight values and are optimized through back-propagation. The dot products of the generated feature vectors are divided by \(\sqrt{\text{d}}\), and the final weighted values are obtained with softmax. The expression for Ck is shown in Eq. (4). The variance of each element of the dot products depends on d, which can push the softmax function into regions with minimal gradient; to counteract this effect, we scale the dot products by \(\frac{1}{\sqrt{d}}\).

$$\begin{array}{c}{X}^{T}X=\left[\begin{array}{cccc}{x}_{0}\cdot {x}_{0}& {x}_{0}\cdot {x}_{1}& \cdots & {x}_{0}\cdot {x}_{n}\\ {x}_{1}\cdot {x}_{0}& {x}_{1}\cdot {x}_{1}& \cdots & {x}_{1}\cdot {x}_{n}\\ \vdots & \vdots & \ddots & \vdots \\ {x}_{n}\cdot {x}_{0}& {x}_{n}\cdot {x}_{1}& \cdots & {x}_{n}\cdot {x}_{n}\end{array}\right]\end{array}$$
(3)
$$\begin{array}{c}{C}_{k}=softmax\left(\frac{{\text{conv}(\text{x})}^{T}\text{conv}(\text{x})}{\sqrt{\text{d}}}\right)\end{array}$$
(4)
$$\begin{array}{c}\left\{\begin{array}{c}PE\left(pos,2l\right)=\text{sin}\left(\frac{pos}{{T}^\frac{2l}{n}}\right)\\ PE\left(pos,2l+1\right)=\text{cos}\left(\frac{pos}{{T}^\frac{2l}{n}}\right)\end{array}\right.\end{array}$$
(5)
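As a minimal illustration of Eqs. (3) and (4), the data-dependent graph Ck can be produced by a single trainable 1 × 1 convolution followed by a scaled dot-product and softmax. The sketch below treats Ck as a joint-by-joint (V × V) similarity so that it can be added to Ak and Bk in Eq. (1); the names, shapes and this interpretation are our assumptions rather than the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DataDependentGraph(nn.Module):
    """C_k of Eq. (4): softmax( conv(x)^T conv(x) / sqrt(d) ).

    Illustrative sketch; the layer name, the embedding size d and the
    joint-by-joint interpretation of the similarity are assumptions.
    """
    def __init__(self, in_channels, d=64):
        super().__init__()
        self.d = d
        self.conv = nn.Conv2d(in_channels, d, kernel_size=1)   # trainable matrix "conv"

    def forward(self, x):
        # x: (N, C, T, V) skeleton features
        N, C, T, V = x.shape
        emb = self.conv(x).permute(0, 3, 1, 2).reshape(N, V, -1)   # N x V x (d*T)
        # Scaled dot-product similarity between every pair of joints (Eq. 3),
        # normalised per row with softmax and the 1/sqrt(d) factor of Eq. (4).
        scores = torch.bmm(emb, emb.transpose(1, 2)) / (self.d ** 0.5)
        return F.softmax(scores, dim=-1)                           # N x V x V
```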

Positional encoder

Skeleton data is sequential, which necessitates a position encoder to supply the temporal position information inherent in the skeleton data. However, ViT employs learnable embedding vectors as its position encoding, which does not supplement timing information, while the position encoder used in NLP generates significant superfluous positional encoding information when processing skeleton data.

To address this issue, this paper proposes a redefined position encoder, as given in Eq. (5). PE(pos, 2l) denotes the output of the positional encoder, pos represents the position of the feature vector in the skeleton sequence, 2l and 2l + 1 indicate the even and odd embedding dimension indices, and n signifies the dimensionality of the embedding. The sine function is employed for even dimension indices (2l) and the cosine function for odd indices (2l + 1); T corresponds to the maximum length of the skeleton sequence. In order to capture and retrieve the entire sequence information, the model appends a new vector * to the end of the feature vectors, which is then used to embed the position encoder value for each frame.
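A compact NumPy sketch of Eq. (5), under our reading of the notation (frame index pos, even/odd dimension indices 2l and 2l + 1, maximum sequence length T and embedding dimension n), is given below.

```python
import numpy as np

def positional_encoding(T_max, n):
    """Sinusoidal positional encoder of Eq. (5) (a sketch under our reading of
    the notation): T_max is the maximum skeleton sequence length T and n the
    embedding dimension."""
    assert n % 2 == 0, "an even embedding dimension is assumed for simplicity"
    pos = np.arange(T_max)[:, None]              # frame positions
    two_l = np.arange(0, n, 2)[None, :]          # even dimension indices 2l
    angle = pos / (T_max ** (two_l / n))         # pos / T^(2l/n)
    pe = np.zeros((T_max, n))
    pe[:, 0::2] = np.sin(angle)                  # PE(pos, 2l)
    pe[:, 1::2] = np.cos(angle)                  # PE(pos, 2l+1)
    return pe
```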

The relative position within the skeleton sequence can be calculated using Eq. (6) when the offset is m (a positive integer). The deduction yields an elegant dot-product formula that coincides with the standard inner product in Euclidean space. This reveals a useful structure: the encoded result at any shifted position can be decomposed into the dot product of two encoding vectors.

$$\begin{aligned} PE\left(pos+m,2l\right) &= \sin\left(\frac{pos}{{T}^{\frac{2l}{n}}}+\frac{m}{{T}^{\frac{2l}{n}}}\right)\\ &= \sin\left(\frac{pos}{{T}^{\frac{2l}{n}}}\right)\cos\left(\frac{m}{{T}^{\frac{2l}{n}}}\right)+\cos\left(\frac{pos}{{T}^{\frac{2l}{n}}}\right)\sin\left(\frac{m}{{T}^{\frac{2l}{n}}}\right)\\ &= PE\left(pos,2l\right)PE\left(m,2l+1\right)+PE\left(pos,2l+1\right)PE\left(m,2l\right)\\ &= \left(PE\left(pos,2l\right),PE\left(pos,2l+1\right)\right)\odot \left(PE\left(m,2l+1\right),PE\left(m,2l\right)\right)\end{aligned}$$
(6)
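The identity in Eq. (6) can be verified numerically with the positional_encoding sketch above; the position, offset m and dimension pair chosen below are arbitrary.

```python
# Quick numerical check of Eq. (6) using the sketch above (arbitrary values).
pe = positional_encoding(T_max=300, n=64)
pos, m, l = 17, 5, 6                             # frame index, offset, dimension pair
lhs = pe[pos + m, 2 * l]                         # PE(pos+m, 2l)
rhs = pe[pos, 2 * l] * pe[m, 2 * l + 1] + pe[pos, 2 * l + 1] * pe[m, 2 * l]
print(abs(lhs - rhs))                            # ~0 up to floating-point error
```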

This structure makes relative positions easy to obtain, so the frame order can be recovered quickly once the position embedding is introduced. The corresponding heat map is shown in Fig. 5, where the horizontal coordinate represents the dimension and the vertical coordinate represents the frame number.

Figure 5

Visualization heat map of the position encoder.

The Position Encoder structure is illustrated in Fig. 3, where the blue rounded rectangles with frame sequence numbers represent skeleton data and the pink rounded rectangles describe the positional embeddings corresponding to the skeleton data. We introduce a novel frame embedding vector “T+1” and a novel positional embedding vector *, aiming to effectively capture and retrieve comprehensive timing sequence information.

Transformer encoder

The Transformer Encoder, comprising MLP and self-attention modules, transforms input sequences into hidden representations. The initial input consists of the human pose embedding vector for each position, which is then encoded into a fixed-length hidden vector representation through multiple layers of self-attention and fully connected layers. Figure 6 illustrates the structure of the Transformer Encoder.

Figure 6

The structure of Transformer Layer.
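For completeness, a generic sketch of one Transformer Layer (multi-head self-attention followed by an MLP, each with a residual connection) is shown below. The hyper-parameters and the normalization placement are assumptions, not the exact configuration used in HAR-ViT.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """Generic encoder block: multi-head self-attention + MLP, each wrapped
    with layer normalization and a residual connection (assumed settings)."""
    def __init__(self, d_model=256, heads=8, mlp_ratio=4, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model),
            nn.ReLU(),
            nn.Linear(mlp_ratio * d_model, d_model),
        )

    def forward(self, x):                                     # x: (N, T+1, d_model)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]     # self-attention + residual
        x = x + self.mlp(self.norm2(x))                       # MLP + residual
        return x
```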

Multi-layer perceptron (MLP) classifier

The multi-layer neural network is represented by Eq. (7), where W1 and W2 denote the weights of the first and second layers respectively, while b1 and b2 represent the corresponding biases. max(0, x) denotes the ReLU activation function. Through feed-forward propagation, such a network can approximate any continuous or square-integrable function with arbitrary precision, thereby enabling accurate classification of any finite training sample set; the classification capability of the model comes from this feed-forward network. Specifically, the frame embedding vector * is extracted from the end of the Transformer-compressed coding vector and fed into the MLP classifier, as depicted in Fig. 6.

$$\begin{array}{c}FFN\left(x\right)=max\left(0,x{W}_{1}+{b}_{1}\right){W}_{2}+{b}_{2}\end{array}$$
(7)
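Eq. (7) corresponds to a two-layer feed-forward network with a ReLU activation; the following sketch (with assumed dimensions) applies it to the compressed token for classification.

```python
import torch
import torch.nn as nn

class FFNClassifier(nn.Module):
    """Eq. (7): FFN(x) = max(0, x W1 + b1) W2 + b2, applied to the compressed
    token to produce class scores. Dimensions are assumptions."""
    def __init__(self, d_model=256, hidden=1024, num_classes=60):
        super().__init__()
        self.fc1 = nn.Linear(d_model, hidden)        # W1, b1
        self.fc2 = nn.Linear(hidden, num_classes)    # W2, b2

    def forward(self, x):                            # x: (N, d_model) token from the encoder
        return self.fc2(torch.relu(self.fc1(x)))     # max(0, x W1 + b1) W2 + b2
```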

Ethical approval and informed consent

Data used in our study are publicly available, and ethical approval and informed consent were obtained in each original study.

Experiments and results

Skeleton-based action datasets

To demonstrate the effect of the proposed HAR-ViT, four datasets were utilized in this paper: NTU RGB+D 60, NTU RGB+D 120, Kinetics-Skeleton 400 and our homemade data. The brief introduction is as follows.

NTU RGB+D. This is a large-scale human action recognition dataset. NTU RGB+D 60 contains 56,880 sequences over 60 classes. It provides the 3D Cartesian coordinates of 25 joints for each human in an action sample, captured from 3 Microsoft Kinect v2 cameras with different viewpoints. Each action sample is performed by 40 volunteers in different age groups. NTU RGB+D 120 is an extended version of NTU RGB+D 60 with an additional 60 action classes and a total of 113,945 sequences. The datasets are publicly available at https://rose1.ntu.edu.sg/dataset/actionRecognition/. We use the four officially recommended benchmarks for a fair comparison with SOTA methods:

(1) NTU60 cross-subject (NTU60-Xsub): the 40 subjects are divided into training and testing groups.

(2) NTU60 cross-view (NTU60-Xview): data from camera views 2 and 3 are used for training, and data from camera view 1 is used for testing.

(3) NTU120 cross-subject (NTU120-Xsub): the 106 subjects are divided into training and testing groups.

(4) NTU120 cross-setup (NTU120-Xset): data from samples with even setup IDs are used for training, and data from samples with odd setup IDs are used for testing.

Kinetics-Skeleton 400. The skeleton data includes 18 major joints of the human body. It contains more than 300,000 clips covering 400 action categories. Among 260,000 total samples, 240,000 samples are used for training and 20,000 for testing. The datasets can be accessed publicly on https://deepmind.com/research/open-source/kinetics.

Homemade datasets. We utilize 3 cameras with different views to capture 10 action classes; each action is performed by 10 volunteers (from our research laboratory), and the dataset contains 100 videos.

Experiment description

The comparison baseline and platform details are as follows:

Comparison Baseline: In the experiments, the comparison baseline is the representative 2s-AGCN13, consisting of 10 TCN-GCN layers. To prove the effectiveness of our method, the training and test samples of this work are consistent with the baseline. No additional training strategies are applied in this work.

Platform Details: The experimental setup for this study involves an NVIDIA GeForce RTX 2070 SUPER server running Ubuntu 20.04.5 LTS (CPU: 4 cores, memory: 16 GB, video memory: 8 GB), paddle 2.5.1 and CUDA 11.4. Table 1 presents the training configuration parameters used in HAR-ViT.

Table 1 The training configuration parameters of HAR-ViT.

Training comparison with SOTA methods

We compare the training results with three open-source SOTA methods: ST-GCN, 2s-AGCN and DSTA-Net.

The number of model parameters of each compared algorithm is presented in Table 2. The model proposed in this paper achieves a 57.24% reduction in parameter count compared with 2s-AGCN, demonstrating improved efficiency and lightweight characteristics.

Table 2 Comparison of parameter number of four methods.

Experiments on NTU RGB+D 60

In the NTU60-Xsub experiment, the training set and testing set are divided 1:1, ensuring complete independence between the subjects. This assesses the model's capability to recognize unfamiliar subjects. The training results of each compared algorithm are illustrated in Fig. 7. After the same number of training epochs, the average accuracy of HAR-ViT is higher than that of the other three methods, and it reaches a high level soon after 30 training epochs, so the proposed algorithm converges faster.

Figure 7

Training results of compared methods in NTU60-Xsub experiment.

In the NTU60-Xview experiment, the ratio between the training set and the test set is 2:1. Such experiments allow a more specific analysis of the model's effectiveness in recognizing actions from an unconventional angle (view 1) when trained on the other views (views 2 and 3). The training results of each compared algorithm are depicted in Fig. 8. Our proposed model adapts well to diverse shooting angles, exhibiting a notable 5.9% improvement compared to NTU60-Xsub; moreover, the improvement over the other algorithms is more pronounced.

Figure 8

Training results of compared methods in NTU60-Xview experiment.

Experiments on NTU RGB+D 120

In NTU120-Xsub experiment, the ratio between the training set and the test set is 1:1. The results of the four methods are illustrated in Fig. 9. In comparison to NTU RGB+D 60, there is a decrease in average accuracy, which can be attributed to the expanded range of identified categories. Our model demonstrates greater effectiveness when compared with other algorithms.

Figure 9

Training results of compared methods in NTU120-Xsub experiment.

In NTU120-Xset experiment, the training results of each model are depicted in Fig. 10. We can observe that the accuracy of all the models slightly decreased compared to NTU120-Xsub experiment, but the average accuracy of our model is higher than that of 2s-AGCN and other models. This demonstrates that our method exhibits more pronounced advantages when compared with other algorithms.

Figure 10

Training results of compared methods in NTU120-Xset experiment.

Experiments on Kinetics-Skeleton 400

The training results of each model on Kinetics-Skeleton 400 are described in Fig. 11. The accuracy of all the models decreases slightly compared to NTU RGB+D, because the number of action categories covered is 3.3 times that of NTU RGB+D. Nevertheless, the average accuracy of our model is higher than that of 2s-AGCN and the other models, showing that the proposed algorithm has stronger generalization ability.

Figure 11

Training results of compared methods in Kinetics-Skeleton 400 experiment.

Test accuracy comparison with SOTA

To demonstrate the effectiveness and advancements of our HAR-ViT, the test recognition performance is compared with SOTA methods, and the visual classification results of the four compared methods, using the models trained in Section "Training comparison with SOTA methods", are also demonstrated.

Experiments on NTU RGB+D 60

The comparison on NTU-RGB+D 60 is shown in Table 3; the best results are in bold and the sub-optimal results in italics. The recognition accuracy of our HAR-ViT is 2.56% higher under X-Sub and 1.63% higher under X-View than that of 2s-AGCN. DSTA-Net achieves the optimal result under X-Sub by effectively extracting temporal and spatial features through spatio-temporal attention. HAR-ViT exhibits a slight decrease of 0.44% under X-Sub compared with DSTA-Net, potentially due to the balance of spatio-temporal convolution in DSTA-Net. Notably, HAR-ViT outperforms AAM-GCN and LCK-GCN, which simulate remote features using attention mechanisms as in 2s-AGCN, by 0.66% and 0.36% respectively.

Table 3 Comparison of accuracy of nine methods on NTU-RGB+D 60.

Shift-GCN is sub-optimal under X-View, with HAR-ViT outperforming it by 0.23%. The incorporation of the shift graph convolution operation and lightweight point convolution in Shift-GCN enhances its spatial feature extraction capability. However, the eAGCL in HAR-ViT adapts effectively to skeleton data through trainable matrices, mitigating the impact of positional errors and yielding better results than Shift-GCN. Under X-View, HAR-ViT achieves improvements of 0.33% and 0.43% over AAM-GCN and DSTA-Net respectively. AAM-GCN exhibits limited generalization ability across different angles, while DSTA-Net lacks a fixed skeleton graph in its spatial attention mechanism. In contrast, the eAGCL within HAR-ViT maintains a consistent skeleton graph and ensures model stability through residual connections, leading to better performance across different shooting angles than the other algorithms.

Experiments on NTU RGB+D 120

The comparison of six methods on NTU-RGB+D 120 is illustrated in Table 4; the best results are in bold and the sub-optimal results in italics. The recognition performance of our method is 5.81% higher under X-Sub and 4.12% higher under X-Set than the baseline 2s-AGCN.

Table 4 Comparison of accuracy of nine methods in cross-view experiment.

The accuracy of our method is 1.01% and 0.02% higher than that of the sub-optimal DSTA-Net under X-Sub and X-Set respectively. DSTA-Net expands the receptive field through the self-attention mechanism and achieves excellent recognition performance. However, its feature decoupling strategy involves four streams, which leads to expensive computation on large-scale samples. In contrast, our HAR-ViT achieves similar performance using only one stream.

Top 5 test results

The classification performance of the four methods on the four actions “drinking water”, “touch pocket”, “shaking hands” and “punch/clap” on NTU RGB+D is shown in Fig. 12. All four methods correctly identify the four actions; notably, our method exhibits higher confidence than the baseline 2s-AGCN for all four actions, with an average confidence of 94% compared with 80% for 2s-AGCN. Ours is the highest among the four methods, which again illustrates the validity of our method.

Figure 12

The classification results of actions “drinking water”, “touch pocket”,“shaking hands” and “punch/clap”.

The average inference time of the four algorithms on the test data of the standard datasets is presented in Table 5, with the optimal results highlighted in bold and the sub-optimal results in italics. HAR-ViT achieves an average inference time of 4.75 s, which is 3.5 s faster than both 2s-AGCN and DSTA-Net. This improvement can be attributed to HAR-ViT's reduced parameter count and the substitution of exponential Gaussian function calculations with matrix multiplication, which lowers the complexity of inference computations.

Table 5 Mean inference time on the standard datasets.

Experiments on Kinetics-Skeleton 400

The comparison on Kinetics-Skeleton 400 is shown in Table 6; the best results are in bold and the sub-optimal results in italics. Our HAR-ViT also achieves notable performance gains (+2.0% under Top-1 and +2.2% under Top-5) over the baseline 2s-AGCN. The recognition accuracy is 0.3% higher than LKA-GCN and AGCN under Top-1, and 0.4% higher than AAM-GCN under Top-5. It can also be seen that AGCN, which utilizes a multi-stream branch structure and has strong generalization ability, performs better than our method in some cases.

Table 6 Comparison on Kinetics-Skeleton 400 dataset.

The classification performance of the four methods on the four actions “arm wrestling”, “bar tending”, “bending back” and “book binding” under Top-5 on Kinetics-Skeleton 400 is shown in Fig. 13. All four methods correctly identify the four actions; notably, our method exhibits higher confidence than the baseline 2s-AGCN for all four actions, with an average confidence of 54% compared with 39% for 2s-AGCN. Because this dataset covers far more classes than NTU RGB+D, the confidence levels drop significantly for all four methods.

Figure 13

The classification results of actions “arm wrestling”, “bar tending”,“bending back” and “book binding”.

Experiment on real-world datasets

In order to prove the generalization ability of our algorithm, we also test it on homemade datasets in addition to the widely used standard datasets. The classification results for the four actions “clap**”, “brush teeth”, “sneeze/cough” and “salute” are shown in Fig. 14. All four methods recognize the four actions correctly; however, our method exhibits higher confidence than the other three methods, even for the three similar actions (“brush teeth”, “sneeze/cough” and “salute”). Ours demonstrates an impressive average confidence of 96%, while that of 2s-AGCN is 86%.

Figure 14

The classification result of action “clap**”, “brush teeth”,“sneeze/cough” and “salute”.

Table 7 presents the average inference time of the four algorithms on the homemade datasets. The optimal results are highlighted in bold, while the sub-optimal results are in italics. HAR-ViT demonstrates an average inference time that is 4 s and 3 s faster than 2s-AGCN and DSTA-Net, respectively. Since the skeleton data captured in the real environment consists of single actions, its inference time is significantly shorter than that on the standard datasets in Table 5.

Table 7 Mean inference time on the homemade datasets.

Ablation experiment

In order to validate the efficacy of each module in our proposed method, we conduct ablation experiments. The HAR-ViT model combines the strengths of the 2s-AGCN and ViT models. However, 2s-AGCN suffers from gradient vanishing and insufficient capture of hidden information. Hence, we enhance the AGCL in 2s-AGCN, namely eAGCL, and it serves as the baseline; the results of the ablation experiments are presented in Table 8. We can conclude that each module in our method plays an important role. By incorporating eAGCL, ViT's applicability is extended from 2D images to 3D skeleton information. Additionally, the introduction of the position encoder enables ViT to specialize in time-series data. The Transformer encoder efficiently compresses sequence data features to enhance calculation speed.

Table 8 Ablation experiment results of our method.

Increasing the depth of the Transformer encoder leads to excessive complexity and overfitting of the model. Ablation experiments on the depth of the Transformer encoder are shown in Table 9; a depth of 12 achieves optimal performance, and this configuration is adopted in all experiments in this paper.

Table 9 Ablation experiment results of the depth of transformer encoder.

Conclusions

This paper presents HAR-ViT, a novel method for human activity recognition. X-Sub, X-View and X-Set experiments on widely used standard datasets demonstrate that the HAR-ViT model trains quickly and requires fewer parameters. Mathematical analysis and test results further confirm that HAR-ViT achieves SOTA-level performance and higher accuracy than some GCN-based methods.

However, we also found some limitations. This work demonstrates the model prototype on three skeleton-based action datasets and provides a clear plan for further study, but does not test on other specific action datasets. In the future, the scope of our research will widen to include other forms of data, such as depth maps and point clouds. Furthermore, our method cannot handle incomplete skeleton data, because the absence of any joint may lead to inaccurate attention results; this also remains to be studied in future work.