1 Introduction

Most recognition methods are trained and tested on segmented video clips, where each clip contains exactly one complete action. Because the most widely used depth-camera benchmarks provide only such segmented clips, some researchers [1] resort to "simulated continuous action videos" to evaluate their action segmentation methods. The recently proposed MSR 3D Online Action Dataset [2], however, has since become a well-established benchmark used in [3, 4]. It contains videos of complete, natural daily activities, and a single clip may contain more than one action type. In this paper, we focus on the segmentation and recognition of real human motion depth data streams.

In order to exploit the powerful convolutional neural network (CNN), [5] proposed the DMM-Pyramid and DMM-Cube features, which organize a raw depth sequence into formats that convolutional models can accept. Experimental results show that the method achieves robust action recognition accuracy. However, a CNN by itself cannot model changes along the time axis, whereas a recurrent neural network (RNN) [6] can model the samples automatically without destroying their temporal information.

Inspired by the success of RNNs in related fields, we make a first attempt to apply an RNN to human motion recognition on depth data. We use single video frames as training and testing samples, and obtain the recognition result of an entire video after temporal smoothing. Using single frames greatly increases the number of samples, so we cannot directly reuse the intermediate features that CNN models accept. Instead, we use orderlet features and the Grid-based Average Depth (GbAD) proposed in this paper.

Because recognition is performed on single frames, action segmentation becomes straightforward. For each frame we obtain a probability distribution over the action classes, which can be regarded as a per-class score. Following the maximum subarray search method in [7], a dynamic-programming backward search finds the subsequence with the highest cumulative score for a particular action, which completes the segmentation.

The key contributions of this work can be summarized as follows:

  1. We propose to apply recurrent neural networks to recognize human motion from depth data. An RNN can directly model the depth sequence along the time axis and learn the temporal information more naturally.

  2. We propose the GbAD feature to roughly describe the shape of the hand; this feature is sufficient for our task.

  3. We evaluate our models on the MSR 3D Online Action Dataset in comparison with the state-of-the-art methods. Experimental results show that the proposed models outperform the others.

2 Related Work

Li et al. [8] model the dynamics of an action by building an action graph and describe the salient postures with a bag of points (BOPs). It is an effective method, similar to some traditional 2D silhouette-based action recognition methods, but it does not perform well in the cross-subject test because of the significant variations between subjects in the MSR Action3D dataset.

Yang et al. [9] are motivated by the success of Histograms of Oriented Gradients (HOG) in human detection. They extract multi-perspective HOG descriptors from DMMs as representations of human actions, illustrate how many frames are sufficient to build the DMM-HOG representation, and report satisfactory results on the MSR Action3D dataset. Before that, they had proposed an EigenJoints-based action recognition system using an NBNN classifier [10] with the same goal.

To deal with noise and occlusion in depth maps, Wang et al. extract semi-local features called random occupancy pattern (ROP) features [11]. They propose a weighted sampling algorithm to reduce the computational cost and show that their method outperforms an SVM trained on raw data in both accuracy and computational efficiency. They later propose Local Occupancy Pattern (LOP) features [12], which are similar to ROP in some respects and further improve their results.

3 Recurrent Neural Networks

3.1 Brief Introduction of RNN

Recurrent neural networks were originally proposed to process sequence data and to handle tasks that fully-connected or convolutional networks find difficult. For example, the semantic meaning of a word in a sentence is often judged by the words around it, because the words are not independent of each other. Similarly, when judging the action category of the current frame, we have a better chance of high accuracy if we can use information from previous frames, and this is exactly the situation in which an RNN can play to its strengths.

The RNN used in this paper is also known as the Elman network [13]. The input vector X(t) is the concatenation of the feature vector W(t) of the current frame and the output vector \(S(t-1)\) of the hidden layer from the previous step. The network contains an input, a hidden and an output layer and is trained by standard back propagation. The values are calculated by the following formulas:

$$X(t) = [W(t)^T\; S(t-1)^T]^T$$
$$s_j(t) = f\Big(\sum \limits _{i}{x_i(t)\,u_{ji}}\Big)$$
$$y_k(t) = g\Big(\sum \limits _{j}{s_j(t)\,v_{kj}}\Big)$$
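
For concreteness, the following is a minimal sketch of one forward step of such an Elman network in NumPy. The choice of a sigmoid hidden activation f and a softmax output g is our assumption (standard for this family of networks); the paper does not spell out the activations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def elman_forward(w_t, s_prev, U, V):
    """One forward step of the Elman network.
    w_t    : feature vector W(t) of the current frame
    s_prev : hidden-layer output S(t-1) from the previous step
    U, V   : input-to-hidden and hidden-to-output weight matrices
    """
    x_t = np.concatenate([w_t, s_prev])   # X(t) = [W(t)^T S(t-1)^T]^T
    s_t = sigmoid(U @ x_t)                # s_j(t) = f(sum_i x_i(t) u_ji)
    y_t = softmax(V @ s_t)                # y_k(t) = g(sum_j s_j(t) v_kj)
    return s_t, y_t
```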

The cross-entropy criterion is used to obtain an error vector at the output layer, which is then propagated back to the hidden layer. Part of the data is held out as a validation set; after each training epoch, the algorithm decides whether to terminate or to adjust the learning rate according to the results on this set. In our experiments the training generally converges by about the 200th iteration. An open question is whether back propagation alone is sufficient to train such a network, especially under our assumption that the action label of the current frame is affected by previous frames: it is difficult to determine how much useful information is retained in the hidden layer, and this remains a problem for future work.
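
The paper does not give the exact schedule, but a validation-driven schedule in the spirit of [16] (start halving the learning rate once the validation improvement becomes small, then stop) could be sketched as follows; the improvement threshold is an illustrative assumption.

```python
def update_schedule(val_entropy, prev_entropy, lr, halving, min_improve=1e-3):
    """Decide whether to stop, and whether/how to adjust the learning rate.
    val_entropy / prev_entropy : validation cross-entropy after this / the previous epoch
    lr, halving                : current learning rate and whether halving has begun
    Returns (stop, new_lr, halving).
    """
    improvement = prev_entropy - val_entropy
    if improvement < min_improve:
        if halving:
            return True, lr, halving      # already halving and still no gain: stop
        halving = True                    # otherwise start halving from now on
    if halving:
        lr *= 0.5
    return False, lr, halving
```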

3.2 Backpropagation Through Time

Backpropagation through time (BPTT) [14] can be seen as a simple extension of BP. With BPTT, the error is propagated back over a specified number of time steps, so the network learns the temporal information within that range. A concrete implementation can be found in [15].
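
As a rough illustration (not the implementation of [15]), one common variant of a truncated BPTT step for the Elman network above can be sketched as follows, reusing the sigmoid/softmax assumptions from the forward-pass sketch; tau is the number of time steps the error is allowed to flow back.

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def softmax(z): e = np.exp(z - z.max()); return e / e.sum()

def truncated_bptt(U, V, xs, targets, s_init, tau):
    """Gradients of the cross-entropy loss over the last `tau` frames.
    xs      : list of per-frame feature vectors W(t)
    targets : list of one-hot action labels for the same frames
    s_init  : hidden state just before the truncated window
    """
    xs, targets = xs[-tau:], targets[-tau:]
    tau = len(xs)                                          # in case fewer frames are available
    states = [s_init]
    for x in xs:                                           # forward pass through the window
        states.append(sigmoid(U @ np.concatenate([x, states[-1]])))
    dU, dV = np.zeros_like(U), np.zeros_like(V)
    ds_next = np.zeros_like(s_init)
    for t in reversed(range(tau)):                         # backward pass through time
        s, s_prev = states[t + 1], states[t]
        dy = softmax(V @ s) - targets[t]                   # softmax + cross-entropy gradient
        dV += np.outer(dy, s)
        dz = (V.T @ dy + ds_next) * s * (1.0 - s)          # through the sigmoid hidden layer
        dU += np.outer(dz, np.concatenate([xs[t], s_prev]))
        ds_next = U[:, len(xs[t]):].T @ dz                 # error sent to the previous step
    return dU, dV
```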

In addition, Mikolov et al. [16] note that it usually pays to train several networks with different initial weights or different numbers of hidden units and merge them in order to further strengthen the RNN. In the experiments of this paper we also form a linear combination of multiple networks and observe its impact.
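
A minimal sketch of such a linear combination over the per-frame class distributions is given below; the use of uniform mixing weights by default is an assumption, since the paper does not state the combination weights.

```python
import numpy as np

def combine_networks(per_network_probs, weights=None):
    """Linearly combine the class distributions produced by several RNNs.
    per_network_probs : list of arrays, each of shape (num_classes,)
    weights           : optional mixing weights; defaults to a uniform average
    """
    probs = np.stack(per_network_probs)            # (num_networks, num_classes)
    if weights is None:
        weights = np.full(len(probs), 1.0 / len(probs))
    combined = weights @ probs                     # weighted average of the distributions
    return combined / combined.sum()               # renormalize for safety
```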

4 Orderlet Features and Grid-based Average Depth

In this paper, we mainly use orderlet features and Grid-based Average Depth (GbAD) to describe the body skeletons and depth sequences of Kinect data. Orderlet features reflect only the relative order of feature values, so, unlike raw numerical features, they are insensitive to small errors and to differences between human bodies, which makes them well suited to noisy skeleton data. Moreover, orderlet features can be extracted from a single frame, so they can be applied to continuous action recognition. In addition, we present GbAD as a gesture descriptor that roughly describes the shape of the hand.

4.1 Basic Skeleton Feature

In this section, we extract features from single frames. For a given frame \(I_t\), the skeleton is defined as \(S^t=\{s_1^t, s_2^t,\ldots , s_{N_s}^t\}\), where \(s_i^t = (x_i^t, y_i^t, z_i^t)\) is the coordinate of node i at the t-th frame and \(N_s=20\) is the number of joints in a complete skeleton. We use the following three basic features (a small sketch of computing them follows the list):

  • Euclidean distance between two nodes:

    $$\lambda ^{(1)} = ||s_i^t-s_j^t||,$$
  • Simple node coordinate information:

    $$\lambda ^{(2)} = x_i^t\quad \text {or}\quad y_i^t\quad \text {or}\quad z_i^t,$$
  • The position change of the node coordinates in a certain time step (Euclidean distance):

    $$\lambda ^{(3)} = ||s_i^t-s_j^{t-\varDelta }||,$$

    where \(\varDelta \) is the time step length.
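
The sketch below computes the three basic features for a pair of joints; the array layout of the skeleton data (frames x 20 joints x 3 coordinates) is an assumption for illustration, and it assumes \(t \ge \varDelta\).

```python
import numpy as np

def basic_skeleton_features(skeletons, t, i, j, delta):
    """Compute the three basic per-frame features for joints i and j at frame t.
    skeletons : array of shape (num_frames, 20, 3) with joint coordinates
    delta     : time step length for the temporal displacement feature (t >= delta)
    """
    s_t = skeletons[t]
    lam1 = np.linalg.norm(s_t[i] - s_t[j])                    # distance between two joints
    lam2 = s_t[i]                                             # raw coordinates (x, y, z) of joint i
    lam3 = np.linalg.norm(s_t[i] - skeletons[t - delta][j])   # displacement over delta frames
    return lam1, lam2, lam3
```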

Human movement is inherently unstable: the same action performed by the same person may differ from one execution to the next, let alone across different people. Therefore, using the raw values of these basic features is not robust, and in this section the orderlet feature is used to deal with this problem.

Fig. 1. Skeleton orderlet (Color figure online)

As shown in Fig. 1, the absolute distance between the two wrist joints (the two green nodes) may differ somewhat at the beginning, middle and end of the action. However, the rank of this distance among all pairwise node distances is stable at every stage of the action.

We define an orderlet feature p of size n as:

$$p = (O_p,k)$$
$$O_p = [\lambda _{i_1}^f, \lambda _{i_2}^f,\ldots , \lambda _{i_n}^f]$$

where k is the index of the minimum value in the vector \(O_p\) and f denotes one of the three basic features. For a given frame \(I^t\), its value on orderlet p is defined as:

$$v_p(I^t) = \begin{cases} 1, & \text {if } \lambda _{i_k}^f\le \lambda _{i_j}^f \text { for all } \lambda _{i_j}^f\in O_p \\ 0, & \text {otherwise} \end{cases}$$

For a given sequence \(\upsilon \) with T frames, its value on orderlet p is defined as:

$$V_p(\upsilon ) = \sum _{t=1}^Tv_p(I^t)$$
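
A minimal sketch of these two definitions, assuming the basic feature values of the orderlet have already been computed per frame:

```python
def orderlet_frame_value(feature_values, k):
    """v_p(I^t): 1 if the k-th feature value is the minimum of the orderlet, else 0.
    feature_values : basic-feature values [lambda_{i_1}, ..., lambda_{i_n}] for one frame
    k              : index expected to hold the minimum value
    """
    return 1 if all(feature_values[k] <= v for v in feature_values) else 0

def orderlet_sequence_value(per_frame_feature_values, k):
    """V_p(v): sum of the per-frame orderlet values over a sequence of T frames."""
    return sum(orderlet_frame_value(fv, k) for fv in per_frame_feature_values)
```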

4.2 Grid-based Average Depth

When we observe gestures in the videos, we find that the corresponding skeleton information from Kinect is often missing or inaccurate; in other words, skeleton information alone cannot describe a gesture. We therefore present Grid-based Average Depth (GbAD) as a gesture descriptor.

We divide the hand-related region into many grid cells. For each cell, we compute the average depth value of all pixels in it, and the arrangement of these average depth values forms the GbAD descriptor.

Fig. 2. GbAD. (A) A depth frame after preprocessing. (B) The depth frame with grids.

First, the skeleton information from Kinect tells us where the wrists and elbows are, from which we can infer where the hands should be; that is, we can determine the hand-related regions in the depth frame. Second, we divide each hand-related region into grid cells. The rule is simple: taking the wrist as the center, we draw a semicircle and divide it into cells along the radius and the angle. Finally, for each cell we compute the average depth of its pixels, and these averages, in order, form the GbAD descriptor. An example is shown in Fig. 2.
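
A rough sketch of GbAD for one hand is given below. The semicircle radius, the numbers of radial rings and angular sectors, and the orientation of the semicircle are illustrative assumptions, since the paper does not fix them.

```python
import numpy as np

def gbad(depth, wrist, radius=60, n_rings=4, n_sectors=6):
    """Grid-based Average Depth around one wrist (a rough sketch).
    depth   : 2D depth frame (0 where no measurement is available)
    wrist   : (row, col) wrist position taken from the skeleton
    radius  : semicircle radius in pixels; ring/sector counts are illustrative
    """
    rows, cols = np.indices(depth.shape)
    dr, dc = rows - wrist[0], cols - wrist[1]
    dist = np.sqrt(dr ** 2 + dc ** 2)
    angle = np.arctan2(dr, dc)                        # angle of each pixel around the wrist
    feats = []
    for r in range(n_rings):                          # cells along the radius
        for s in range(n_sectors):                    # cells along the angle (one半circle: [0, pi))
            in_ring = (dist >= r * radius / n_rings) & (dist < (r + 1) * radius / n_rings)
            in_sector = (angle >= s * np.pi / n_sectors) & (angle < (s + 1) * np.pi / n_sectors)
            cell = depth[in_ring & in_sector]
            cell = cell[cell > 0]                     # ignore pixels without a depth reading
            feats.append(cell.mean() if cell.size else 0.0)
    return np.asarray(feats)                          # the GbAD descriptor for this hand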

5 Online Motion Recognition

With the orderlet features, GbAD and the RNN classification model, we can already recognize the action of a single frame. However, judging a complete action from one frame alone is not reliable, so we smooth the single-frame recognitions over the whole video stream with a time window. Because different action types (and different performances of the same action) have different durations, a fixed window size is hard to set, so we use a window of dynamic length. First, for each frame we obtain a probability distribution over the action classes. Suppose the current frame t is recognized as action \(\gamma \) with probability \(P(\gamma ,t)\). We define the positive score of frame t for \(\gamma \) as \(P(\gamma ,t)\) and the negative score as \(P(\gamma ,t)-1\). We then use a backtracking search to find the window w that maximizes the accumulated score for action \(\gamma \).

Following the maximum subarray search method, we use dynamic programming to compute the best cumulative score of the current frame for a given action label, avoiding a backward search at every frame. The best cumulative score at frame t is defined as:

$$S(\gamma ,t) = \max (0, S(\gamma ,t-1) + P(\gamma ,t)),\quad t>1$$
$$S(\gamma ,1) = P(\gamma ,1)$$

If the accumulated score \(S(\gamma ,t-1) + P(\gamma ,t)\) falls below 0 (or a specific threshold), \(S(\gamma ,t)\) is reset to 0, and we consider the current action finished and a new action ready to start.
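
The following is a minimal sketch of this online scoring for one action class; it assumes the per-frame scores have already been mapped to the positive/negative scores described above, and the reset threshold is a parameter.

```python
def online_segmentation(frame_scores, threshold=0.0):
    """Online max-subarray-style scoring over a stream of per-frame scores.
    frame_scores : iterable of per-frame scores for one action class gamma,
                   e.g. P(gamma, t) when the frame is recognized as gamma and
                   P(gamma, t) - 1 otherwise
    Yields (frame_index, cumulative_score, action_over) for every frame.
    """
    s = 0.0
    for t, p in enumerate(frame_scores):
        s += p
        action_over = s < threshold
        if action_over:
            s = 0.0                       # reset: the current action segment ends here
        yield t, s, action_over
```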

6 Experiment and Discussion

6.1 Dataset

The MSR 3D Online Action Dataset is designed for testing methods that recognize continuous human actions. It provides three forms of data (depth sequences, RGB videos and human joint coordinates) and contains 7 action types: drinking, eating, using laptop, reading cellphone, making phone call, reading book and using remote. All of these actions are interactions between a person and an object. The dataset also provides manually marked object boundaries for each frame, but in order to draw more general conclusions we did not use these annotations in our experiments.

The dataset is designed to do three experiments:

  • Same-environment action recognition

  • Cross-environment action recognition

  • Continuous action recognition.

6.2 Results

Same-environment action recognition. We first validate our approach on train/test sets sharing the same background environment. The training and test sets contain equal numbers of samples (a 1:1 split), and we use 2-fold cross validation. As shown in Table 1, our results are superior to those of the other methods, which to a certain extent demonstrates the effectiveness of our feature extraction combined with the RNN classification model. To show the advantage of the RNN, we ran the same experiment with an SVM and a random forest; as shown in Table 2, the RNN classification model clearly outperforms both, which suggests that the RNN preserves more motion information by modeling the temporal structure. Under the same experimental conditions we also compared a single network with a combination of several networks; the details are shown in Table 3.

Table 1. Same-environment action recognition results
Table 2. Same-environment action recognition results compared with SVM/RF
Table 3. Same-environment action recognition results compared with different number of networks

Cross-environment action recognition. This experiment demands much higher robustness and generalization ability from the recognition method. As shown in Table 4, most existing methods suffer a sharp drop in accuracy in this test. Our method also does not perform as well as in the same-environment experiment: compared with traditional classifiers, neural networks are more prone to over-fitting (especially when the amount of training data is limited), and the problem is more severe in the cross-environment setting, i.e., the generalization ability is weaker.

Table 4. Cross-environment action recognition results

Continuous action recognition. Unlike the previous tests, here each sample contains multiple actions. In addition to the 7 action types described above, there is one more "background action" type; it has no real semantic meaning and merely connects the individual actions. The length of a single sample ranges from 30 s to 2 min, of which about 30% is background action. Every frame is labeled with an action, but the dataset provider does not guarantee the accuracy of the action boundaries (indeed, it is difficult to define a unified evaluation standard for human action segmentation) (Table 5).

Table 5. Continuous action recognition

7 Conclusion

In this paper we focus on continuous motion recognition, which has greater practical value. The proposed method works on single-frame data and can therefore be used to recognize real-time video streams. We extract orderlet features from the skeleton data, which capture the relative order of feature values rather than their absolute values, and we present GbAD as a gesture descriptor. We then apply an RNN to the recognition task, taking advantage of its self-feedback and propagation in the time domain. We also train several networks with different initial weights and numbers of hidden units and combine them linearly to obtain better recognition accuracy.

Experiments on the benchmark dataset show that the proposed method is superior to several existing methods. However, its weaker performance in the cross-environment experiment indicates that its generalization ability is still limited, which is the point we hope to improve in the future.