1 Introduction

Time series data hold great potential for various prediction tasks [1], and time series classification is one of the most challenging tasks in data mining [2]. A typical time series classification task involves multiple variables, represented by multiple data streams each corresponding to a variable. This is known as multivariate time series classification (MTSC): given a group of time-aligned segments of these data streams, the task is to assign the correct classification label to the group. MTSC has demonstrated its significance in various applications, such as activity recognition [3], disease diagnosis [4], and automatic device classification [5]. Multivariate time series contain temporal information from different sources; hence, measuring the interactions among sources and learning temporal representations are the keys to accurate MTSC [6]. Different tasks impose different requirements on the classifier, making it challenging to build a generally applicable classifier. For example, EEG-signal-based MTSC can target many different goals, such as emotion recognition [7, 8], decoding of cognitive skills [9], investigation of sustained attention, detection of sleep disorders, and decoding of cognitive tasks in brain-computer interfaces. In EEG classification, the performance is jointly sensitive to many factors, such as the number of recording channels (i.e., the feature dimension), the recording length (i.e., the number of features), the number of individuals in each group, the feature extraction method, and the classifier's architecture.

Traditional methods for time series classification include distance-based models (e.g., k-nearest neighbors) and feature-based models (e.g., random forest [10] and support vector machine [11]). These models rely heavily on manually defined features, which are heuristic and task-dependent [12]. It also takes the expertise and considerable time of domain experts to design such features. Furthermore, conventional machine learning (ML) techniques have limitations in processing high-dimensional data and representing complicated functions efficiently [13].

Recently, deep learning (DL) has gained popularity in computer vision, natural language processing, and data mining, thanks to its advantages in capturing complicated, nonlinear relations from massive data [14]. Deep neural networks usually stack multiple neural layers for automatic feature extraction and representation learning [15]. Many neural network architectures, such as Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), Transformer [16], Long Short-Term Memory (LSTM) [17], and Gated Recurrent Unit (GRU) [18], have been applied to time series analysis. In particular, an RNN feeds the previous step's output into the next step to facilitate temporal feature extraction; as a result, it takes a long time to train and cannot support parallel computation. A CNN can extract temporal features and be parallelized during training to fully exploit the power of Graphics Processing Units (GPUs); however, it faces challenges in capturing long-range temporal dependencies and is therefore less used for time series classification. The transformer [16] has recently emerged as a promising solution to multivariate time series classification. While the transformer supports both parallel computing and efficient temporal feature extraction, it requires a massive number of parameters for its multiple fully connected layers, making training extremely time-consuming. Furthermore, the transformer suffers from overfitting on small datasets [19] and faces challenges in capturing short-range temporal information [20]. Besides, existing solutions to MTSC commonly require careful adjustment of architectures and parameters to deal with time series of various lengths. This is a critical yet little-studied issue in existing time series classification research.

To summarize, ML methods are expertise-dependent and limited in representing complicated non-linear functions. Among the DL methods, CNNs are efficient for training and inference but struggle to capture long-range dependencies; RNNs can effectively learn temporal representations over long ranges but are computationally expensive; transformers contain too many parameters, making them prone to overfitting on small datasets. To address the above deficiencies of existing studies, we aim for accurate MTSC that can adapt to time series of various lengths. To this end, we propose a novel CNN architecture called Attentional Gated Res2Net (AGRes2Net) for MTSC. Our model overcomes the shortcomings of the standard CNN architecture by enabling the extraction of both global and local temporal features. It can also leverage multi-granular feature maps through channel-wise and block-wise attention mechanisms. In a nutshell, we make the following contributions in this paper:

  • We propose a novel AGRes2Net architecture for accurate MTSC. Our model can capture dependencies over various ranges and exploit the inter-variable relations to achieve high performance on time series of various lengths, making it feasible for various tasks.

  • We propose two attention mechanisms, namely channel-wise attention and block-wise attention, to leverage multi-granular temporal information for tasks with different data characteristics. The former has advantages on datasets with many variables, while the latter can effectively prevent overfitting on datasets with very few variables.

  • We conducted extensive experiments on 14 benchmark datasets to evaluate the model. A comparison with several baselines and state-of-the-art methods shows the superior performance of our model. Besides, plugging our model into MLSTM-FCN, a state-of-the-art CNN-RNN parallel model, demonstrates the model’s capability to improve existing models’ performance.

The remainder of the paper is organized as follows. Section 2 overviews the related work; Sect. 3 presents the proposed model and attention mechanisms; Sect. 4 reports our experiments and results; and finally, Sect. 5 gives the concluding remarks.

2 Related Work

2.1 Multivariate Time Series Classification

MTSC is a longstanding problem that has been tackled by traditional statistical and ML methods [21,22,23]. A representative example is k-Nearest Neighbors (KNN), which has proven effective for MTSC [24]. Its combination with Dynamic Time Warping (DTW) can achieve even better performance [25, 26]. DL methods are increasingly applied to MTSC, given their capability in automatic feature extraction and learning complex relations from massive amounts of data [27,28,29]. Commonly used DL architectures include Recurrent Neural Networks (RNNs), Gated Recurrent Unit (GRU) [18], Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) [17], and Transformer [16]. Recent studies rely heavily on CNNs to overcome the efficiency and scalability issues of recurrent models (e.g., RNN, LSTM, and GRU) [30,31,32].

Traditionally, CNNs are used for computer vision tasks, such as image recognition [33], object detection [34,35,36], and semantic segmentation [37]. Recent studies [38,39,40,41,42] show that 1D-CNNs are promising for temporal feature extraction: the convolution computation can capture potential temporal patterns, while the information fusion across channels can cope with the inter-relations among variables. Further, Inception [43] uses multiple parallel convolutional kernels of different sizes to address the challenges faced by CNNs in capturing long-range temporal dependencies [44, 45]. However, Inception's receptive field has a restricted width, which limits its ability to capture long-range dependencies.

The combination of CNN and RNN represents an effort to exploit the advantages of both [46]. Hybrid CNN-RNN architectures generally follow a parallel or cascade style to facilitate temporal feature extraction over various ranges. For example, LSTM-FCN [47] uses CNN and RNN in parallel and achieves state-of-the-art performance on several benchmark datasets. However, since LSTM-FCN employs an RNN as a component, it cannot fully leverage the power of GPUs, leading to extended training time. In comparison, the transformer [16] learns both temporal dependencies and inter-variable relations based on positional embeddings and the attention mechanism. It achieves state-of-the-art performance on several time-series datasets [48, 49] but suffers from extended training time and overfitting on small datasets [19] due to its massive number of trainable parameters. It also has difficulty capturing short-range temporal information compared with RNNs.

2.2 Attention Mechanism

The attention mechanism was first used in the seq2seq model for machine translation [18]. A vanilla seq2seq model first feeds the input sequence to an encoder (which consists of multiple recurrent layers) to generate hidden states and outputs. It then collects the hidden states of all the steps to represent the information of the input. During this process, an attention mechanism forces the model to learn weights over the hidden states in the decoder. Thus, the model can focus on specific regions of the input sequence, leading to a significant performance improvement.

Recent studies have designed different attention modules and applied them to various domains [50, 51]. Among them, the Squeeze-and-Excitation Block (SE) [52] is widely used for various tasks thanks to its ease of implementation. SE works in two steps. First, it uses global average pooling to obtain an information vector summarizing the feature maps from different channels. Then, it employs fully connected layers to capture the inter-relations between feature maps, learning per-channel weights that highlight the critical information.
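To make these two steps concrete, the following PyTorch sketch (our own illustration rather than the reference implementation of [52]) shows a minimal SE block for 1-D feature maps; the reduction ratio of 16 is an assumed hyperparameter.

```python
import torch
import torch.nn as nn

class SEBlock1D(nn.Module):
    """Minimal Squeeze-and-Excitation block for 1-D feature maps of shape (B, C, T)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squeeze: global average pooling over the temporal axis -> (B, C)
        s = x.mean(dim=-1)
        # Excitation: learn per-channel weights and rescale the feature maps
        w = self.fc(s).unsqueeze(-1)          # (B, C, 1)
        return x * w
```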

3 Our Approach

We propose Attentional Gated Res2Net for accurate classification of multivariate time series of various lengths. In particular, we incorporate gating and attention mechanisms on top of Res2Net [53], where gates control the information flow across the groups of convolutional filters, and the attention module harnesses the feature maps at different levels of granularity.

The overall architecture of AGRes2Net (shown in Fig. 1) consists of two stages: Convolution and Attention. We illustrate these two stages in the following subsections, respectively.

Fig. 1

The structure of Attentional Gated Res2Net. It consists of two stages: convolution and attention. The convolution stage feeds the input to a convolutional layer for channel expansion and then groups the output along the channel. Each group (except the first) conducts convolution based on its input and its precedent group’s output (passed through gates). The attention stage forces the model to consider the temporal information at different levels of granularity. Finally, the network uses a convolutional layer for channel compression and information fusion

3.1 Convolution Stage

We design the convolution stage based on Res2Net [53], a CNN backbone specially designed to achieve multi-scale receptive fields based on group convolution. Group convolution first appeared in AlexNet [54] and significantly reduced the number of parameters in that model. It has since been adopted in many lightweight networks [55, 56] to generate a large number of feature maps with a small number of parameters.

Unlike conventional CNNs, which use a single set of filters to work on all channels, Res2Net includes multiple groups of filters and uses a separate group to handle each subset of channels. These filter groups are connected in a hierarchical, residual-like style, and they work as follows. First, a convolutional layer takes the input data and outputs a feature map for channel expansion. Then, the feature map is split into groups along the channel, generating groups of feature maps, i.e., input feature maps. Finally, for each input feature map, a separate group of filters extracts features and generates the corresponding output, i.e., an output feature map. In particular, when extracting features from an input feature map, the filter group also takes into account the output of the filter group that comes immediately before it. The whole process repeats until all input feature maps are processed.

Suppose X is the feature map obtained from channel expansion, and X is evenly divided into s groups, \(\{{\mathbf {x}}_i\}_{i=1}^{s}\), where \({\mathbf {x}}_i\) denotes the ith group. Each group is an input feature map with the same temporal size as X but only 1/s of its channels. Let \({\mathbf {K}}_i\) denote the convolution operation of the ith filter group. Then, given an input feature map \({\mathbf {x}}_i\), the convolution output, \({\mathbf {y}}_i\), is calculated as follows:

$$\begin{aligned} {\mathbf {y}}_{i}=\left\{ \begin{array}{ll} {{\mathbf {x}}_{i}} &{} {i=1} \\ {{\mathbf {K}}_{i}\left( {\mathbf {x}}_{i}\right) } &{} {i=2} \\ {{\mathbf {K}}_{i}\left( {\mathbf {x}}_{i}+{\mathbf {y}}_{i-1}\right) } &{} {2<i \leqslant s.} \end{array}\right. \end{aligned}$$
(1)

By feeding the concatenation of all the outputs to a convolutional layer, Res2Net achieves multi-scale receptive fields that facilitate multivariate time series classification. However, it has difficulty controlling the information flow between the feature-map groups: at each step, \({{\textbf {y}}}_{i}\) is always fully passed to the next group, regardless of whether it benefits or harms the model's performance.
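For clarity, the following PyTorch sketch illustrates the hierarchical computation of Eq. (1) for 1-D inputs. It is a simplified illustration under assumed shapes (batch, channels, time) and an assumed kernel size, not the official Res2Net implementation.

```python
import torch
import torch.nn as nn

class Res2NetStage1D(nn.Module):
    """Sketch of the Res2Net-style hierarchical group convolution of Eq. (1)."""
    def __init__(self, channels: int, scale: int = 4, kernel_size: int = 3):
        super().__init__()
        assert channels % scale == 0
        self.scale = scale
        width = channels // scale
        # One separate filter group per feature-map group (none for the first group).
        self.convs = nn.ModuleList([
            nn.Conv1d(width, width, kernel_size, padding=kernel_size // 2)
            for _ in range(scale - 1)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xs = torch.chunk(x, self.scale, dim=1)        # split X into s groups along the channel axis
        ys = [xs[0]]                                   # y_1 = x_1
        for i in range(1, self.scale):
            inp = xs[i] if i == 1 else xs[i] + ys[-1]  # add the precedent output for later groups
            ys.append(self.convs[i - 1](inp))
        return torch.cat(ys, dim=1)                    # concatenation fed to the fusion layer
```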

Addressing this limitation is important as it enables the model to control, in an input-dependent manner, how to weigh the precedent output feature map against the current input feature map. This, in turn, helps mitigate the problem of vanishing gradients without introducing long delays. To this end, we introduce the gating mechanism [31] into Res2Net at the convolution stage to enhance feature extraction. Specifically, in our model (shown in Fig. 1), all groups of feature maps (except the first) are sent to convolutional layers for feature extraction, and a gating unit lies between each pair of adjacent feature-map groups to control how much information flows from the precedent group to the current one. Given a feature-map group (or, more specifically, an input feature map), \({{\textbf {x}}}_{i}\), the value of the corresponding gate, \({{\textbf {g}}}_{i}\), is calculated as follows:

$$\begin{aligned} {\mathbf {g}}_{i}=\tanh \left( a\left( {\text {concat}}\left( a({\mathbf {y}}_{i-1}), a\left( {\mathbf {x}}_{i}\right) \right) \right) \right) . \end{aligned}$$
(2)

where a can be either a fully-connected layer or a 1-D convolutional layer, concat is the concatenation operation, and tanh is the activation function commonly used for gates.

Note that we only use the precedent output feature map \({{\textbf {y}}}_{{i-1}}\) and the current input feature map \({{\textbf {x}}}_{i}\) to calculate the gate; this differs from the gating mechanism in [31]. More specifically, we omit the undivided feature map X, as it contains redundant information and does not significantly improve performance. Eventually, after the convolution stage, we obtain \({\mathbf {y}}_i\) as follows:

$$\begin{aligned} {\mathbf {y}}_{i}=\left\{ \begin{array}{ll} {{\mathbf {x}}_{i}} &{} {i=1} \\ {{\mathbf {K}}_{i}\left( {\mathbf {x}}_{i}\right) } &{} {i=2} \\ {{\mathbf {K}}_{i}\left( {\mathbf {x}}_{i}+{\mathbf {g}}_{i}\cdot {\mathbf {y}}_{i-1}\right) } &{} {2<i \leqslant s}. \end{array}\right. \end{aligned}$$
(3)
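A minimal sketch of the gated convolution stage (Eqs. 2 and 3) is given below, using the same assumed (batch, channels, time) layout as before; the choice of 1-D convolutions for a(·), implemented here as 1x1 convolutions, and the kernel size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedRes2NetStage1D(nn.Module):
    """Sketch of the gated convolution stage (Eqs. 2-3), with convolutional gates."""
    def __init__(self, channels: int, scale: int = 4, kernel_size: int = 3):
        super().__init__()
        assert channels % scale == 0
        self.scale = scale
        width = channels // scale
        pad = kernel_size // 2
        self.convs = nn.ModuleList([
            nn.Conv1d(width, width, kernel_size, padding=pad) for _ in range(scale - 1)
        ])
        # a(.) applied to y_{i-1} and x_i separately, then to their concatenation (Eq. 2).
        self.a_y = nn.ModuleList([nn.Conv1d(width, width, 1) for _ in range(scale - 2)])
        self.a_x = nn.ModuleList([nn.Conv1d(width, width, 1) for _ in range(scale - 2)])
        self.a_cat = nn.ModuleList([nn.Conv1d(2 * width, width, 1) for _ in range(scale - 2)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xs = torch.chunk(x, self.scale, dim=1)
        ys = [xs[0]]                          # y_1 = x_1
        ys.append(self.convs[0](xs[1]))       # y_2 = K_2(x_2)
        for i in range(2, self.scale):        # groups with 2 < i <= s (0-indexed here)
            g = torch.tanh(self.a_cat[i - 2](torch.cat(
                [self.a_y[i - 2](ys[-1]), self.a_x[i - 2](xs[i])], dim=1)))
            ys.append(self.convs[i - 1](xs[i] + g * ys[-1]))   # Eq. (3)
        return torch.cat(ys, dim=1)
```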

3.2 Attention Stage

The convolution stage only considers the information flow between adjacent feature-map groups. As such, it limits the model's ability to capture the dependencies between groups that are far apart. In this regard, we design an attention stage that allows the model to attend to specific parts of the output feature maps. In particular, we propose two types of attention modules, namely the channel-wise attention module and the block-wise attention module, to effectively harness multi-granular temporal patterns.

3.2.1 Channel-wise Attention

Channel-wise attention captures the relations between channels of the convolution stage’s output, i.e., output feature maps, \(\{{\mathbf {y}}_i\}_{i=1}^{s}\), where s is the number of feature-map groups in the convolution stage.

Suppose every \({{\textbf {y}}}_{i}\) contains the same number of channels, say J channels; this is reasonable as we divide the original feature map X evenly along the channel dimension. Let \({{\textbf {h}}}_{{i,j}}\) be the feature map for the jth channel of \({{\textbf {y}}}_{i}\). We use three fully-connected layers to learn the query, key, and value of \({{\textbf {h}}}_{{i,j}}\) (denoted by \({{\textbf {q}}}_{{i,j}}\), \({{\textbf {k}}}_{{i,j}}\), and \({{\textbf {v}}}_{{i,j}}\), respectively). Similarly, we denote by \({{\textbf {q}}}_{{m,n}}\), \({{\textbf {k}}}_{{m,n}}\), and \({{\textbf {v}}}_{{m,n}}\) the query, key, and value of \({{\textbf {h}}}_{{m,n}}\), the feature map for the nth channel of \({{\textbf {y}}}_{m}\). Given two different feature maps, \({{\textbf {h}}}_{i,j}\) and \({{\textbf {h}}}_{{m,n}}\), we calculate the channel-wise attention as follows:

$$\begin{aligned} \mathbf {attention}\left( {\mathbf {q}}_{i, j}, {\mathbf {k}}_{m, n}\right) =\frac{{\mathbf {q}}_{i, j} {\mathbf {k}}_{m, n}^{T}}{\sqrt{J}} \end{aligned}$$
(4)

Once computed, we can update the feature map of every channel according to its relations with all the other feature maps. As the feature maps contain temporal information within various ranges, channel-wise attention can capture temporal dependencies at multiple levels of granularity. Based on the above, the updated feature map \(\tilde{{\mathbf {h}}}_{i,j}\) can be calculated as follows:

$$\begin{aligned} \tilde{{\mathbf {h}}}_{i,j}= \sum _{m=1}^{s} \sum _{n=1}^{J} {\text {Softmax}}\left( \frac{\mathbf {attention}\left( {\mathbf {q}}_{i, j}, {\mathbf {k}}_{m, n}\right) }{\sum _{m=1}^{s} \sum _{n=1}^{J} \mathbf {attention}\left( {\mathbf {q}}_{i, j}, {\mathbf {k}}_{m, n}\right) }\right) {\mathbf {v}}_{m, n} \end{aligned}$$
(5)

Given s output feature maps each having J channels with k dimensions, the total number of feature maps for channel-wise attention is \(s \times J\), resulting in the computational complexity of \({\mathcal {O}}\left( (s \times J)^{2} k\right) \).
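The following PyTorch sketch illustrates channel-wise attention over the s × J per-channel feature maps. It is our own simplification: the groups are flattened into a single token axis, and the extra normalization term inside Eq. (5) is replaced by a plain scaled dot-product softmax for brevity.

```python
import torch
import torch.nn as nn

class ChannelWiseAttention(nn.Module):
    """Sketch of channel-wise attention (Eqs. 4-5, simplified to a standard softmax)."""
    def __init__(self, seq_len: int, j_channels: int):
        super().__init__()
        self.scale = j_channels ** 0.5          # sqrt(J), as in Eq. (4)
        self.q = nn.Linear(seq_len, seq_len)
        self.k = nn.Linear(seq_len, seq_len)
        self.v = nn.Linear(seq_len, seq_len)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, s*J, T) -- every channel of every output feature map is one token
        q, k, v = self.q(h), self.k(h), self.v(h)
        scores = q @ k.transpose(-2, -1) / self.scale   # (B, s*J, s*J): cost O((sJ)^2 T)
        weights = torch.softmax(scores, dim=-1)
        return weights @ v                              # updated per-channel feature maps
```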

3.2.2 Block-wise Attention

Block-wise attention regards each \({{\textbf {y}}}_{i}\) as an individual block that contains temporal information at a certain granularity. Instead of calculating attention values along the channel, block-wise attention directly feeds \({{\textbf {y}}}_{i}\) to the fully-connected layers to calculate the corresponding query, key, and value. Block-wise attention has advantages in mitigating overfitting as it considers sparse relations when computing the attention.

Suppose \({{\textbf {y}}}_{i}\) and \({{\textbf {y}}}_{m}\) are two output feature maps. We denote by \({{\textbf {q}}}_{{i}}\), \({{\textbf {k}}}_{{i}}\) and \({{\textbf {v}}}_{{i}}\) the query, key and value of \({{\textbf {y}}}_{i}\); similarly, we denote by \({{\textbf {q}}}_{{m}}\), \({{\textbf {k}}}_{{m}}\) and \({{\textbf {v}}}_{{m}}\) the query, key and value of \({{\textbf {y}}}_{m}\). Then, we calculate the block-wise attention as follows:

$$\begin{aligned} \mathbf {attention}\left( {\mathbf {q}}_{i}, {\mathbf {k}}_{m}\right) =\frac{{\mathbf {q}}_{i} {\mathbf {k}}_{m}^{T}}{\sqrt{s}} \end{aligned}$$
(6)

Once computed, we can update the feature map of every block according to its relations with all the other feature maps. The updated feature map for each block, \(\tilde{{\mathbf {y}}}_{i}\), is calculated as follows:

$$\begin{aligned} \tilde{{\mathbf {y}}}_{i}= \sum _{m=1}^{s} {\text {Softmax}}\left( \frac{\mathbf {attention}\left( {\mathbf {q}}_{i}, {\mathbf {k}}_{m}\right) }{\sum _{m=1}^{s} \mathbf {attention}\left( {\mathbf {q}}_{i}, {\mathbf {k}}_{m}\right) }\right) {\mathbf {v}}_{m} \end{aligned}$$
(7)

Given s feature maps, each having J channels with k dimensions, the computational complexity of block-wise attention is \({\mathcal {O}}\left( s^{2} Jk\right) \).
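Analogously, the sketch below treats each output feature map \({{\textbf {y}}}_{i}\) as a single flattened token; the flattening and the replacement of Eq. (7)'s inner normalization by a standard softmax are again our own simplifications.

```python
import torch
import torch.nn as nn

class BlockWiseAttention(nn.Module):
    """Sketch of block-wise attention (Eqs. 6-7, simplified to a standard softmax)."""
    def __init__(self, block_dim: int, num_blocks: int):
        super().__init__()
        self.scale = num_blocks ** 0.5          # sqrt(s), as in Eq. (6)
        self.q = nn.Linear(block_dim, block_dim)
        self.k = nn.Linear(block_dim, block_dim)
        self.v = nn.Linear(block_dim, block_dim)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (B, s, J*T) -- each block y_i is flattened into one vector
        q, k, v = self.q(y), self.k(y), self.v(y)
        scores = q @ k.transpose(-2, -1) / self.scale   # (B, s, s): cost O(s^2 JT)
        weights = torch.softmax(scores, dim=-1)
        return weights @ v                              # updated block representations
```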

4 Experiments

This section reports our extensive experiments to evaluate our proposed approach, including comparisons against baselines, ablation studies, and parameter studies on several public time-series datasets. We demonstrate that our approach can be used as a plugin to improve the performance of state-of-the-art methods and provide practical advice on how to adapt our approach to a specific problem.

4.1 Datasets

We conducted experiments on 14 public multivariate time series datasets (summarized in Table 1). These datasets cover various tasks from different application domains, such as activity recognition, EEG classification, and weather forecasting. They contain time series of various lengths with different numbers of variables. We carefully selected these datasets to cover applications in various domains and to ensure sufficient diversity in sequence length and variable number, so that they reflect different difficulty levels in real-world multivariate time-series classification problems.

Table 1 A list of our experimental datasets

4.2 Baseline Methods

We selected several competitive baselines and state-of-the-art (SOTA) methods to compare with our approach.

  • Res2Net [53]: this is a CNN backbone that uses group convolution and hierarchical residual-like connections between convolutional filter groups to achieve multi-scale receptive fields.

  • GRes2Net [31]: this work incorporates gates in Res2Net, where the gates’ values are calculated based on a different method from ours—it additionally takes into account the original feature map before it is divided into groups when calculating gates’ values.

  • Res2Net+SE: this work combines Res2Net with a Squeeze-and-Excitation Block (SE) [52] to leverage the effectiveness of attention modules.

  • GRes2Net+SE: this work combines GRes2Net with SE to leverage the effectiveness of attention modules.

We briefly introduce the SOTA methods for the experimental datasets below. A full list of SOTA methods is given in Table 1.

  • MLSTM-FCN [47]: a multivariate LSTM fully convolutional network that concatenates the outputs of two parallel blocks: a fully convolutional block (embedded with SEs) and an LSTM block. It is a variant of LSTM-FCN.

  • MALSTM-FCN [47]: a multivariate attention LSTM fully convolutional network, which resembles MLSTM-FCN but replaces LSTM cells with attention LSTM cells.

  • MUSE [58]: a model that extracts and filters multivariate features by encoding context information into each feature.

  • InceptionTime [60]: a CNN-based model transferred from computer vision to time series classification, which stacks multiple parallel convolutional filters for temporal feature extraction.

  • Time Series Forest [21]: an ensemble tree-based method that employs a combination of entropy gain and a distance measure to evaluate the differences between time-series sequences.

  • Canonical Interval Forest [61]: a model that refines Time Series Forest by upgrading the interval-based component.

  • Dynamic Time Warping (DTW) [62]: a traditional distance-based machine learning method for time series analysis.

  • Random Convolutional Kernel Transform (ROCKET) [63]: a CNN-based model that uses random convolutional kernels to extract multi-granular temporal features.

4.3 Model Configuration and Evaluation Metric

Table 2 Experiment configuration settings

We followed the preprocessing procedures described in the SOTA works. In particular, we normalized each dataset to zero mean and unit standard deviation. We also applied zero padding to cope with sequences of various lengths in the same training set. The experimental results of each method were obtained under the optimal or suggested settings provided in the original papers.
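As a rough sketch of this preprocessing (per-variable z-normalization followed by zero padding; the per-variable granularity and the helper name are our own assumptions), one might write:

```python
import numpy as np

def preprocess(sequences):
    """Z-normalize each series per variable and zero-pad to the longest length in the set.
    `sequences` is a list of arrays of shape (num_variables, length_i)."""
    max_len = max(seq.shape[1] for seq in sequences)
    out = []
    for seq in sequences:
        normed = (seq - seq.mean(axis=1, keepdims=True)) / (seq.std(axis=1, keepdims=True) + 1e-8)
        padded = np.zeros((seq.shape[0], max_len), dtype=np.float32)
        padded[:, :seq.shape[1]] = normed   # pad the tail with zeros
        out.append(padded)
    return np.stack(out)                    # (num_sequences, num_variables, max_len)
```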

To ensure a fair comparison, we configured all the models based on Res2Net, GRes2Net, and our approach to contain the same number of feature-map groups and to use identical filters for each group.

We used our model as the backbone for feature extraction and trained it for 500 epochs using the Adam optimizer.

4.11 Effectiveness of Our Model as a Plugin

We use MLSTM-FCN, the SOTA architecture on most datasets (as shown in Table 1), to demonstrate the effectiveness of our model as a plugin. The original MLSTM-FCN follows a CNN-LSTM parallel architecture: the input goes through multiple LSTMs and CNNs, and the outputs are concatenated and passed through a fully connected layer for information fusion. We conducted this experiment by replacing the original convolutional modules of MLSTM-FCN with our model while preserving the architecture and all other parts of MLSTM-FCN.

We show the comparison results on two datasets, AREM and Gesture Phase, to demonstrate the impact of our model on the overall performance of MLSTM-FCN. Specifically, we adopted block-wise attention on the AREM dataset and channel-wise attention on the Gesture Phase dataset; the choice was arbitrary. We omit the results on the other datasets as they lead to similar conclusions.

The results (Fig. 3) show a significant improvement in the classification accuracy of MLSTM-FCN on both datasets after the replacement, demonstrating the positive effect of our model on the performance of existing multivariate time series classification models when used as a plugin.

Fig. 3

Accuracy comparison between the vanilla MLSTM-FCN (blue bar) and the MLSTM-FCN where our model replaces the convolutional modules (orange bar). Block-wise attention and channel-wise attention are applied to the AREM dataset and the Gesture Phase dataset, respectively

4.12 Exploring the Threshold for Choosing Channel-wise Attention and Block-wise Attention

As discussed in the previous section, channel-wise attention performs better on datasets with many variables, whereas block-wise attention performs better on datasets with few variables. This section further explores whether a standard threshold exists for choosing the proper attention module. We selected two datasets, LSST and HeartBeat, for this experiment because they contain many variables and channel-wise attention outperforms block-wise attention on them; we then apply dimension reduction to tune the number of variables and find the point at which block-wise attention starts to perform better. Specifically, we use PCA to gradually reduce the number of variables. The results can be seen in Tables 11 and 12.

According to the results, the thresholds of the two datasets differ (3 for LSST and 2 for HeartBeat). Besides, when the variable number is reduced to 3 on LSST, the performance of both attention modules decreases significantly, making the results less convincing. Moreover, from the results given in Table 3, we can see that block-wise attention performs better on the FingerMovements dataset, while channel-wise attention outperforms block-wise attention on the ECG dataset; however, FingerMovements contains 28 variables, while ECG contains only 2. To summarize, the threshold is case-by-case, and a standard threshold does not exist. Although channel-wise attention is generally preferable for datasets with many variables (such as SelfRegulationSCP2, Action 3D, and DuckDuckGeese), we still need to conduct empirical studies on each dataset to choose the proper attention module.

Table 11 Performance comparison based on the different variable numbers on LSST
Table 12 Performance comparison based on the different variable numbers on HeartBeat
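A hedged sketch of how such a PCA-based variable reduction could be implemented is shown below; treating every time step of every sequence as one PCA sample, and the function name, are our own assumptions about the procedure.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_variables(X: np.ndarray, n_components: int) -> np.ndarray:
    """Reduce the variable dimension of a dataset X of shape (N, V, T) to n_components."""
    N, V, T = X.shape
    samples = X.transpose(0, 2, 1).reshape(-1, V)   # every time step is one V-dimensional sample
    pca = PCA(n_components=n_components).fit(samples)
    reduced = pca.transform(samples)                # (N*T, n_components)
    return reduced.reshape(N, T, n_components).transpose(0, 2, 1)
```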

4.13 Practical Advice

We offer several suggestions on applying our model to broader scenarios based on the above experimental results and our analysis:

  • Avoid very deep models: a wider model is generally more capable than a deeper model of addressing a general multivariate time series classification task. When faced with a new problem, we should prioritize constructing wider models rather than stacking more layers.

  • Focus on tuning the hyperparameter s: setting a larger s increases the number of convolutional-filter groups, leading to multiple receptive fields that capture temporal patterns in various ranges. Tuning the hyperparameter s is especially important for long time-series sequences to achieve the best possible performance. It is generally worthwhile to tune s ahead of investigating the optimal settings of other parameters.

  • Choose the attention module based on the number of variables: based on our experiments, the number of variables is by far the most useful single criterion for deciding which attention module to choose for our model. As discussed, block-wise attention is preferred for sequences with a small number of variables, and channel-wise attention is more suitable for sequences with many variables. Other criteria include the number of sequences available for training, the number of classes, and the sequence length, which must be considered case by case.

5 Conclusion and Future Work

In this paper, we propose a novel deep learning architecture called Attentional Gated Res2Net for accurate multivariate time series classification. Our model comprehensively incorporates gates and two types of attention modules to capture multi-granular temporal information. We evaluate the model on diverse datasets that contain sequences of various lengths with a wide range of variable numbers. Our experiments show the model outperforms several baselines and state-of-the-art methods by a large margin. We thoroughly investigate the effect of different components and settings on the model’s performance and provide hands-on advice on applying our model to a new problem. Our test on plugging the model into a state-of-the-art architecture, MLSTM-FCN, demonstrates the potential for using our model as a plugin to improve the performance of existing models.

However, our attention modules increase the training and inference time when facing time series with many variables. Although dimension reduction algorithms can alleviate this cost, they negatively affect the classification accuracy. In the future, we aim to explore a pluggable feature selection module that selects essential variables, hence accelerating the training and inference process. Besides, our model still relies on manual fine-tuning for different datasets. We wish to make our model dynamic instead of static, so that it automatically adapts to a given dataset.