Introduction

Multivariate time series (MTS) forecasting aims to model the dynamic evolution of multiple variables over time from their historical records, thereby enabling accurate prediction of future values. This task holds significant importance in various applications1. The advent of deep learning has significantly advanced MTS forecasting, with various methods proposed based on Recurrent Neural Networks (RNN)2,3,4 and Convolutional Neural Networks (CNN)5,6. In recent years, the Transformer has demonstrated remarkable efficacy in MTS forecasting, a performance attributable to its inherent capacity to capture global temporal dependencies across the entire historical sequence. The Transformer was initially proposed for natural language processing (NLP)7 tasks, where its primary objective is to extract information from words within sentences. After its success in NLP, it swiftly extended its impact to computer vision8 and time series analysis9.

The critical challenge in applying Transformers to MTS forecasting is to adapt the attention mechanism to the characteristics of time series data. Informer10 introduced a ProbSparse attention mechanism that restricts each key to interact only with a fixed number of the most significant queries. It employs a generative-style decoder that produces predictions for multiple time steps in a single forward pass, aiming to capture temporal dependencies at the level of individual time steps. Autoformer11 proposed an Auto-Correlation mechanism to model temporal dependencies at the sub-series level and introduced a time series decomposition block that uses moving averages to separate seasonal and trend components from raw MTS data. FEDformer12 utilizes a Frequency Enhanced Block and Frequency Enhanced Attention to discern patterns in the frequency domain, achieving linear computational complexity by sampling a limited number of Fourier components for the attention mechanism's queries, keys, and values. Pyraformer13 introduces a pyramidal graph organized as a tree structure in which each parent node has several child nodes; it captures data patterns at multiple scales by letting each parent node merge the sub-series of its child nodes, so the scale of the graph grows exponentially with the number of child nodes. Crossformer14 introduced a Two-Stage Attention mechanism that applies full attention across both the time-step and variable dimensions, together with a three-level framework that captures hierarchical latent representations of MTS data by merging tokens produced at lower levels. Dlinear15 challenges the effectiveness of the Transformer by proposing a simple linear layer that maps historical records directly to predictions, outperforming the aforementioned Transformer-based methods on benchmark datasets. MICN6 introduced a CNN-based framework with linear computational complexity; its local-global structure captures long-term temporal dependencies in MTS data and surpasses both the full attention and Auto-Correlation mechanisms11, and it incorporates a multi-scale hybrid decomposition block that uses various kernel sizes to decompose MTS data into seasonal and trend components. PatchTST16 utilizes a full attention mechanism to model the temporal dependencies of individual variables at the sub-series level. It proposes a patching procedure that divides the input sequence into multiple time-step patches, significantly reducing computational complexity, and adopts self-supervised learning to generate latent representations of MTS data.

Figure 1

(a) The heatmap illustrates the correlation among 168 time steps for the ETTh1, ETTm1, Weather, and Exchange-Rate datasets. (b) Full Attention: Generates predictions using the entire historical sequence. Local Component: Utilizes time steps within a specified window. Stride Component: Utilizes time steps at fixed intervals from the target. Vary Component: Adaptively uses historical time steps as the forecasting horizon increases.

Nevertheless, these methods generate predictions for all future time steps from the entire historical sequence, ignoring the fact that the choice of look-back window over the historical time steps is critical to generating accurate predictions. For example, predicting the value at horizon 1 from the entire historical sequence is suboptimal and inefficient. Figure 1b illustrates the full attention mechanism, which generates predictions for all O future time steps from the entire historical sequence of length I. These methods also ignore characteristics of MTS data such as locality and seasonality. Figure 1a shows the heatmap of correlations among 168 time steps of four datasets, which indicates clear locality and seasonality. An attention mechanism should only utilize those time steps that have a high correlation with the prediction target.

To address these issues, we propose Dozerformer, a novel framework that incorporates an innovative attention mechanism comprising three key components: Local, Stride, and Vary. Figure 1b illustrates these components, which leverage historical time steps based on the distinctive characteristics of MTS data. The Local component exclusively considers time steps within a window determined by the target time step. The Stride component selectively employs time steps that are at fixed intervals from the target time step. Lastly, the Vary component dynamically adjusts its use of historical time steps according to the forecasting horizon, employing a shorter historical sequence for shorter horizons and a longer one for longer horizons. The contributions of our work are summarized as follows:

  • We introduce a sequence-adaptive sparse Dozer Attention mechanism composed of three components: Local, Stride, and Vary. Each component aims to capture temporal dependencies from a select number of effective historical time steps. Notably, the Vary component dynamically expands its utilization of historical time steps as the forecasting horizons extend. To the best of our knowledge, Dozer Attention is the first mechanism that adaptively utilizes historical time steps based on the forecasting horizons. It addresses the limitation of existing attention mechanisms by overcoming their quadratic computational complexity, thereby enhancing efficiency.

  • We propose the Dozerformer framework for MTS forecasting, incorporating the Dozer Attention mechanism designed to model and predict the seasonal components within MTS data. It comprises a transformer encoder-decoder pair coupled with a linear layer, yielding a straightforward yet effective architecture. The Dozerformer framework achieved a slight improvement in accuracy and a significant improvement in efficiency.

  • The experimental results on nine benchmark datasets showcase Dozerformer’s superior performance in terms of both accuracy and efficiency when compared to recent state-of-the-art methods. Additionally, our extensive ablation studies demonstrate the effectiveness and robustness of the proposed Dozer Attention mechanism.

Related work

MTS forecasting

MTS forecasting involves predicting values (\(X_{pred} \in \mathbb {R}^{O\times D}\)) of D variables for future O time steps, given their historical records within the look-back window (\(X\in \mathbb {R}^{I\times D}\)) spanning I time steps. Deep learning has significantly advanced MTS forecasting. LSTNet2 employs CNN and RNN to capture both short-term and long-term temporal dependencies. TPA-LSTM5 introduces a temporal pattern attention mechanism to weight hidden states across historical time steps. Graph WaveNet17 pioneers the use of Graph Neural Networks (GNN) for MTS forecasting, jointly training an adjacency matrix to model spatial dependencies. MTGNN18 is the first GNN-based method for generic MTS forecasting; it utilizes GNN to model correlations among multiple variables and CNN with different filter sizes to model multi-scale temporal dependencies. Transformers9 gained popularity in MTS forecasting due to their capacity to capture long-term temporal dependencies: the attention mechanism models temporal dependencies across all historical time steps, whereas CNNs must be stacked to enlarge their receptive field, RNNs must iterate through all time steps, and GNNs are sensitive to the adjacency matrix that encodes correlations between variables. MICN6 replaces self-attention with a CNN-based local-global structure to model temporal dependencies across scales efficiently. Dlinear15 challenges Transformer effectiveness with a simple linear layer, achieving state-of-the-art performance. PatchTST16 highlights the significance of transformers in modeling temporal dependencies at sub-series resolution, independently of variables. In summary, transformers have pushed the boundaries of MTS forecasting.
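
To make the notation concrete, the following minimal sketch (illustrative only, not code from any of the cited methods) shows how (look-back, horizon) training pairs are typically built from a raw multivariate series of shape (T, D): each sample pairs I historical steps with the O steps that follow them.

```python
# A minimal sketch of sliding-window sample construction for MTS forecasting.
import numpy as np

def make_windows(series: np.ndarray, I: int, O: int):
    """series: (T, D) array; returns inputs (num, I, D) and targets (num, O, D)."""
    T = series.shape[0]
    inputs, targets = [], []
    for start in range(T - I - O + 1):
        inputs.append(series[start:start + I])           # look-back window X
        targets.append(series[start + I:start + I + O])  # future values X_pred
    return np.stack(inputs), np.stack(targets)

# Example: 1000 steps of a 7-variable series, I = 96 history, O = 24 horizon.
X, Y = make_windows(np.random.randn(1000, 7), I=96, O=24)
print(X.shape, Y.shape)  # (881, 96, 7) (881, 24, 7)
```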

Sparse self attentions

Transformers have achieved significant success in various domains, including NLP7, computer vision8, and time series analysis9. However, their quadratic computational complexity limits the input sequence length. Recent studies have tackled this issue by modifying the full attention mechanism. Longformer19 introduces a sparse attention mechanism in which each query is restricted to attending only to keys within a defined (possibly dilated) window, except for global tokens, which interact with the entire sequence. Similarly, BigBird20 proposes a sparse attention mechanism consisting of Random, Local, and Global components. The Random component limits each query to attend to a fixed number of randomly selected keys. The Local component allows each query to attend to keys of nearby neighbors. The Global component selects a fixed number of input tokens to participate in the query-key production process for the entire sequence. In contrast to NLP, where the input consists of word sequences, and computer vision21, where image patches are used, time series tasks involve historical records at multiple time steps. To effectively capture a time series' seasonality, a sufficient historical record length is crucial. For instance, capturing weekly seasonality in MTS data sampled every 10 minutes necessitates approximately \(6 \times 24 \times 7\) time steps. Consequently, applying the Transformer architecture to time series data is impeded by its quadratic computational complexity. To address this challenge, various methods have been proposed. Informer10 introduces a ProbSparse attention mechanism, allowing each key to attend to a limited number of queries. This modification achieves \(\mathcal {O}(I\log I)\) complexity in both computation and memory usage. Autoformer11 proposes an Auto-Correlation mechanism designed to model period-based temporal dependencies; it achieves \(\mathcal {O}(I\log I)\) complexity by selecting only the top \(\log I\) query-key pairs. FEDformer12 introduces frequency-enhanced blocks, including Fourier and Wavelet components, to transform queries, keys, and values into the frequency domain. Its attention mechanism computes a fixed number of randomly selected frequency components from queries, keys, and values, resulting in linear complexity. Both Crossformer14 and PatchTST16 propose patching mechanisms that partition time series data into patches spanning multiple time steps, effectively reducing the total length of the historical sequence.

Methods

This section presents the Dozerformer method, incorporating the Dozer attention mechanism. We first introduce the Dozerformer framework, comprising a transformer encoder-decoder pair and a linear layer designed to forecast the seasonal and trend components of MTS data. Subsequently, we provide a comprehensive description of the proposed Dozer attention mechanism, focusing on eliminating query-key pairs between input and output time steps that do not contribute to accuracy.

Framework

The MTS forecasting task aims to infer values of D variables for future O time steps, \(X_{pred} \in \mathbb {R}^{O\times D}\), from their historical records \(X \in \mathbb {R}^{I\times D}\). Figure 2 illustrates the overall framework of Dozerformer. Initially, it decomposes the MTS data into seasonal and trend components, following previous methods6,11,12,15. Subsequently, the transformer and linear models generate forecasts for the seasonal component \(\textbf{X}_{s}\in \mathbb {R}^{I\times D}\) and the trend component \(\textbf{X}_{t}\in \mathbb {R}^{I\times D}\), respectively.

Figure 2

The architecture of our proposed Dozerformer framework.

The dimension invariant (DI) embedding22 transforms raw single-channel MTS data into multi-channel feature maps while preserving both the time step and variable dimensions. It further partitions the MTS data into multiple patches along the time step dimension, resulting in patched MTS embeddings denoted as \(X_{enc} \in \mathbb {R}^{c \times N_{enc} \times p \times D}\). Here, c represents the number of feature maps, \(N_{enc} = \lceil I/p \rceil\) indicates the number of patches for the encoder, and p represents the patch size. Similarly, the decoder's input, denoted as \(X_{dec} \in \mathbb {R}^{c \times N_{dec} \times p \times D}\) (not shown in Figure 2), is derived from L historical time steps and O zero paddings, where \(N_{dec} = \lceil (L+O)/p \rceil\) indicates the number of patches for the decoder. The transformer model is designed to capture temporal dependencies at multiple time step resolutions (the sub-series level). While the transformer encoder and decoder adhere to the canonical transformer architecture7, they employ the proposed Dozer attention mechanism in place of canonical full attention. A \(1 \times 1\) CNN generates the seasonal component predictions from the latent representations learned by the transformer.
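
As an illustration of the shapes involved, the sketch below implements a DI-style embedding followed by patching under our own assumptions (a \(3 \times 1\) convolution with padding so that both dimensions are preserved); the authors' exact layer configuration may differ.

```python
# A hedged sketch of a dimension-invariant embedding plus patching; the layer
# choice (a 3x1 Conv2d with padding) is an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DIEmbedPatch(nn.Module):
    def __init__(self, c: int = 16, p: int = 24):
        super().__init__()
        self.p = p
        # 1 input channel -> c feature maps; kernel/padding keep (I, D) unchanged.
        self.conv = nn.Conv2d(1, c, kernel_size=(3, 1), padding=(1, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, I, D) -> patched embedding (batch, c, N, p, D), N = ceil(I/p)."""
        b, I, D = x.shape
        h = self.conv(x.unsqueeze(1))          # (b, c, I, D)
        pad = (-I) % self.p                    # zero-pad so the time length divides by p
        h = F.pad(h, (0, 0, 0, pad))           # pad along the time dimension
        N = (I + pad) // self.p
        return h.reshape(b, -1, N, self.p, D)  # (b, c, N, p, D)

emb = DIEmbedPatch(c=16, p=24)
print(emb(torch.randn(8, 96, 7)).shape)  # torch.Size([8, 16, 4, 24, 7])
```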

To predict the trend component, we employ a linear layer for the direct inference of trend forecasts from historical trend components. Subsequently, the forecasts for the seasonal and trend components are aggregated by summation to obtain the final predictions denoted as \(\textbf{X}_{pred}\in \mathbb {R}^{O\times D}\). A detailed step-by-step computation procedure is introduced in Supplementary Sect. 2.
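
The following sketch summarizes this seasonal-trend pipeline (moving-average decomposition, a seasonal branch, a linear trend branch, and summation). The `seasonal_model` placeholder stands in for the Dozerformer encoder-decoder, and the moving-average kernel size of 25 is an assumed value, not one taken from the paper.

```python
# A minimal sketch of the decompose-forecast-sum flow described above.
import torch
import torch.nn as nn

class DecompForecast(nn.Module):
    def __init__(self, I: int, O: int, D: int, kernel: int = 25, seasonal_model=None):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel, stride=1, padding=kernel // 2,
                                 count_include_pad=False)
        self.trend_linear = nn.Linear(I, O)                      # trend: direct linear mapping
        self.seasonal_model = seasonal_model or nn.Linear(I, O)  # placeholder for the transformer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, I, D) -> predictions (batch, O, D)."""
        trend = self.pool(x.transpose(1, 2))           # (batch, D, I) moving average
        seasonal = x.transpose(1, 2) - trend           # residual seasonal component
        y_trend = self.trend_linear(trend)             # (batch, D, O)
        y_seasonal = self.seasonal_model(seasonal)     # (batch, D, O)
        return (y_trend + y_seasonal).transpose(1, 2)  # (batch, O, D)

model = DecompForecast(I=96, O=24, D=7)
print(model(torch.randn(8, 96, 7)).shape)  # torch.Size([8, 24, 7])
```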

Dozer attention

The canonical scaled dot-product attention is as follows:

$$\begin{aligned} \begin{aligned} Q, K, V&={\text {Linear}}\left( \textbf{X}_{enc}^d\right) \\ {\text {Attention}}\left( Q, K, V\right)&= {\text {Softmax}}\left( QK^T/\sqrt{d_k}\right) V \end{aligned} \end{aligned}$$
(1)

where the Q, K, and V represent the queries, keys, and values obtained through embedding from the input sequence of the d-th series, denoted as \(\textbf{X}_{enc}^d \in \mathbb {R}^{c \times N_{enc} \times p}\). It’s noteworthy that we flatten both the feature map and patch size dimensions of \(\textbf{X}_{enc}^d\), resulting in \(\textbf{X}_{enc}^d\in \mathbb {R}^{N_{enc} \times (c\times p)}\), representing the latent representations of patches with a size of p. The scaling factor \(d_k\) denotes the dimension of the vectors in queries and keys. The canonical transformer exhibits two key limitations: firstly, its encoder’s self-attention is afflicted by quadratic computational time complexity, thereby restricting the length of \(N_{enc}\); secondly, its decoder’s cross-attention employs the entire sequence to forecast all future time steps, entailing unnecessary computation for ineffective historical time steps. To overcome these issues, we propose a Dozer Attention mechanism, comprising three components: Local, Stride, and Vary.

Figure 3

The illustration of the Dozer attention mechanism. Upper: The self-attention consists of Local and Stride. Lower: The cross attention consists of Local, Stride, and Vary.

In Fig. 3, we illustrate the sparse attention matrices of Local, Stride, and Vary components employed in the proposed Dozer attention mechanism. Squares shaded in gray signify zeros, indicating areas where the computation of the query-key product is redundant and thus omitted. These areas represent saved computations compared to full attention. The pink-colored squares correspond to the Local component, signifying that each query exclusively calculates the product with keys that fall within a specified window. Blue squares denote the Stride component, where each query selectively attends to keys that are positioned at fixed intervals. The green squares represent the Vary component, where queries dynamically adapt their attention to historical sequences of varying lengths based on the forecasting horizons. As a result, the Dozer attention mechanism exclusively computes the query-key pairs indicated by colored (pink, blue, and green) squares while efficiently eliminating the computation of gray query-key pairs.
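
Conceptually, the colored pattern in Fig. 3 can be viewed as a boolean mask over the attention matrix. The sketch below shows one hedged way to realize such a pattern: OR-combining component masks and setting disallowed query-key scores to negative infinity before the softmax. An actual implementation would more likely skip the masked products altogether to realize the computational savings.

```python
# Illustrative only: apply a combined sparse mask (True = attend) to scaled
# dot-product attention. Each query is assumed to attend to at least one key.
import torch

def masked_attention(Q, K, V, mask):
    """Q: (n_q, d), K, V: (n_k, d), mask: (n_q, n_k) boolean."""
    scores = Q @ K.T / K.shape[-1] ** 0.5               # raw query-key products
    scores = scores.masked_fill(~mask, float("-inf"))   # drop disallowed pairs
    return torch.softmax(scores, dim=-1) @ V            # weighted sum of values

# Example of combining component patterns (each a boolean tensor of equal shape):
# mask = local_mask | stride_mask | vary_mask
```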

Local

The MTS data comprises variables that change continuously and exhibit locality: nearby values along the time dimension display higher correlations than those at greater temporal distances. This characteristic is evident in Fig. 1a, which illustrates the strong locality within MTS data. Building upon this observation, we introduce the Local component, in which each query selectively attends to keys that reside within a specified time window along the temporal dimension. Equation (2) defines the Local component as follows:

$$\begin{aligned} \begin{aligned} A^{self}_{i,j}&= {\left\{ \begin{array}{ll} q_i * k_j, \qquad &{}\text {if } |i-j|\le \lfloor w/2 \rfloor \\ 0, \qquad &{}\text {otherwise} \\ \end{array}\right. } \\ A^{cross}_{i,j}&= {\left\{ \begin{array}{ll} q_i * k_j, \qquad &{}\text {if } 0\le t-j\le \lfloor w/2 \rfloor \\ 0, \qquad &{}\text {otherwise} \end{array}\right. } \\ \end{aligned} \end{aligned}$$
(2)

Where A represents the attention matrix that records the values of production between queries and keys. The superscripts self and cross denote self-attention and cross-attention, respectively. The subscripts i and j represent the locations or time steps of the query vector and key vector along the temporal dimension, respectively. t specifies the time step at which forecasting is conducted, and w represents the window size for the Local component. In the case of self-attention, the Local component enables each query to selectively attend to a specific set of keys that fall within a defined time window. Specifically, each query in self-attention attends to \(2 \times \lfloor w/2 \rfloor +1\) keys, with \(\lfloor w/2 \rfloor\) keys representing the neighboring time steps on each side of the query. For cross-attention, the concept of locality differs since future time steps cannot be utilized during forecasting. Instead, the locality is derived from the last \(\lfloor w/2 \rfloor +1\) time steps, situated at the end of the sequence. These steps encompass the most recent time points up to the moment of forecasting, denoted as time t.
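
A small sketch of the Local pattern, written over patch indices (an assumption on our part): self-attention keeps keys within \(\lfloor w/2 \rfloor\) positions of the query, while cross-attention keeps only the \(\lfloor w/2 \rfloor +1\) most recent encoder positions.

```python
# Hedged sketch of the Local masks in Eq. (2); True marks an attended key.
import torch

def local_self_mask(n: int, w: int) -> torch.Tensor:
    idx = torch.arange(n)
    return (idx[:, None] - idx[None, :]).abs() <= w // 2  # (n, n)

def local_cross_mask(n_q: int, n_k: int, w: int) -> torch.Tensor:
    keys = torch.arange(n_k)
    recent = (n_k - 1 - keys) <= w // 2                   # most recent keys only
    return recent.expand(n_q, n_k)                        # same row for every query

print(local_self_mask(6, 3).int())
```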

Stride

Seasonality is a key characteristic of time series data; to capture this attribute effectively, we introduce the Stride component, in which each query selectively attends to keys positioned at fixed intervals along the temporal dimension. To illustrate this concept, consider a query at time step t, denoted as \(q_t\), and assume that the time series exhibits a seasonality pattern with period interval. Equation (3) defines the Stride component as follows:

$$\begin{aligned} A_{i,j} = {\left\{ \begin{array}{ll} q_i * k_j, \qquad &{}\text {if } |i-j| \bmod s=0 \\ 0, \qquad &{}\text {otherwise} \end{array}\right. } \end{aligned}$$
(3)

In self-attention, the Stride component begins by computing the product between the query \(q_t\) and the key at the same time step, denoted as \(k_t\). It then progressively expands its attention to keys positioned at temporal distances that are multiples of interval time steps from t, encompassing keys such as \(k_{t-interval}\), \(k_{t+interval}\), and so forth, until it spans the entire sequence. For cross-attention, the Stride component identifies time steps within the input sequence \(X_I\) that are separated by multiples of interval time steps, aligning with the seasonality pattern. Hence, the Stride component consistently computes attention scores (query-key products) using these selected keys from the Encoder's I input time steps, yielding a total of \(s = \lfloor I/interval \rfloor\) query-key pairs per query.
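
A corresponding sketch of the Stride pattern, again over patch indices and using `interval` for the assumed seasonal period as in the description above:

```python
# Hedged sketch of the Stride mask in Eq. (3): a query at position i attends to
# keys whose distance from i is a multiple of the seasonal interval.
import torch

def stride_mask(n_q: int, n_k: int, interval: int) -> torch.Tensor:
    qi = torch.arange(n_q)[:, None]
    kj = torch.arange(n_k)[None, :]
    return (qi - kj).abs() % interval == 0  # (n_q, n_k), True = attend

print(stride_mask(8, 8, interval=3).int())
```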

Vary

In the context of MTS forecasting scenarios employing the canonical attention mechanism, predictions for all future time steps are computed through a weighted sum encompassing the entirety of the historical sequence. Regardless of the forecasting horizon, this canonical cross-attention mechanism diligently computes all possible query-key pairs. However, it is important to recognize that harnessing information from the entire historical sequence does not necessarily improve forecasting accuracy, especially in the case of time series data characterized by evolving distributions, as exemplified by stock prices.

To tackle this challenge, we propose the Vary component, which dynamically expands the historical time steps considered as the forecasting horizon extends. We define the initial length of the historical sequence utilized as v. When forecasting a single time step into the future (horizon of 1), the Vary component exclusively utilizes v time steps from the sequence. As the forecasting horizon increases gradually by 1, the history length used also increments by 1, starting from the vary window v until it reaches the maximum input length. Equation (4) defines the Vary component as follows:

$$\begin{aligned} A^{cross}_{i,j} = {\left\{ \begin{array}{ll} q_i * k_j, \qquad &{}\text {if } |i-j|< v+i-t \text { and } i>t \\ 0, \qquad &{}\text {otherwise} \end{array}\right. } \end{aligned}$$
(4)
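
The sketch below builds the Vary pattern as described in the text (rather than transcribing Eq. (4) literally): the query for horizon h attends to the most recent \(v + h - 1\) encoder positions, capped at the history length. Positions are again assumed to be patch indices.

```python
# Hedged sketch of the Vary cross-attention mask: the attended history grows by
# one position per additional forecasting horizon, starting from v positions.
import torch

def vary_cross_mask(n_q: int, n_k: int, v: int) -> torch.Tensor:
    mask = torch.zeros(n_q, n_k, dtype=torch.bool)
    for h in range(1, n_q + 1):           # query forecasting horizon h
        span = min(v + h - 1, n_k)        # history length grows with h
        mask[h - 1, n_k - span:] = True   # most recent `span` keys
    return mask

print(vary_cross_mask(4, 6, v=1).int())
```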

Results

Experimental settings

Datasets

We conduct experiments on nine public benchmarks11, including four ETT datasets (ETTh1, ETTh2, ETTm1, ETTm2), Traffic, Electricity, Weather, Exchange-Rate, and ILI. These are widely used real-world MTS datasets with diverse characteristics. Details are provided in Supplementary Sect. 1.

Comparison methods

We compare our Dozerformer with six transformer-based methods (Informer10, Autoformer11, FEDformer12, Pyraformer13, Crossformer14, PatchTST16), the CNN-based method MICN6, and the linear method Dlinear15.

Evaluation metrics

We utilize Mean Squared Error (MSE) and Mean Absolute Error (MAE) as the evaluation metrics to compare the forecasting accuracy.

Training details

We implemented the proposed Dozerformer in PyTorch and trained it with the Adam optimizer (\(\beta _1=0.9\) and \(\beta _2 = 0.99\)). The initial learning rate is selected from \(\{5\times 10^{-5}, 1\times 10^{-4}, 5\times 10^{-4}, 1\times 10^{-3}\}\) via grid search and updated using the cosine annealing scheme. The number of transformer encoder layers is set to 2 and the number of decoder layers to 1. The look-back window size is selected from \(\{96, 192, 336, 720\}\) via grid search, except for the ILI dataset, for which it is set to 120. The patch size is selected from \(\{24, 48, 96\}\) via grid search. The main results, summarized in Table 1, are averages over six repetitions with seeds 1, 2022, 2023, 2024, 2025, and 2026. The ablation study and parameter sensitivity results are obtained with random seed 1.
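
For reference, a minimal sketch of this optimization setup in PyTorch; the model, epoch count, and batch handling are placeholders rather than the paper's actual training script.

```python
# Hedged sketch: Adam with the reported betas, a grid of candidate learning
# rates, and cosine annealing of the learning rate. `model` is a stand-in.
import torch

lr_grid = [5e-5, 1e-4, 5e-4, 1e-3]               # candidate initial learning rates
model = torch.nn.Linear(96, 24)                  # placeholder for Dozerformer

optimizer = torch.optim.Adam(model.parameters(), lr=lr_grid[1], betas=(0.9, 0.99))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

for epoch in range(20):
    # ... one pass over the training batches would go here ...
    optimizer.step()                             # update parameters
    scheduler.step()                             # cosine-anneal the learning rate
```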

Main results

Table 1 Comparison of quantitative results (MSE & MAE) on nine datasets for forecasting horizons \(O \in \{96, 192, 336, 720\}\) (for ILI, \(O \in \{24, 36, 48, 60\}\)). Bold and underlined highlight the best and second-best results, respectively.

Table 1 presents a quantitative comparison between the proposed Dozerformer and baseline methods across nine benchmarks using MSE and MAE, where bold and underlined entries (per dataset, horizon, and metric) indicate the best and second-best results, respectively. Dozerformer outperformed state-of-the-art baseline methods, achieving the best result in 48 cases and the second-best in 22 cases. Compared to the state-of-the-art baselines, Dozerformer reduced MSE by 0.4% relative to PatchTST16, 8.3% relative to Dlinear15, 20.6% relative to MICN6, and a substantial 40.6% relative to Crossformer14. While PatchTST came closest to our method in performance, it utilizes full attention, making it less efficient in terms of computational complexity and memory usage. We conclude that Dozerformer outperforms recent state-of-the-art baseline methods in terms of accuracy. It is worth mentioning that we conducted the Diebold-Mariano test23,24 to compare Dozerformer, PatchTST, and Dlinear with Informer. The resulting p-values were above 0.05, indicating that the baseline methods remain highly accurate and that the accuracy improvements are not statistically significant. For instance, Figures S1 and S2 in the supplementary files show that all methods accurately predicted both trend and seasonality, with minimal differences between them.

Computational efficiency

We analyze the computational complexity of transformer-based methods and present it in Table 2. The proposed Dozer self-attention achieves linear computational complexity w.r.t. I with a coefficient \((w+s)/p\), where w and s denote the numbers of keys each query attends to through the Local and Stride components, which are small (e.g., \(w\in \{1, 3\}\) and \(s\in \{2, 3\}\)). In contrast, p corresponds to the size of the time series patches and is set to a larger value (e.g., \(p\in \{24, 48, 96\}\)). The coefficient \((w+s)/p\) is therefore consistently smaller than 1 under our experimental settings, highlighting the superior computational complexity of Dozer self-attention.

Table 2 Computational complexity of self-attention and cross-attention. The encoder's input historical sequence length is denoted I, and the decoder's input historical sequence length and forecasting horizon are denoted L and O, respectively. D indicates the variable dimension of the MTS data, and p is the patch size.

To analyze the computational complexity of cross-attention, it is essential to consider the specific decoder design of each method individually. The Transformer computes products between all \(L+O\) queries and I keys, so its complexity is influenced by the inputs of both the encoder and decoder. Informer selects \(\log (L+O)\) queries and I/2 keys (time steps near t), resulting in \(\mathcal {O}(\log (L+O)\cdot I/2)\). Autoformer and FEDformer pad keys with zeros when I is smaller than \(L+O\) and select only the last \(L+O\) keys when I is greater than \(L+O\); as a result, their cross-attention complexity is linear w.r.t. \(L+O\). The Crossformer decoder takes only O zero paddings as input; consequently, its cross-attention attends O/p queries to I/p keys. It is worth noting that Crossformer also applies full attention across the variable dimension, so its complexity is additionally influenced by the variable count D. PatchTST employs only the transformer encoder, making cross-attention irrelevant in this context.

The Local and Stride components of Dozer cross-attention specifically address \((L+O)/p\) queries, calculating the product of each query with w keys within the local window and s keys positioned at fixed intervals from the query. Consequently, their computational complexity is linear with respect to \(L+O\), characterized by the coefficient \((w+s)/p\). The Vary component’s complexity exhibits a quadratic relationship with respect to O, with a coefficient of \(1/(2p^2)\). This quadratic complexity is notably more efficient compared to linear complexity when \(O<2p^2\) (e.g. 1152 when \(p=24\)). Additionally, the Vary component maintains linear complexity concerning O when v is greater than 1, although its coefficient \((v-1)/p\) should be significantly less than 1 for practical efficiency. Throughout all experiments, we consistently set v to 1, rendering \((v-1)O/p\) negligible.

The memory complexity of the Dozer Attention is \(\mathcal {O}((w+s)I/p)\) for self-attention and \(\mathcal {O}((w+s)(L+O)/p + (O/p)^2/2 +(v-1)O/p)\) for cross-attention. It’s worth noting that the analysis of Dozer Attention’s computational complexity is equally applicable to its memory complexity. In conclusion, the Local and Stride components excel in efficiency by attending queries to a limited number of keys. As for the Vary component, it proves to be more efficient than linear complexity in practical scenarios, with its complexity only influenced by forecasting horizons.
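
As a back-of-the-envelope illustration of these terms (our own arithmetic over the expressions above with example hyperparameter values, not figures reported in the paper), the snippet below counts the query-key pairs of full attention versus the Dozer components:

```python
# Illustrative pair counts implied by the complexity terms discussed above.
I, L, O, p = 720, 96, 96, 24      # history, decoder history, horizon, patch size
w, s, v = 3, 3, 1                 # local window, stride keys, vary start size

full_self = (I / p) ** 2                             # full self-attention over I/p patches
dozer_self = (w + s) * I / p                         # Local + Stride, linear in I

full_cross = ((L + O) / p) * (I / p)                 # every decoder patch vs every encoder patch
dozer_cross = (w + s) * (L + O) / p \
    + (O / p) ** 2 / 2 + (v - 1) * O / p             # Local + Stride + Vary terms

print(f"self : {dozer_self:.0f} vs {full_self:.0f} pairs")    # 180 vs 900
print(f"cross: {dozer_cross:.0f} vs {full_cross:.0f} pairs")  # 56 vs 240
```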

We selected ETTh1 as an illustrative example and evaluated Params (total number of learnable parameters), FLOPs (floating-point operations), and Memory (maximum GPU memory consumption). With the historical length set to 720 and a forecasting horizon of 96, Table 3 presents the quantitative results of this efficiency comparison. Dlinear15, characterized by a straightforward architecture that employs a linear layer to generate forecasts directly from historical records, achieved the best efficiency among the methods considered. Notably, the proposed Dozerformer exhibited superior efficiency compared to the other baseline methods based on CNN or transformers. It is essential to highlight that PatchTST16 is the only method that did not significantly lag behind Dozerformer in accuracy; however, its parameter count is six times greater than Dozerformer's, and its FLOPs are 17 times larger.

Table 3 Model complexity results on the ETTh1 dataset for an input length of 720 and a forecasting horizon of 96.

Ablation studies

Local, stride, and vary components ablation

To investigate the individual impact of the Local, Stride, and Vary components, we conducted experiments using only one of these components at a time in the Dozer Attention. The results are summarized in Table 4, which reports the mean value across four horizons (96, 192, 336, and 720) for seed 1. The hyperparameters of each component follow the settings of the main results. Overall, each component working alone performed slightly worse than the full Dozer Attention, showing the effectiveness of each component in capturing the essential attributes of MTS data.

Table 4 Ablation study of three key components of Dozer attention: Local, Stride, and Vary.

Attention mechanism comparison

To demonstrate the effectiveness of the proposed Dozer attention mechanism, we conducted a comparative analysis by replacing it with other attention mechanisms within the Dozerformer framework. Specifically, we compared it with canonical full attention7, Auto-Correlation11, Frequency Enhanced Auto-Correlation12, and ProbSparse attention10. The results are presented in Table 5. The Dozer Attention outperformed the other attention mechanisms, achieving the best results in 22 out of 32 cases. Canonical full attention performed second-best, achieving the best results in 9 cases. Significantly, compared to full attention, the Dozer Attention reduced the average number of query-key pairs to 31.7% for self-attention in the encoder and 28.4% for cross-attention in the decoder, a substantial decrease in computational complexity obtained by eliminating the majority of query-key pairs. Auto-Correlation performed notably worse than the proposed Dozer Attention, with MSE and MAE 20.1% and 13.6% higher, respectively. Frequency Enhanced Auto-Correlation was 13.8% worse in MSE and 8% worse in MAE than the Dozer Attention. ProbSparse attention, while still competitive, performed slightly worse, with 5.7% higher MSE and 2% higher MAE than the Dozer Attention. We conclude that the proposed Dozer attention is as capable as canonical full attention while being significantly more efficient than recent state-of-the-art attention mechanisms.

Table 5 Comparison of Dozerformer incorporating state-of-the-art attention mechanisms.

Parameter sensitivity

Effect of local, stride, and vary size

To investigate the effect of the local size w, stride size s, and vary size v on MTS forecasting accuracy, we conduct experiments and present the results in Fig. 4. These hyperparameters control the sparsity of the attention matrix and consequently affect the efficiency of the Dozerformer in terms of computation and memory usage. A smaller local size w, a smaller vary starting size v, and a larger stride size s result in a sparser attention matrix, contributing to increased efficiency. It is noteworthy that increasing the number of historical time steps utilized in the attention mechanism does not necessarily enhance accuracy. The optimal local size varies with the dataset's characteristics, but we observe higher MSE values as the local size increases and more time steps are utilized. The stride size reaches an optimum of 9 for the ETTh1 and ETTm1 datasets, while it is 1 for the Weather dataset. For the vary size, a small value of 1 yields good performance in terms of both accuracy and efficiency. In summary, the selection of these hyperparameters varies across datasets; however, a sparse attention matrix generally performs better than a denser one.

Figure 4
figure 4

The forecasting results in terms of MSE for different Local, Stride, and Vary values at the horizon of 720.

Effect of look-back window size

The size of the look-back window, denoted as I, impacts primarily the stride component of the Dozerformer, as it determines the model’s receptive field. Figure 5 illustrates the effect of look-back window size on seven methods at horizons 96 and 720. Remarkably, the Dozerformer consistently outperforms other methods for various sequence lengths in the ETTh1, ETTm1, and Weather datasets. The Exchange-Rate dataset is an exception; due to its lack of seasonality, all methods experienced a decline in accuracy with longer input sequences. Notably, an increase in input sequence length does not necessarily lead to improved accuracy for the Dozerformer. In contrast, for the remaining baseline methods, their performance tends to deteriorate as the input sequence length increases. We attribute this trend to longer input sequences incorporating more historical time steps that lack correlation with the forecasting target, thereby degrading the performance of these methods. The Dozerformer, by nature of its sparsity, is less affected, as it naturally excludes uncorrelated time steps from its attention mechanism.

Figure 5
figure 5

The forecasting results in terms of MAE for different look-back window sizes at horizons 96 and 720.

Conclusion

This paper introduced a sequence-adaptive sparse transformer framework named Dozerformer for MTS forecasting. Dozerformer incorporates a sparse Dozer attention mechanism, consisting of Local, Stride, and Vary components, that selectively attends each query to keys based on the characteristics of MTS data and the forecasting horizon. We demonstrated the superior performance of Dozerformer against recent state-of-the-art methods in both accuracy and efficiency. The design of the Local, Stride, and Vary components requires proper selection of their sizes, necessitating an understanding of the dataset's characteristics to set these hyperparameters optimally. For future work, we plan to extend the Dozer attention to model multi-scale temporal dependencies and to adaptively select subsets of historical time steps for the sparse attention mechanism.