Introduction

The detection of anomalies in multivariate time-series data based on spatiotemporal modeling is an emerging research field. It aims to capture spatiotemporal dependencies from massive multivariate time-series data and to achieve more sensitive anomaly detection through richer feature representations. In the real world, spatiotemporal data are present in various domains, including industry, transportation, meteorology, and finance [1]. These data not only exhibit the characteristics of time-series but also encompass diverse aspects such as physical properties and spatial–topological structures, and they are multivariate, with high dimensionality and complexity. Hence, automatic anomaly detection on these massive spatiotemporal data can enhance the efficiency and accuracy of data analysis, reduce the risk of accidents in practical industrial production processes, and bring significant economic and safety value [15]. GTA combines Transformer and GNN in a hierarchical attention mechanism that considers the correlations between spatiotemporal features [16]. Han further advances this approach by integrating sparse autoencoders with graph neural networks and jointly optimizing the reconstruction and prediction tasks [17]. TranAD introduces a model for anomaly detection and diagnosis based on a deep transformer network, which integrates an attention-based sequence encoder to quickly capture broader temporal trends in the data [18]. The above literature models spatiotemporal features separately, obtains representations of future feature behavior through prediction-based approaches, and conducts anomaly detection based on differences between these representations. However, the modeling process overlooks the loss of abnormal behavioral information caused by preprocessing methods such as normalization, and there is still room for improvement in modeling the spatial correlations between features and in exploring spatial constraints to strengthen model inference.

We propose an enhanced abnormal information expression spatiotemporal model for anomaly detection in multivariate time-series (EAIE-AD), which is capable of end-to-end anomaly detection under unsupervised training conditions. Our model performs deep temporal and spatial modeling simultaneously in a parallel manner. Through experiments, we have found that while Transformer and its variants have demonstrated strong modeling capabilities in prediction tasks by optimizing the mean squared error (MSE) and mean absolute error (MAE), this goal does not directly translate into final anomaly detection performance (as shown in Table 4). Hence, our goal in the prediction phase is not solely to optimize these metrics, but to learn more valid representations of anomalous behavior information. In the temporal module, we focus on the non-stationary information that is present in the original data but potentially lost due to normalization, and improve upon the non-stationary Transformer [19] by simplifying its model structure and modifying the object on which the attention mechanism acts. In the spatial module, we learn the feature association graph of multivariate time-series data and, building on GDN [15], expand the homogeneous graph to a heterogeneous graph, which can model the physical characteristics of the features at a finer granularity. On this graph structure, we use two contrastive learning strategies to find beneficial spatial constraints that strengthen our ability to learn representations of anomalous behavior information. The contributions made by our model are summarized as follows:

(1) We propose an end-to-end spatiotemporal anomaly detection model that can simultaneously model feature temporal dependencies and spatial correlations in depth through a parallel architecture and guide the full training of the model under unsupervised conditions by enhancing abnormal information expression.

(2) In the temporal dimension, we compensate for the inherent non-stationary information in the original data so that it is modeled effectively, and mitigate the loss of anomalous information by changing the object on which the attention mechanism operates; in the spatial dimension, we use graph neural networks and contrastive learning to model the physical properties of feature behaviors, from which we extract the hidden expression of anomalous behavior information in the spatial topology.

(3) We achieve state-of-the-art anomaly detection results on multiple datasets, with F1-scores reaching 0.82 and 0.59 on the SWaT and WADI datasets, respectively. We also conduct ablation experiments and data visualization. Finally, we enhance the interpretability of the model by exploring the impact of upstream multivariate time-series prediction tasks on downstream anomaly detection tasks.

Related work

Time-series anomaly detection

The classical methods employed in multivariate time-series anomaly detection are primarily reconstruction-based [20, 21] or prediction-based [22, 23], which, respectively, compress the data representation or model temporal correlation [24]. In addition, dimensionality-reduction methods commonly used in machine learning, such as principal component analysis (PCA), singular value decomposition (SVD), and the autoencoder (AE), have been shown to be effective in assisting anomaly detection. In PCA, anomalies are defined as data points that deviate from the normal data space [25]; in SVD, anomalies are defined as high-dimensional data that cannot be well reconstructed [26]; AE detects anomalies by learning an autoencoder from the data, where the reconstruction error of anomalous data is usually greater than that of normal data [27].

As data dimensionality has grown, researchers have turned their attention to deep learning for anomaly detection, because it can learn features automatically and better model nonlinear relationships [28, 29]. Chen proposed an anomaly detection method for multivariate temporal data that uses a variational autoencoder (VAE) model to learn the latent representation of the data, with the reconstruction error and the Kullback–Leibler (KL) divergence of the latent representation as anomaly metrics [30]. Kong proposed a long short-term memory (LSTM)-based method, using an attention mechanism to assign weights to temporal features [31]. Transformer, with an attention mechanism as its structural core, has enabled a breakthrough in deep time-series prediction. Its point-to-point attention mechanism is suitable for modeling temporal dependencies in time-series, and its stackable encoder and decoder layers are conducive to capturing and aggregating temporal features at different time scales. Hence, improved Transformer-based anomaly detection methods have been proposed. Jeong proposed a self-supervised learning method that uses the Transformer model to learn a representation of multivariate time-series data, and uses the distance between representations to measure how anomalous each data point is [32]. Wiederer used Transformer to explore how the associations of anomalous moments with local and global moments differ, deriving a sensitive criterion for distinguishing anomalies [12].

While Transformer can model the relationships between moments well, information transfer between features at any moment is also important for learning anomalous behavior. The spatial topology formed by this information transfer lies in a non-Euclidean space, and this spatial dependence is difficult to model with conventional neural networks due to the sparsity of the structure [33, 34].

Anomaly detection based on spatiotemporal modeling

Graph neural networks can address the limitations of modeling dependencies in non-Euclidean space, handling the relationships between features and enabling end-to-end learning with good robustness [35, 36], and hence are receiving increasing attention in anomaly detection. GGC-LSTM combines the advantages of graph convolutional neural networks and long short-term memory (LSTM) networks, and can consider both graph structure and time-series information [37]. Zhao used a deep graph convolutional neural network that can adaptively learn the relationships between time-series data with high computational efficiency [38]. GTA models spatiotemporal dependency for multivariate sensor features in IoT systems in a tandem fashion and enhances model inference efficiency through a hierarchical attention mechanism [16].

Deng proposed a general framework for spatiotemporal modeling anomaly detection network (GDN) [15], which has received attention for its excellent results on several realistic datasets. GDN learns a spatial-association graph between sensor features in the absence of a priori knowledge, uses an attention mechanism for information transfer and message updating on the graph structure, and uses the learned information to assist the temporal prediction task, which can more sensitively capture deviations in the future behavior of sensors and improve detection performance. However, there is room for improvement. In the temporal sequence prediction process, the model adopts a traditional normalization strategy, ignoring the inherent non-stationarity of the data, which can be useful in learning behavior patterns at anomalous moments. The feature relationship graph is homogeneous and cannot describe physical characteristics in realistic scenarios well. We propose an end-to-end spatiotemporal anomaly detection model, which enhances abnormal information expression by modeling spatiotemporal dependency to guide the full training of the model.

Methods

We now describe our proposed model. Figure 1 shows a high-level overview of EAIE-AD, an end-to-end spatiotemporal model that incorporates temporal and spatial dependency.

Fig. 1 Model framework

Problem formulation

Assume a collection of multivariate time-series data obtained from \(d\) features at \({T}_{{\text{train}}}\) time stamps, denoted as \(S=\left\{{S}_{1},\dots ,{S}_{{T}_{{\text{train}}}}\right\}, i\in \left\{1,...,{T}_{{\text{train}}}\right\},{ S}_{i} \in {\mathbb{R}}^{d}\). Our model is trained in an unsupervised manner. The training and validation datasets consist of normal data, while the test dataset has both normal and abnormal data. We seek to acquire knowledge about the behavior of the features through the training set and identify any anomalous time stamps within the test set, assigning a label to each time step in the test set, where 0 and 1 denote normal and abnormal, respectively.

Temporal dependency modeling

Our goal is to predict the behavior of features through time-series modeling. While Transformer is popular for temporal prediction due to its powerful long-sequence modeling capability, for downstream anomaly detection, learning enough validly expressed anomalous behavior information is more important than achieving small MSE and MAE values in prediction. Non-stationary information usually carries more expressions of abnormal behavior, but in the traditional Transformer pipeline, operations such as normalization remove some of this information from the input, which may cause over-stationarization and reduce the performance of the attention mechanism.

Without normalization, however, feature scales are nonuniform and there are more noisy points, which reduces prediction performance. Therefore, we keep normalization but compensate for the information it removes and optimize the object of attention. We first slice the original input data \(S\) with a sliding window. For any sample \(S_{t}\), we take the window of historical length \(w\) ending at \(t\) to get \(X = \left[ {X_{1} , \ldots ,X_{w} } \right]^{T} = \left[ {S_{t} , S_{t - 1} , \ldots ,S_{t - w + 1} } \right]^{T} , X \in {\mathbb{R}}^{w \times d} ,t \ge w\). For each time window \(X\), we normalize to obtain \(X{\prime} = \left[ {X_{1}^{\prime} , \ldots ,X_{w}^{\prime} } \right]^{T}\), where

$$ \mu_{x} = \frac{1}{w}\sum\limits_{i = 1}^{w} {X_{i} } ,\quad \sigma_{x}^{2} = \frac{1}{w}\sum\limits_{i = 1}^{w} {\left( {X_{i} - \mu_{x} } \right)^{2} } ,\quad X_{i}^{\prime} = \frac{1}{{\sigma_{x} }} \odot \left( {X_{i} - \mu_{x} } \right) \quad \text{for } i \in \left\{ {1, \ldots ,w} \right\}, $$
(1)

where \({\mu }_{x},{\sigma }_{x} \in {\mathbb{R}}^{d}\), and \(\odot\) denotes the element-wise product. To obtain uniform feature scales, we apply this normalization module to the input time-series. However, normalization reduces the non-stationarity of the data distribution. On the one hand, some anomalous information is lost. On the other hand, the attention mechanism may fail to produce differentiated attention because its input sequence is over-smoothed, which degrades model training performance. Hence, we change the object on which the attention mechanism operates. The standard self-attention mechanism is

$${\text{Att}}\left(Q,K,V\right)={\text{Softmax}}\left(\frac{Q{K}^{{\text{T}}}}{\sqrt{{d}_{k}}}\right)V,$$
(2)

where \(Q,K,V \in {\mathbb{R}}^{w\times {d}_{k}}\) are queries, keys, and values, respectively. \({\text{Softmax}}\) is an exponential normalization function. With the normalization of Eq. 1, each feature variable in the sequence has the same variance, so \({\sigma }_{x}\) can be treated as a scalar. Assuming that the embedding and feedforward layers are linear, \(Q^{\prime} = \left( {Q - 1\mu_{Q}^{{\text{T}}} } \right)/\sigma_{x}\) is the projection of \(X^{\prime}\), where \(\mu_{Q} \in {\mathbb{R}}^{{d_{k} }}\) (\(K^{\prime}\) and \(\mu_{K}\) are defined analogously), and we can obtain

$$ {\text{Softmax}}\left( {\frac{{QK^{{\text{T}}} }}{{\sqrt {d_{k} } }}} \right) = {\text{ Softmax}}\left( {\frac{{\sigma_{x}^{2} Q^{\prime}(K^{\prime} )^{{\text{T}}} + 1\left( {\mu_{Q}^{{\text{T}}} K^{{\text{T}}} } \right) + \left( {Q\mu_{K} } \right)1^{{\text{T}}} - 1\left( {\mu_{Q}^{{\text{T}}} \mu_{K} } \right)1^{{\text{T}}} }}{{\sqrt {d_{k} } }}} \right), $$
(3)

where \(1\in {\mathbb{R}}^{w\times 1}\) is the all-ones vector, \(Q{\mu }_{K}\in {\mathbb{R}}^{w\times 1}\), and \({\mu }_{Q}^{{\text{T}}}{\mu }_{K}\) is a scalar. Because \({\text{Softmax}}(\cdot )\) is invariant to adding the same constant to every entry of a row of its input, the last two terms in Eq. 3 can be dropped, and we have

$$ {\text{Softmax}}\left( {\frac{{QK^{{\text{T}}} }}{{\sqrt {d_{k} } }}} \right) = {\text{ Softmax}}\left( {\frac{{\sigma_{x}^{2} Q^{\prime}(K^{\prime} )^{{\text{T}}} + 1\left( {\mu_{Q}^{{\text{T}}} K^{{\text{T}}} } \right)}}{{\sqrt {d_{k} } }}} \right). $$
(4)
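The step from Eq. 3 to Eq. 4 can be checked numerically. The following is a minimal sketch in plain Python, assuming random \(Q\) and \(K\), a scalar \(\sigma_{x}\), and column means for \(\mu_{Q}\) and \(\mu_{K}\); the helper names and sizes are illustrative and not taken from the paper.

```python
import math
import random

def matmul_t(a, b):
    """Return a @ b^T for matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]

def softmax_rows(m):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def column_mean(m):
    """Mean over rows (time steps), one value per column."""
    return [sum(col) / len(m) for col in zip(*m)]

random.seed(0)
w, d_k, sigma = 4, 3, 2.5  # illustrative sizes and a scalar sigma_x
Q = [[random.gauss(1.0, 2.0) for _ in range(d_k)] for _ in range(w)]
K = [[random.gauss(-0.5, 2.0) for _ in range(d_k)] for _ in range(w)]
mu_q, mu_k = column_mean(Q), column_mean(K)

# Q' = (Q - 1 mu_Q^T) / sigma_x, and likewise for K'
Q_p = [[(q - m) / sigma for q, m in zip(row, mu_q)] for row in Q]
K_p = [[(k - m) / sigma for k, m in zip(row, mu_k)] for row in K]

scale = math.sqrt(d_k)
# Left-hand side of Eq. 4: attention weights from the raw Q and K
lhs = softmax_rows([[v / scale for v in row] for row in matmul_t(Q, K)])

# Right-hand side of Eq. 4: sigma^2 Q'K'^T + 1 (mu_Q^T K^T)
qk_p = matmul_t(Q_p, K_p)
shift = [sum(m * k for m, k in zip(mu_q, k_row)) for k_row in K]  # one value per column j
rhs = softmax_rows([[(sigma ** 2 * qk_p[i][j] + shift[j]) / scale
                     for j in range(w)] for i in range(w)])

max_gap = max(abs(a - b) for ra, rb in zip(lhs, rhs) for a, b in zip(ra, rb))
print(f"largest entrywise difference: {max_gap:.2e}")
```

The two attention matrices should agree up to floating-point rounding, since the dropped terms are constant within each row.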

In this way, we obtain an improved attention calculation that can benefit from the predictability of a stationary sequence while maintaining the inherent temporal correlation of the original. However, the assumption that the embedding and feedforward layers are linear rarely holds in practice, because they usually contain nonlinear activations. To compensate for this nonlinearity, we use a multilayer perceptron (MLP) to learn two compensation factors from \({\sigma }_{x}^{2}\) and \(1\left({\mu }_{Q}^{{\text{T}}}{K}^{{\text{T}}}\right)\), which gives the non-stationary attention formula [19]

$$ {\text{Att}}\left( {Q^{\prime},K^{\prime},V^{\prime}} \right) = {\text{Softmax}}\left( {\frac{{{\text{MLP}}\left( {\sigma_{x}^{2} } \right)Q^{\prime}(K^{\prime} )^{{\text{T}}} + {\text{MLP}}\left( {\mu_{Q}^{{\text{T}}} K^{{\text{T}}} } \right)}}{{\sqrt {d_{k} } }}} \right)V^{\prime}. $$
(5)

To our knowledge, ours is the first work to simplify this attention mechanism and introduce it to multivariate time-series anomaly detection. We can then obtain the values of all features at any moment \(t\) from the historical time-window data by a feedforward neural network (FFN) [11] as

$$ Y_{t} = {\text{FFN}}\left( {{\text{Att}}\left( {Q^{\prime},K^{\prime},V^{\prime}} \right)} \right), $$
(6)
$${\text{FFN}}\left(x\right)=Wx+b,$$
(7)

where \(W\) is the weight matrix and \(b\) is the bias term, \({Y}_{t}\in {\mathbb{R}}^{d}\). Then, we calculate the MSE loss as:

$${\zeta }_{{\text{MSE}}}= \frac{1}{{T}_{{\text{train}}}} \sum_{t \in {T}_{{\text{train}}}}{({Y}_{t}-{S}_{t})}^{2}.$$
(8)
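The windowing and normalization of Eq. 1 and the loss of Eq. 8 can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: time indices are 0-based, the small `eps` is an added assumption to avoid division by zero, and the squared error in Eq. 8 is interpreted as summed over the \(d\) features.

```python
import math

def sliding_windows(series, w):
    """Yield (t, X) with X = [S_t, S_{t-1}, ..., S_{t-w+1}] (0-based t)."""
    for t in range(w - 1, len(series)):
        yield t, [series[t - i] for i in range(w)]

def normalize_window(window, eps=1e-8):
    """Eq. 1: per-feature mean and standard deviation over one window."""
    w, d = len(window), len(window[0])
    mu = [sum(row[j] for row in window) / w for j in range(d)]
    # eps is an assumption (not in the paper) that guards constant features
    sigma = [math.sqrt(sum((row[j] - mu[j]) ** 2 for row in window) / w) + eps
             for j in range(d)]
    normed = [[(row[j] - mu[j]) / sigma[j] for j in range(d)] for row in window]
    return normed, mu, sigma

def mse_loss(preds, targets):
    """Eq. 8: squared error summed over features, averaged over time stamps."""
    total = sum(sum((y - s) ** 2 for y, s in zip(y_t, s_t))
                for y_t, s_t in zip(preds, targets))
    return total / len(targets)

series = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
windows = list(sliding_windows(series, w=3))
normed, mu, sigma = normalize_window(windows[0][1])
```

Returning `mu` and `sigma` alongside the normalized window keeps the statistics available for the compensation terms in Eq. 5.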

Spatial-dependency modeling

Understanding the interconnections and relationships among features allows us to acquire insights through contrastive learning techniques based on the graph structure. This provides useful supervisory information that enhances the learning of abnormal information expression.

Graph structure learning

Following the work of GDN, we treat the \(d\) sensors in the original training data as distinct graph nodes. Each sensor is assigned a randomly initialized \(d_{1}\)-dimensional embedding vector based on its sequence ID, represented as

$$ O_{i} \in {\mathbb{R}}^{{d_{1} }} , \quad \text{for } i \in \left\{ {1, \ldots ,d} \right\}. $$
(9)

We calculate the similarity between sensor representations \(O_{i}\) and \(O_{j}\) for each time stamp

$$ e_{ij} = \frac{{O_{i}^{{\text{T}}} O_{j} }}{{\left\| {O_{i} } \right\| \cdot \left\| {O_{j} } \right\|}}. $$
(10)
$$ A_{ji} = 1\left\{ {j \in {\text{TopK}}\left( {\left\{ {e_{ki} :k \in {\mathcal{C}}_{i} } \right\}} \right)} \right\}. $$
(11)

For a given sensor node \(i\), we select from its candidate relations \({\mathcal{C}}_{i}\) the \(K\) nodes with the highest similarity to it. Here \(A_{ij}\) represents an edge from node \(i\) to node \(j\). This yields the homogeneous graph \(G\left( {O,A} \right)\), which has only one node type and one edge type.
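Equations 10 and 11 can be illustrated with a short sketch. It assumes that the candidate set \({\mathcal{C}}_{i}\) is every sensor other than \(i\), which is one possible choice when no prior knowledge is available; the function names and the toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Eq. 10: cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def topk_adjacency(embeddings, k):
    """Eq. 11: A[j][i] = 1 if j is among the k candidates most similar to i."""
    d = len(embeddings)
    A = [[0] * d for _ in range(d)]
    for i in range(d):
        # Assumed candidate set C_i: all sensors except i itself
        candidates = [j for j in range(d) if j != i]
        candidates.sort(key=lambda j: cosine(embeddings[j], embeddings[i]),
                        reverse=True)
        for j in candidates[:k]:
            A[j][i] = 1
    return A

# Toy embeddings: sensors 0 and 1 point one way, sensors 2 and 3 another
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
A = topk_adjacency(embeddings, k=1)
```

In training, the embeddings would be learned parameters rather than fixed lists, so the graph changes as the embeddings are updated.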

In actual industrial systems, sensors can be placed in one of two categories according to the nature of their work. One category performs control operations, and the other monitors indicators of the working environment; we call these actuators and monitors, respectively, and classify nodes according to these attributes. According to the node classification, we obtain two types of edges, which connect nodes of either the same or different types. This constitutes our heterogeneous association graph \({G}_{0}(O,A,\alpha ,\beta )\), where \(\alpha \) and \(\beta \), respectively, denote node and edge types.
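The construction of edge types from node types can be sketched as below. The sketch assumes that the actuator or monitor label of each sensor is supplied from domain knowledge; the label strings and the function name are illustrative.

```python
def typed_edges(adjacency, node_type):
    """Split directed edges i -> j by whether both endpoints share a type."""
    edges = {"same_type": [], "cross_type": []}
    for i, row in enumerate(adjacency):
        for j, connected in enumerate(row):
            if connected:
                key = "same_type" if node_type[i] == node_type[j] else "cross_type"
                edges[key].append((i, j))
    return edges

# Hypothetical three-sensor system: two actuators and one monitor
adjacency = [[0, 1, 1],
             [1, 0, 0],
             [0, 0, 0]]
node_type = ["actuator", "actuator", "monitor"]
edges = typed_edges(adjacency, node_type)
```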

Graph contrastive learning

Our main goal in graph contrastive learning based on the heterogeneous graph \({G}_{0}\) is to find beneficial spatial constraints. Its two main steps are data augmentation and sampling. We perform an initial spatial embedding of the original data of the graph nodes into the \({d}_{1}\)-dimensional space to obtain a representation of any node in the graph \({G}_{0}\), denoted as \({v}_{i}, {v}_{i}\in {G}_{0}{, v}_{i} \in {\mathbb{R}}^{{d}_{1}}\). It is worth noting that we do not introduce sensor sequence embedding information here, and \({O}_{i}\) is only used to learn the graph structure.

We randomly drop a certain number of edges and node features in the graph \({G}_{0}\) with mask ratio \(\varepsilon\). We perform this operation twice to obtain two new graphs \({G}_{1}\) and \({G}_{2}\). We next perform message aggregation and node updating for the two graphs by GNN, and for any node \({v}_{i}\), we obtain its updated representation as

$${v}_{i}^{l+1}= \sigma \left(\sum_{r\in R}\sum_{j\in {N}_{i}^{r}}\frac{1}{{C}_{i,r}}{W}_{r}^{l}{v}_{j}^{l}+ {W}_{0}^{l}{v}_{i}^{l}\right),$$
(12)

where \({v}_{i}^{l+1}\) is the feature vector of node \(i\) in layer \(l+1\), \(R\) is the set of relations, \({W}_{r}^{l}\) is the weight matrix of relation \(r\) in layer \(l\), \({W}_{0}^{l}\) is the self-connection weight matrix in layer \(l\), \({C}_{i,r}\) is the normalization factor, and \({N}_{i}^{r}\) is the set of nodes whose relation to node \(i\) is \(r\). In simpler terms, during the encoding process for each node, we calculate its feature vector by taking a weighted sum of the previous-layer feature vectors of all the nodes connected to it. The weight applied to each neighbor's feature vector is the weight matrix associated with its relation, and the node's own previous-layer feature vector is added through the self-connection term. This sum is passed through a nonlinear activation function to introduce nonlinearity. Throughout this process, the neighbor contributions are normalized to mitigate the influence of node degrees.
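The edge masking and the update of Eq. 12 can be sketched as follows. Several choices here are assumptions rather than details from the text: each edge is dropped independently with probability equal to the mask ratio, \({C}_{i,r}\) is taken as the number of neighbors under relation \(r\), and ReLU stands in for \(\sigma\). Node-feature masking is omitted for brevity.

```python
import random

def mask_edges(edges, ratio, rng):
    """Drop each edge independently with probability `ratio`."""
    return [e for e in edges if rng.random() >= ratio]

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x for w_ij, x in zip(row, v)) for row in W]

def relational_layer(feats, nbrs, W_rel, W_self):
    """One step of Eq. 12.

    feats:  {node: feature vector}
    nbrs:   {relation: {node: [neighbor nodes]}}
    W_rel:  {relation: weight matrix}; W_self: self-connection matrix
    """
    out = {}
    for i, v_i in feats.items():
        acc = matvec(W_self, v_i)
        for r, by_node in nbrs.items():
            neigh = by_node.get(i, [])
            for j in neigh:
                msg = matvec(W_rel[r], feats[j])
                # Assumed normalization: C_{i,r} = |N_i^r|
                acc = [a + m / len(neigh) for a, m in zip(acc, msg)]
        out[i] = [max(0.0, a) for a in acc]  # ReLU assumed for sigma
    return out

identity = [[1.0, 0.0], [0.0, 1.0]]
feats = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [-3.0, 1.0]}
nbrs = {"same_type": {0: [1]}, "cross_type": {0: [2], 1: [0]}}
updated = relational_layer(feats, nbrs,
                           {"same_type": identity, "cross_type": identity},
                           identity)
```

Calling `mask_edges` twice with independent random draws gives the edge sets of the two augmented views.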

After data augmentation and graph node re-characterization, we perform positive and negative sampling. The traditional graph contrastive learning sampling strategy fixes a randomly chosen node of one view as the anchor point; only the same node in the other view forms a positive sample pair with it, and all other nodes form negative sample pairs. This is repeated over all nodes to obtain the set of all positive and negative sample pairs [39]. While this is intuitive and easy to implement, we found through experiments that such methods have limitations in heterogeneous graphs, where it is difficult to make effective use of the heterogeneous information carried by different node and edge types. To improve the sampling strategy, we sample positive and negative sample pairs in \({G}_{1}\) and \({G}_{2}\), and classify the relationship between negative sample pairs into two categories, unrelated and inconsistent, where unrelated refers to sample pairs that are not directly connected, and inconsistent to those that are connected by an edge but whose node types differ.

We first randomly sample a node in \(G_{1}\) to obtain \(v_{i} \in G_{1} , v_{i} \in {\mathbb{R}}^{{d_{1} }}\). Then, we find the set \({\text{M}}\) of neighboring nodes of node \(i\) in \(G_{2}\) and sample \(v_{j}\) from it. A positive sample pair can be expressed as \(P_{i} = (v_{i} , v_{j})\), with \(v_{i} \in G_{1}\), \(v_{j} \in G_{2}\), and \(v_{j} \in {\text{M}}\), and a negative sample pair as \(N_{i} = (v_{i} , v_{j})\), with \(v_{i} \in G_{1}\), \(v_{j} \in G_{2}\), and \(v_{j} \notin {\text{M}}\). Then, we perform pooling on the sample pairs. For the generalization of the model, we use sum-pooling to obtain \(P_{i} ,N_{i} \in {\mathbb{R}}^{{d_{1} }}\). We repeat sampling \(k_{1}\) times to obtain the sets of positive and negative sample pairs, denoted as \(P = \left\{ {P_{1} , \ldots ,P_{{k_{1} }} } \right\}, P \in {\mathbb{R}}^{{k_{1} \times d_{1} }}\) and \(N = \left\{ {N_{1} , \ldots ,N_{{k_{1} }} } \right\}, N \in {\mathbb{R}}^{{k_{1} \times d_{1} }}\), and the objective function is

$$ \zeta_{1} = - \log \mathop \sum \limits_{i = 0}^{{k_{1} }} \mathop \sum \limits_{j = 0}^{{\theta k_{1} }} \frac{{\exp \left( {P_{i} (P_{j} )^{T} } \right)}}{{\exp \left( {N_{i} (N_{j} )^{T} /\tau } \right)}}, $$
(13)

where \({\uptau }\) is the temperature coefficient. It is worth noting that, so far, the numbers of positive and negative samples are both \({k}_{1}\). However, through experimental analysis, we found that increasing the number of negative samples improves the learning ability of the model, so we introduce a hyperparameter \(\theta \) to control the ratio of negative to positive samples in each group. The above sampling strategy is based on unrelated node structures.
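The structure-based sampling and sum-pooling described above can be sketched as follows. The sketch assumes that \(\theta\) negatives are drawn for each positive and that an anchor must have at least one neighbor and one non-neighbor in \(G_{2}\); the objective of Eq. 13 is not implemented here, and all names are illustrative.

```python
import random

def sum_pool(u, v):
    """Pool a pair of node vectors into one vector by element-wise sum."""
    return [a + b for a, b in zip(u, v)]

def sample_structure_pairs(g1, g2, nbrs2, k1, theta, rng):
    """Sample pooled positive and negative pairs by connectivity.

    g1, g2: {node: feature vector} for the two augmented views
    nbrs2:  {node: [neighbors in G2]}, the set M for each anchor
    """
    nodes = sorted(g1)
    # Anchors need at least one neighbor and one non-neighbor (assumption)
    eligible = [i for i in nodes
                if nbrs2.get(i) and len(nbrs2[i]) < len(nodes) - 1]
    P, N = [], []
    for _ in range(k1):
        i = rng.choice(eligible)
        M = set(nbrs2[i])
        non_neighbors = [j for j in nodes if j != i and j not in M]
        P.append(sum_pool(g1[i], g2[rng.choice(sorted(M))]))
        for _ in range(theta):  # theta negatives per positive (assumption)
            N.append(sum_pool(g1[i], g2[rng.choice(non_neighbors)]))
    return P, N

view = {0: [1, 0], 1: [0, 1], 2: [1, 1], 3: [2, 2]}
nbrs2 = {0: [1], 1: [0], 2: [3], 3: [2]}
P, N = sample_structure_pairs(view, view, nbrs2, k1=3, theta=2,
                              rng=random.Random(0))
```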

The second sampling strategy is based on the inconsistency of node attributes. The first strategy focuses on edge characteristics, that is, on whether two nodes are connected, as the basis for sampling positive and negative pairs in the two graphs. The second strategy focuses on node characteristics: among nodes that are already connected, it samples nodes of the same or different types on the two graphs as positive and negative sample pairs. Specifically, we randomly sample a node \(v_{i}^{\prime} \in G_{1}\) and find the set \({\text{M}}^{\prime} \) of its neighboring nodes in \(G_{2}\), from which a node with the same node type is randomly selected to form a positive sample pair. As described for graph structure learning, our graph divides the nodes into actuator and monitor types, and we denote the sets composed of the two types of sensor nodes, respectively, as \({\mathcal{A}} = \left\{ {v_{i}^{\prime } | {\text{type}}\left( {v_{i}^{\prime } } \right) = {\text{actuator}}} \right\}, {\mathcal{B}} = \left\{ {v_{i}^{\prime } | {\text{type}}\left( {v_{i}^{\prime } } \right) = {\text{monitor}}} \right\}\). A positive sample pair can be expressed as \(P_{i}^{\prime} = (v_{i}^{\prime} , v_{j}^{\prime})\), with \(v_{i}^{\prime} \in G_{1}\), \(v_{j}^{\prime} \in {\text{M}}^{\prime}\), and \({\text{type}}\left( {v_{i}^{\prime} } \right) = {\text{type}}\left( {v_{j}^{\prime} } \right)\), and a negative sample pair as \(N_{i}^{\prime} = (v_{i}^{\prime} , v_{j}^{\prime})\), with \(v_{i}^{\prime} \in G_{1}\), \(v_{j}^{\prime} \in {\text{M}}^{\prime}\), and \({\text{type}}\left( {v_{i}^{\prime} } \right) \ne {\text{type}}\left( {v_{j}^{\prime} } \right)\). Then, we perform sum-pooling on the sample pairs to obtain \(P_{i}^{\prime} ,N_{i}^{\prime} \in {\mathbb{R}}^{{d_{1} }}\). We repeat sampling \(k_{1}\) times, obtain the respective sets of positive and negative sample pairs \(P^{\prime} = \left\{ {P_{1}^{\prime} , \ldots ,P_{{k_{1} }}^{\prime} } \right\},P^{\prime} \in {\mathbb{R}}^{{k_{1} \times d_{1} }}\) and \(N^{\prime} = \left\{ {N_{1}^{\prime} , \ldots ,N_{{k_{1} }}^{\prime} } \right\}, N^{\prime} \in {\mathbb{R}}^{{k_{1} \times d_{1} }}\), and obtain the objective function

$$ \zeta_{2} = - \log \mathop \sum \limits_{i = 0}^{{k_{1} }} \mathop \sum \limits_{j = 0}^{{\theta k_{1} }} \frac{{\exp \left( {P_{i}^{\prime} (P_{j}^{\prime} )^{T} } \right)}}{{\exp \left( {N_{i}^{\prime} (N_{j}^{\prime} )^{T} /\tau } \right)}}. $$
(14)

In this way, we obtain the spatial-dependency modeling objective function

$$ \zeta_{{{\text{spatio}}}} = { }\zeta_{1} + \zeta_{2} , $$
(15)

which introduces a constraint on our temporal prediction task, thereby yielding the comprehensive objective function for our model

$$ {\mathcal{L}} = \zeta_{{{\text{MSE}}}} + \zeta_{{{\text{spatio}}}} . $$
(16)
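The second, type-based sampling strategy of this section can be sketched in the same style. The sketch assumes one negative per positive and requires each anchor to have both a same-type and a different-type neighbor in \(G_{2}\); names and toy values are hypothetical, and the loss terms are left out.

```python
import random

def sum_pool(u, v):
    """Pool a pair of node vectors into one vector by element-wise sum."""
    return [a + b for a, b in zip(u, v)]

def sample_type_pairs(g1, g2, nbrs2, node_type, k1, rng):
    """Sample pooled pairs among connected nodes, split by node type."""
    def split(i):
        same = [j for j in nbrs2.get(i, []) if node_type[j] == node_type[i]]
        diff = [j for j in nbrs2.get(i, []) if node_type[j] != node_type[i]]
        return same, diff

    # Anchors need neighbors of both kinds (assumption)
    eligible = [i for i in sorted(g1) if all(split(i))]
    P, N = [], []
    for _ in range(k1):
        i = rng.choice(eligible)
        same, diff = split(i)
        P.append(sum_pool(g1[i], g2[rng.choice(same)]))  # connected, same type
        N.append(sum_pool(g1[i], g2[rng.choice(diff)]))  # connected, other type
    return P, N

view = {0: [1, 0], 1: [0, 1], 2: [5, 5], 3: [7, 7]}
node_type = {0: "actuator", 1: "actuator", 2: "monitor", 3: "monitor"}
nbrs2 = {0: [1, 2], 1: [0], 2: [3, 0], 3: [2]}
P2, N2 = sample_type_pairs(view, view, nbrs2, node_type, k1=4,
                           rng=random.Random(1))
```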

Anomaly detection

Having integrated spatial constraints into our prediction framework for sensor behavior, we compute an anomaly score to provide an explanation for anomalous behavior [15]. We calculate the discrepancy between the predicted and observed values of a sensor \(i\) at each time stamp \(t\) within the test set

$${E}_{t,i}=\left|{Y}_{t,i}-{S}_{t,i}\right|, t\in {T}_{{\text{test}}},i\in \left\{1,...,d\right\},$$
(17)

where \({T}_{{\text{test}}}\) denotes all the time stamps in the test set, \({Y}_{t,i}\in {Y}_{t}\). Considering the varying sensitivities among the sensors, we apply robust normalization to the calculated deviations

$${\psi }_{t,i}= \frac{{E}_{t,i}-{\mu }_{i}}{{\sigma }_{i}},$$
(18)

where \({\mu }_{i}\) and \({\sigma }_{i}\) are the median and inter-quartile range, respectively. Subsequently, we employ the maximum function to aggregate the anomaly scores of all sensors at time \(t\), yielding the time stamp anomaly score \({\psi }_{t}=\max_{i}{\psi }_{t,i}\). If this score surpasses the predefined threshold, we classify time \(t\) as anomalous. The way the thresholds are chosen can be optimized differently depending on the direction and distance [40], but to ensure fairness with baseline experiments, we use the maximum value of the system anomaly score at all moments in the validation set as the threshold [15, 18]. It is important to note that our training and validation sets consist solely of normal sample data, and only the test set contains abnormal samples. This is an important condition for us to use unsupervised training and to set thresholds.
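The scoring and thresholding steps can be sketched as follows. The sketch assumes that the median and inter-quartile range are computed per sensor over the time stamps being scored, uses the inclusive quartile method of the standard library, and adds a small `eps` to guard against a zero inter-quartile range; these are illustrative choices.

```python
import statistics

def anomaly_scores(pred, obs, eps=1e-8):
    """Eqs. 17-18 followed by a max over sensors for each time stamp."""
    n_steps, d = len(obs), len(obs[0])
    per_sensor = []
    for i in range(d):
        err = [abs(pred[t][i] - obs[t][i]) for t in range(n_steps)]  # Eq. 17
        med = statistics.median(err)
        q1, _, q3 = statistics.quantiles(err, n=4, method="inclusive")
        # Eq. 18; eps is an assumption to avoid division by zero
        per_sensor.append([(e - med) / (q3 - q1 + eps) for e in err])
    return [max(per_sensor[i][t] for i in range(d)) for t in range(n_steps)]

def detect(test_scores, val_scores):
    """Label 1 where the score exceeds the largest validation score."""
    threshold = max(val_scores)
    return [1 if s > threshold else 0 for s in test_scores]

obs = [[0.0], [0.0], [0.0], [0.0], [0.0]]
pred = [[1.0], [2.0], [3.0], [4.0], [100.0]]
scores = anomaly_scores(pred, obs)
labels = detect([0.1, 5.0, 0.3], val_scores=[0.2, 0.4])
```

Because the threshold is the maximum validation score, it is only meaningful when the validation set contains no anomalies, as stated above.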

Experiment

We performed experiments and conducted quantitative and qualitative analyses to compare our method with baseline approaches.

Dataset: The scarcity of high-dimensional series data originating from real-world industrial systems, incorporating anomalous instances, poses a challenge. However, there are two extensively employed cyber-physical systems (CPS) datasets available for research in time-series anomaly detection. These datasets, Secure Water Treatment (SWaT) and Water Distribution (WADI) [1], were generated and released by the iTrust Center for Research in Cybersecurity at the Singapore University of Technology and Design. Details are shown in Table 1.

Table 1 Details of SWaT and WADI datasets

Baselines: As our model is designed for anomaly detection based on multivariate time-series forecasting, our baselines fall into two categories. The first comprises outstanding work focused on detecting anomalies in multivariate time-series data, and provides a direct comparison for our model’s performance. The second category consists of transformer models that have recently demonstrated excellent performance in multivariate time-series forecasting. The baselines are as follows:

PCA [41]: Discovers a low-dimensional projection that effectively captures the majority of variance present in the data. The anomaly score, in this context, refers to the reconstruction error associated with this projection;

KNN [42]: Employs the distance between each data point and its \(k\)-th nearest neighbor as an anomaly score;

DAGMM [43]: Combines deep autoencoders and a Gaussian mixture model to generate a low-dimensional representation and reconstruction error for each observation;

AE [44]: Consisting of an encoder and a decoder, reconstructs data samples, utilizing the reconstruction error as a metric for detecting anomalies;

LSTMVAE [45]: To leverage the advantages of both LSTM and VAE, the feedforward network in a VAE is replaced by LSTM, allowing for the computation of the reconstruction error, which serves as an error score;

Mad-GAN [21]: By employing generative adversarial networks (GANs) in conjunction with a reconstruction-based approach, error scores are computed for each sample;

GDN [15]: Can capture both spatial and temporal dependencies, representing multivariate time-series data as graphs, and utilizing GNNs to learn the representations of nodes and edges within them. Learned representations are fed into a sensor future behavior prediction module, which enables the detection of anomalies in time-series data;

TranAD [18]: Built on a self-attention mechanism, it incorporates self-conditioning based on focus scores for robust multi-modal feature extraction. The model employs adversarial training for stability and integrates reconstruction loss for anomaly detection;

Informer [46]: Employing a multilayer Transformer architecture, this model enhances the weight calculation method for attention, and incorporates techniques such as time-varying positional encoding and length masking, which enable efficient processing of long sequences and accurate predictions across multiple time steps;

Autoformer [47]: This adaptive transformer model introduces an adaptive feature selection module and adaptive transformation module to dynamically learn the crucial features and transformation methods of time-series data, so as to enhance the accuracy and generalization of sequence prediction;

Non-stationary Transformer (Nsformer) [19]: Designed for non-stationary time-series data, a progressive learning mechanism allows for adaptive learning of the dynamic nature of a sequence. Information from both historical and future data is leveraged during the prediction process, resulting in improved accuracy of sequence prediction.


Evaluation metrics: To ensure generalizability and fairness, we chose the evaluation indicators used in the literature, namely precision, recall, and F1-score [15, 21]:

$${\text{Pre}}=\frac{{\text{TP}}}{{\text{TP}}+{\text{FP}}}$$
(19)
$${\text{Rec}}=\frac{{\text{TP}}}{{\text{TP}}+{\text{FN}}}$$
(20)
$${F}_{1}=2\times \frac{{\text{Pre}}\times {\text{Rec}}}{{\text{Pre}}+{\text{Rec}}},$$
(21)

where TP is the number of correctly detected anomalies, FP is the number of falsely detected anomalies, TN is the number of correctly assigned normal samples, and FN is the number of anomalies falsely assigned as normal.
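Equations 19 to 21 can be computed from point-wise labels as in the following sketch. Returning 0.0 when a denominator is zero is an added convention, not something specified in the text.

```python
def precision_recall_f1(y_true, y_pred):
    """Eqs. 19-21 from binary labels (1 = anomaly, 0 = normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Zero-denominator handling is an assumption for the degenerate cases
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return pre, rec, f1

# Two true positives, one false positive, one false negative
pre, rec, f1 = precision_recall_f1([0, 1, 1, 0, 1], [0, 1, 0, 1, 1])
```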


Implementation: We used the PyTorch-1.8.1 library to train all the models, and split the training time-series into 90% training data and 10% validation data. We used the Adam optimizer with a learning rate of 0.01 and trained for 10 epochs. Some important hyperparameters are as follows: in the temporal module, the sliding window length was \(w = 15\), the number of transformer encoder layers was \(L = 3\), the number of heads was 4, and \(d_{k} = 64\); in the spatial module, \(\varepsilon = 0.2\), \(d_{1} = 64\), \(K = 20\), \(k_{1} = 10\), \(\theta = 5\), and \(\tau = 0.25\).

Research question 1: anomaly detection performance

We present the anomaly detection performance of our model and the baseline approaches in Table 2, in terms of precision, recall, and F1-score on the SWaT and WADI datasets. The results indicate that our model outperforms the baselines in terms of recall and F1-score on both datasets, achieving F1-scores of 0.82 and 0.59 for SWaT and WADI, respectively. While the GDN baseline achieves higher precision scores, the trade-off between precision and recall is inevitable. In practice, maintenance technicians with domain expertise tend to prioritize high sensitivity over specificity to avoid missing any critical events worthy of future reference [

Fig. 2 Attention visualization on WADI dataset

To further illustrate the ability of the error scores constructed by our model to discriminate anomalies, we plot the distributions of anomaly scores for positive and negative samples. Because GDN offers a comparable balance of precision and recall, we also plot the distribution of GDN anomaly scores for comparison, as shown in Fig. 3. Compared to GDN, our model obtains a better-separated distribution of normal and abnormal data, especially in the SWaT dataset, where error scores of normal data remain low and concentrated, indicating that our model can more effectively separate normal from abnormal embeddings. In the WADI dataset, there are still many normal sample points with excessive error scores, and the presence of these noisy points is one of the main reasons why detection performance on the WADI dataset is inferior to that on SWaT.

Fig. 3 Sample error score distribution

To explore the sensitivity of the hyperparameters in our model, we conducted tests on six important hyperparameters, as shown in Fig. 4: the sliding window length (\(w\)), the number of transformer encoder layers (\(L\)), the number of node neighbors (\(K\)), the graph embedding dimension (\(d_{1}\)), the number of contrastive learning samples (\(k_{1}\)), and the sampling ratio (\(\theta\)). Many hyperparameters have multiple optimal values. Considering the model inference speed, we chose relatively small values for all hyperparameters. It is also worth noting that the two most influential parameters are \(K\) and \(\theta\). As for \(K\), each node needs enough neighbors for information transfer in the graph to capture useful information, but when the number of neighbors exceeds 20, over-smoothing occurs, and the flow of information in the whole graph tends toward that of a fully connected graph. As for \(\theta\), when \(\theta\) is too small, the difference between positive and negative samples cannot be fully captured and the model underfits, and when \(\theta\) is too large, a large amount of noise is learned and the inference efficiency is significantly reduced. Finally, the time complexity and parameter details of each module are shown in Table 5.

Fig. 4 Hyperparameter sensitivity experiments

Table 5 Time complexity and parameter details of each module