Introduction

Anomaly detection serves as a critical technique for identifying anomalous patterns in voluminous datasets, holding particular relevance in the analysis of imaging data. This technology finds applications in diverse domains, including but not limited to medical diagnosis [2, 3], plant healthcare [4], surveillance video [5], and disaster detection [6, 7]. Recent advancements in deep learning have propelled a surge of scholarly interest in the development of automated anomaly detection methods for expansive image datasets. Based on machine learning research, these techniques can be categorized into three primary classes: supervised, semi-supervised, and unsupervised methodologies. Despite each approach’s unique merits and limitations, the predominant challenge is the efficient identification of anomalies based on a limited number of anomalous instances.

Convolutional neural networks (CNNs) represent a prevalent architecture in the landscape of computer vision, offering robust solutions for tasks such as image recognition and segmentation. Utilizing substantial labeled datasets, CNNs have achieved state-of-the-art performance in real-world image anomaly detection applications [6, 8]. Nonetheless, CNN-based anomaly detectors frequently grapple with the scarcity of labeled instances and a low incidence of anomalies. Several studies have developed strategies to ameliorate these constraints, including the incorporation of active learning [9] and the deployment of transfer learning [6] to enhance the learning efficiency of CNNs.

Unsupervised learning methods have achieved wide acceptance in the domain of anomaly detection, primarily because they eliminate the need for labeled anomalous samples during the training phase. A conventional approach in unsupervised image anomaly detection relies on deep convolutional auto-encoders to reconstruct normal images [10]. However, these auto-encoders sometimes falter in the precise reconstruction of fine structures, leading to the generation of excessively blurry images. To counter this limitation, Generative Adversarial Networks (GANs) have been introduced into the field. AnoGAN [11] pioneered the application of GANs for image anomaly detection. Moreover, AnoGAN has been adapted to assess color reconstructability, thereby enabling the sensitive detection of color anomalies [12]. In unsupervised anomaly detection, it is common practice to quantify the deviation between the original image and the reconstructed image as the Anomaly Score.

Although unsupervised anomaly detectors eliminate the need for labeling anomalous instances during training, they have certain shortcomings. First, these detectors are susceptible to overlooking subtle and minute anomalies because the Anomaly Score is predicated upon the distance between the normal and test images. The effectiveness of unsupervised anomaly detectors is therefore contingent upon the robust formulation of an Anomaly Score for specific objectives. Second, the threshold of the Anomaly Score must be carefully tuned to classify normal and anomalous instances accurately. This calibration frequently entails a laborious process of trial and error.
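The score-and-threshold practice described above can be sketched as follows. This is a minimal NumPy illustration with toy images and a purely hypothetical threshold, not the implementation of any cited work:

```python
import numpy as np

def anomaly_score(original, reconstructed):
    """Per-image Anomaly Score: mean squared reconstruction error."""
    return float(np.mean((original - reconstructed) ** 2))

def classify(score, threshold):
    """Return 1 (anomalous) when the score exceeds the threshold, else 0 (normal)."""
    return int(score > threshold)

# Toy grayscale images in [0, 1]; the threshold 0.05 is purely illustrative.
rng = np.random.default_rng(0)
image = rng.random((28, 28))
recon_faithful = image + 0.01 * rng.standard_normal((28, 28))  # faithful reconstruction
recon_failed = image + 0.5 * rng.standard_normal((28, 28))     # failed reconstruction

s_normal = anomaly_score(image, recon_faithful)
s_anomalous = anomaly_score(image, recon_failed)
```

The need to hand-tune `threshold` for each dataset is precisely the calibration burden noted above.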

Recent advancements in visual attention mechanisms have garnered considerable traction in computer vision [13]. The attention branch network (ABN) incorporates a branching structure termed the Attention Branch [14]. The attention maps from this branch serve as visual explanations of the decision-making process within a CNN. These attention maps have been demonstrated to improve CNN performance across various image classification tasks.

By utilizing contrastive learning, the visual attention mechanism has realized robust prediction for imbalanced data in anomaly detection [15]. This research raises the intriguing prospect of integrating visual attention into image anomaly detection schemes. Nonetheless, existing visual attention modules, including ABN, predominantly rely on self-attention mechanisms [16, 17]. Consequently, the quality of attention in these modules is intrinsically linked to the network’s overall performance, thereby limiting their direct applicability in enhancing the performance of anomaly detectors.

In a preceding study, the layer-wise external attention network (LEA-Net) was introduced to enhance CNN-based anomaly detection through the incorporation of an external attention mechanism. The external attention mechanism leverages prior knowledge from external sources; LEA-Net utilizes the outputs of another pre-trained network. As discussed, unsupervised and supervised anomaly detectors each have their own problems. To address these problems, LEA-Net consolidates supervised and unsupervised anomaly detection algorithms through the lens of a visual attention mechanism. The burgeoning advancements in visual attention mechanisms suggest the feasibility of leveraging prior knowledge in anomaly detection. The strategies described in [1] include the following:

  • The pre-existing knowledge concerning anomalies is articulated through an anomaly map constructed via the unsupervised learning of normal instances.

  • Subsequently, this anomaly map is transformed into an attention map by an auxiliary network.

  • The attention map is then incorporated into the intermediate layers of the anomaly detection network (ADN).
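The third step can be sketched minimally as follows, assuming the residual modulation form \(f' = f \cdot (1 + A)\) used by ABN-style attention modules; the shapes and function names are illustrative, not the actual LEA-Net code:

```python
import numpy as np

def apply_external_attention(feature_map, attention_map):
    """Modulate an intermediate feature map with an externally supplied attention map.
    feature_map: (C, H, W); attention_map: (H, W) with values in [0, 1].
    The residual form f' = f * (1 + A) lets attention emphasize regions
    without ever fully suppressing the original features."""
    assert feature_map.shape[1:] == attention_map.shape
    return feature_map * (1.0 + attention_map[None, :, :])

features = np.ones((4, 8, 8))       # intermediate ADN features (C=4, H=W=8)
attention = np.zeros((8, 8))
attention[2:4, 2:4] = 1.0           # attention map highlighting an anomalous region
out = apply_external_attention(features, attention)
```

Here the attended region is doubled in magnitude while unattended features pass through unchanged, which is the sense in which the mechanism injects prior knowledge without overwriting the ADN's own representation.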

In line with this strategy, the effectiveness of layer-wise external attention in image anomaly detection was assessed through comprehensive experiments utilizing publicly accessible, real-world datasets. The findings revealed that layer-wise external attention reliably enhanced the performance of anomaly detectors, even with limited data. Further, the results suggested that the external attention mechanism can synergistically operate with the self-attention mechanism to enhance anomaly detection capabilities.

Although the external attention mechanism holds considerable promise for setting a new paradigm in image anomaly detection, its effectiveness depends on the judicious selection of the intermediate layer equipped with external attention. To illustrate how the layer-wise external attention mechanism improves anomaly detection performance, we conducted a series of more in-depth experiments. The principal contributions of this research are as follows:

  • We introduced an embedding-based approach, the Patch Distribution Modeling Framework (PaDiM) [18], to generate an anomaly map in addition to the reconstruction approaches.

  • We comparatively analyzed the performance of LEA-Net with that of baseline models under various conditions to clarify the modes through which external attention improves the detection performance of CNN.

  • We discerned that the presence of well-localized positional features on an anomaly map is instrumental in successfully implementing layer-wise external attention.
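The PaDiM-based anomaly map in the first contribution rests on modeling the normal embeddings at each patch position as a multivariate Gaussian and scoring test patches by Mahalanobis distance. The following is a toy NumPy sketch of that core idea (illustrative dimensions, a small regularizer added to each covariance; PaDiM's dimensionality reduction by random selection is omitted):

```python
import numpy as np

def fit_patch_gaussians(embeddings):
    """embeddings: (N, P, D) patch embeddings from N normal images at P positions.
    Returns the per-position mean (P, D) and inverse covariance (P, D, D)."""
    n, p, d = embeddings.shape
    mean = embeddings.mean(axis=0)
    inv_cov = np.empty((p, d, d))
    for i in range(p):
        # Small regularizer keeps the covariance invertible, as in PaDiM.
        cov = np.cov(embeddings[:, i, :], rowvar=False) + 0.01 * np.eye(d)
        inv_cov[i] = np.linalg.inv(cov)
    return mean, inv_cov

def padim_anomaly_map(test_emb, mean, inv_cov):
    """test_emb: (P, D). Mahalanobis distance of each patch to its Gaussian."""
    diff = test_emb - mean
    return np.sqrt(np.einsum('pd,pde,pe->p', diff, inv_cov, diff))

rng = np.random.default_rng(1)
train_emb = rng.standard_normal((50, 16, 4))  # 50 normal images, 16 positions, D=4
mean, inv_cov = fit_patch_gaussians(train_emb)
test_emb = rng.standard_normal((16, 4))
test_emb[5] += 5.0                            # inject an anomaly at position 5
amap = padim_anomaly_map(test_emb, mean, inv_cov)
```

The resulting per-position distances, reshaped to the spatial grid, form the anomaly map fed to LEA-Net.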

Related Work

A more straightforward methodology was employed for the automated detection of thyroid nodule lesions in X-ray computed tomography images [19]. This technique leverages binary segmentation results acquired from a U-Net as input for supervised image classifiers. The authors demonstrated that such preprocessing via binary segmentation significantly enhances anomaly detection accuracy in practical applications. Similarly, the convolutional adversarial variational autoencoder with guided attention (CAVGA) employs anomaly maps in a weakly supervised setting to localize anomalous areas [20]. Through empirical evaluations using the MVTec AD dataset, CAVGA achieved state-of-the-art performance. Both studies substantiate the considerable promise of incorporating visual attention maps in image anomaly detection.

The concept of visual attention pertains to the selective refinement or amplification of image features for recognition tasks. The human perceptual system prioritizes information germane to the task over comprehensive data processing [21, 22]. Visual attention mechanisms emulate this human faculty in the context of image classification.

The anomaly maps were sized \(28\times 28\), whereas 1000 embedding vectors with randomly selected dimensions were incorporated.

The parameters of LEA-Net, including the AAN (ResNet-based) and the ADN (ResNet-18), were optimized using the Adam optimizer with a learning rate of 0.0001. The momentum parameters of Adam were fixed at \(\beta _1 = 0.9\) and \(\beta _2 = 0.999\). A total of 100 epochs were used to update these parameters, with the batch size maintained at 16. Computational tasks were executed on a system equipped with a GeForce RTX 2080 Ti GPU, running Python 3.10.12 and CUDA 11.8.89.
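The Adam hyperparameters above correspond to the standard update rule, sketched below for a single parameter in NumPy purely for illustration (this is not the training code; in practice a framework optimizer would be used):

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with the hyperparameters reported in the text."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

p = np.array([1.0])
m, v = np.zeros(1), np.zeros(1)
p, m, v = adam_step(p, np.array([2.0]), m, v, t=1)
```

With bias correction, the very first step has magnitude approximately equal to the learning rate, regardless of the gradient scale.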

Comparison of Supervised Networks and LEA-Net

The primary objective of LEA-Net is to augment the performance of the baseline network, which is trained in a purely supervised fashion in the realm of anomaly detection. To assess the efficacy of the external attention mechanism, we conducted comparative analyses on image-level anomaly detection performance across several models: (i) ResNet-18 as the baseline, (ii) ResNet-50, (iii) LEA-Net informed by anomaly maps generated through a color reconstruction task, denoted as LEA-Net (Color Reconstruction), (iv) LEA-Net guided by anomaly maps generated through auto-encoding, referred to as LEA-Net (Auto-Encoding), and (v) LEA-Net shaped by anomaly maps generated based on PaDiM, labelled as LEA-Net (PaDiM).

For each of these models (i)–(v), the network output threshold was fixed at 0.5 to facilitate the computation of \(F_1\) scores. Figure 5 reveals an enhancement in \(F_1\) scores for the baseline model (ResNet-18) due to the implementation of the external attention mechanism. The horizontal axis demarcates the categories of datasets employed in the experiments, while the bars signify the average \(F_1\) scores ascertained through cross-validation. We report only the maximal \(F_1\) score among the five selected attention points. Additionally, error bars represent the standard deviation, and bars corresponding to the highest average \(F_1\) score in each category are marked with a black inverted triangle.
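The fixed-threshold \(F_1\) computation can be sketched in plain Python (the probabilities below are illustrative; anomalous is the positive class):

```python
def f1_score_at_threshold(probs, labels, threshold=0.5):
    """Image-level F1: probs are network outputs in [0, 1], labels use 1 = anomalous."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative outputs: one true positive, one false positive, one false negative.
f1 = f1_score_at_threshold([0.9, 0.2, 0.7, 0.4], [1, 0, 0, 1])
```

Fixing the threshold at 0.5 makes the comparison depend only on the networks' calibrated outputs rather than on per-dataset threshold tuning.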

As indicated in Fig. 5, the external attention mechanism substantially elevates the baseline model’s performance across all datasets. Most notably, the \(F_1\) scores in MVTec AD’s carpet, tile, and wood categories witnessed an average improvement of approximately \(14.3\%\). Interestingly, ResNet-50 underperformed compared to ResNet-18 in specific instances, such as the carpet category. Furthermore, the parameter counts for ResNet-18, ResNet-50, and LEA-Net are \(11.2\textrm{M}\), \(23.5\textrm{M}\), and \(15.6\textrm{M}\), respectively. This observation substantiates that the sheer number of model parameters is not pivotal in achieving the superior performance of LEA-Net.

Fig. 5

Comparison of \(F_1\) scores for purely supervised networks and those for LEA-Net

Comparison of Unsupervised Networks and LEA-Net

To rigorously evaluate the efficacy of LEA-Net, we juxtaposed its performance with that of a straightforward thresholding method applied to anomaly maps. We assessed the image-level anomaly detection capability by computing \(F_1\) scores in the following settings: (i) LEA-Net employing anomaly maps generated through color reconstruction, denoted as LEA-Net (Color Reconstruction), (ii) LEA-Net utilizing anomaly maps formed via auto-encoding, termed LEA-Net (Auto-Encoding), (iii) LEA-Net with anomaly maps generated based on PaDiM, labelled as LEA-Net (PaDiM), (iv) Direct thresholding of anomaly maps originated from color reconstruction, identified as Color Reconstruction, (v) Direct thresholding of anomaly maps produced through auto-encoding, referred to as Auto-Encoding, and (vi) Direct thresholding of anomaly maps emanating from PaDiM, designated as PaDiM. For configurations (i)–(iii), the threshold for calculating \(F_1\) scores is set at 0.5.
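The direct-thresholding baselines (iv)–(vi) reduce each anomaly map to an image-level decision. Since the pooling rule is not restated here, the sketch below assumes max-pooling over the map, a common choice, as an illustration:

```python
import numpy as np

def image_level_decision(anomaly_map, threshold):
    """Direct-thresholding baseline: flag an image as anomalous (1) when the
    maximum pixel value of its anomaly map exceeds the threshold."""
    return int(anomaly_map.max() > threshold)

clean = np.full((28, 28), 0.1)      # uniformly low anomaly map
defect = clean.copy()
defect[10:12, 10:12] = 0.9          # small, localized high-response region
```

Unlike LEA-Net, this baseline has no trainable component, so its accuracy hinges entirely on the quality of the anomaly map and the chosen threshold.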

As depicted in Fig. 6, LEA-Net consistently outperforms the straightforward thresholding approach in the contexts of both Color Reconstruction and Auto-Encoding across all datasets. Specifically, it is noteworthy that LEA-Net considerably enhances performance on most categories of the PlantVillage dataset. However, PaDiM yields superior results compared to LEA-Net on the MVTec AD dataset, except for the hazelnut category.

Fig. 6

Comparison of \(F_1\) scores for automatic threshold tuning and those for LEA-Net

Dependence on the Selection of the Attention Points

In this section, we evaluated the influence of attention point selection on the efficacy of anomaly detection. As depicted in Fig. 7, we contrast the detection performance of LEA-Net configured with different attention points. The horizontal axis portrays the generative methods employed for the anomaly maps of LEA-Net, whereas the vertical axis represents the \(F_1\) score. Each bar signifies the average \(F_1\) score, and an error bar indicates the standard deviation. The five bars arrayed along the horizontal axis illustrate the performance of LEA-Net corresponding to each attention point. The results in Fig. 7 indicate that the anomaly detection performance depends on the attention points, especially for PaDiM. However, in the cases of Color Reconstruction and Auto-Encoding, we did not observe such dependencies except for the carpet category.

Fig. 7

Comparison of \(F_1\) scores for different attention points

Discussion

Specifically, Fig. 7 demonstrates that the choice of attention points significantly influences anomaly detection performance, an influence that concurrently depends on the type of anomaly map in use. To elucidate this, we conducted a comparative study of attention maps at various points, as presented in Fig. 8. These maps are accompanied by their corresponding \(F_1\) scores for the MVTec AD tile category. In the figure, columns (a)–(c) correspond to distinct anomaly maps derived from separate reconstruction tasks: (a) corresponds to Color Reconstruction, (b) to Auto-Encoding, and (c) to PaDiM. Well-localized anomaly maps are observed to substantially enhance detection efficacy when external attention is applied at the first through fourth attention points. Conversely, poorly localized, excessive attention maps tend to compromise performance, except when external attention is deployed at the final attention point. Emphasizing positional information about the anomaly is essential at shallow attention points, whereas emphasizing the degree of abnormality is critical at deep attention points. As positional information is beneficial for detecting anomalies, we can expect that the hierarchical representation from position to abnormality is vital for external attention to promote anomaly detection performance.

Fig. 8

Attention maps of LEA-Net at each attention point for MVTec AD tile

Conclusion

In this study, we have scrutinized the role of the external attention mechanism in enhancing the detection performance of CNNs. Using the MVTec AD and PlantVillage datasets for empirical analysis, we have ascertained that layer-wise external attention effectively augments the performance of the baseline model in anomaly detection. The present findings indicate that the effectiveness of external attention is contingent upon the compatibility between the dataset and the anomaly map. Moreover, the data suggest that the focus on positional information is pivotal for shallower attention points, whereas the emphasis on abnormality becomes crucial at deeper attention points. Intriguingly, we also observed that the overall intensity of attention was appreciably amplified by external attention, even when dealing with low-intensity anomaly maps. In conclusion, the positional features within anomalies assume greater importance than the overall intensity and appearance of the anomaly map. Therefore, a well-localized positional feature within an anomaly map serves as a key determinant of the effectiveness of layer-wise external attention for anomaly detection.