Abstract
The external attention mechanism offers a promising approach to enhance image anomaly detection (Hayakawa et al., in: IMPROVE, pp. 100–110, 2023). Nevertheless, the effectiveness of this method is contingent upon the judicious selection of an intermediate layer with external attention. In this study, we performed a comprehensive series of experiments to clarify the mechanisms through which external attention improves detection performance. We assessed the performance of the LEA-Net (Hayakawa et al., in: IMPROVE, pp. 100–110, 2023), which implements layer-wise external attention, using the MVTec AD and Plant Village datasets. The detection performances of the LEA-Net were compared with those of the baseline model under different anomaly maps generated by three unsupervised approaches. In addition, we investigated the relationship between the detection performance of LEA-Net and the selection of an attention point, i.e., an intermediate layer where external attention is applied. The findings reveal that the synergy between the dataset and the generated anomaly map influenced the effectiveness of the LEA-Net. For poorly localized anomaly maps, the selection of the attention point becomes a pivotal factor in determining detection efficiency. At shallow attention points, a well-localized attention map notably improves the detection performance. For deeper attention points, the overall intensity of the attention map is essential; this intensity can be substantially amplified by layer-wise external attention, even for a low-intensity anomaly map. Overall, the results suggest that for layer-wise external attention, the positional attributes of anomalies hold greater significance than the overall intensity or visual appearance of the anomaly map.
Introduction
Anomaly detection serves as a critical technique for identifying anomalous patterns in voluminous datasets, holding particular relevance in the analysis of imaging data. This technology finds applications in diverse domains, including but not limited to medical diagnosis [2, 3], plant healthcare [4], surveillance video [5], and disaster detection [6, 7]. Recent advancements in deep learning have propelled a surge of scholarly interest in the development of automated anomaly detection methods for expansive image datasets. Based on machine learning research, these techniques can be categorized into three primary classes: supervised, semi-supervised, and unsupervised methodologies. Despite each approach’s unique merits and limitations, the predominant challenge is the efficient identification of anomalies based on a limited number of anomalous instances.
Convolutional neural networks (CNNs) represent a prevalent architecture in the landscape of computer vision, offering robust solutions for tasks such as image recognition and segmentation. Utilizing substantial labeled datasets, CNNs have achieved state-of-the-art performance in real-world image anomaly detection applications [6, 8]. Nonetheless, CNN-based anomaly detectors frequently grapple with the scarcity of labeled instances and a low incidence of anomalies. Several studies have developed strategies to ameliorate these constraints, including the incorporation of active learning [9] and the deployment of transfer learning [6] to enhance the learning efficiency of CNNs.
Unsupervised learning methods have achieved wide acceptance in the domain of anomaly detection, primarily because they eliminate the need for labeled anomalous samples during the training phase. A conventional approach in unsupervised image anomaly detection relies on the utilization of deep convolutional auto-encoders to reconstruct normal images [10]. However, these auto-encoders sometimes falter in the precise reconstruction of fine structures, leading to the generation of excessively blurry images. To counter this limitation, Generative Adversarial Networks (GANs) have been introduced into the field. AnoGAN [11] pioneered the application of GANs for image anomaly detection. Moreover, AnoGAN has been adapted to assess the reconstructability of colors, thereby enabling the sensitive detection of color anomalies [12]. In unsupervised anomaly detection, it is a commonly adopted practice to quantify the deviation between the original image and the reconstructed image as the Anomaly Score.
Although unsupervised anomaly detectors eliminate the need for labeling anomalous instances during training, they pose certain shortcomings. Primarily, these detectors are susceptible to overlooking subtle and minute anomalies because the Anomaly Score is predicated upon the distance between the standard and test images. Therefore, the effectiveness of unsupervised anomaly detectors is contingent upon the robust formulation of an Anomaly Score for specific objectives. Second, the appropriate threshold of the Anomaly Score should be carefully tuned to classify normal and anomalous instances accurately. This calibration frequently entails a laborious process of trial and error.
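The scoring and thresholding scheme described above can be sketched in a few lines. The mean-squared-error score, the function names, and the toy threshold values below are illustrative choices of ours, not the exact formulation of any cited detector:

```python
import numpy as np

def anomaly_score(original: np.ndarray, reconstructed: np.ndarray) -> float:
    """Quantify the deviation between the original image and its
    reconstruction; mean squared pixel error is one common choice."""
    return float(np.mean((original - reconstructed) ** 2))

def classify(score: float, threshold: float) -> bool:
    """Flag an instance as anomalous when its score exceeds the
    threshold; as noted above, this threshold must be tuned."""
    return score > threshold

# Toy example: the reconstruction misses a small bright defect.
original = np.zeros((8, 8))
original[2:4, 2:4] = 1.0          # 4 defective pixels
reconstructed = np.zeros((8, 8))  # a "perfectly normal" reconstruction
score = anomaly_score(original, reconstructed)  # 4 / 64 = 0.0625
```

The sketch also makes the second shortcoming concrete: whether this instance is flagged depends entirely on where the threshold sits relative to 0.0625, which is exactly the calibration burden discussed above.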
Recent advancements in visual attention mechanisms have garnered considerable traction in computer vision [13]. Attention branch network (ABN) incorporates a branching structure termed Attention Branch [14]. The attention maps from this branch serve as visual explications to describe the decision-making process within CNN. These attention maps have been demonstrated to improve CNN performance across various image classification tasks.
The visual attention mechanism has enabled robust prediction on imbalanced data by utilizing contrastive learning in anomaly detection [15]. This research raises the intriguing prospect of integrating visual attention into image anomaly detection schemes. Nonetheless, existing visual attention modules, including ABN, predominantly rely on self-attention mechanisms [16, 17]. Consequently, the quality of attention in these modules is intrinsically linked to the network’s overall performance, thereby limiting their direct applicability in enhancing the performance of anomaly detectors.
In a preceding study [1], the layer-wise external attention network (LEA-Net) was introduced to enhance CNN-based anomaly detection through the incorporation of an external attention mechanism. The external attention mechanism leverages prior knowledge from an external source; LEA-Net utilizes the outputs of another, separately pre-trained network. As discussed, unsupervised and supervised anomaly detectors each have their own shortcomings. To address these, LEA-Net consolidates supervised and unsupervised anomaly detection algorithms through the lens of a visual attention mechanism. The burgeoning advancements in visual attention mechanisms intimate the feasibility of leveraging prior knowledge in anomaly detection. The strategies described in [1] include the following:
- The pre-existing knowledge concerning anomalies is articulated through an anomaly map constructed via the unsupervised learning of normal instances.
- Subsequently, this anomaly map is transformed into an attention map by an auxiliary network.
- The attention map is then incorporated into the intermediate layers of the anomaly detection network (ADN).
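The third step, incorporating an attention map into an intermediate layer, can be sketched as follows. The residual gating form \(f \cdot (1 + a)\) used here is a common fusion scheme assumed for illustration; the exact fusion operation in LEA-Net [1] may differ:

```python
import numpy as np

def apply_external_attention(feature_map: np.ndarray,
                             attention_map: np.ndarray) -> np.ndarray:
    """Fuse an externally generated attention map into an intermediate
    feature map.

    feature_map:   (C, H, W) activations from an intermediate ADN layer.
    attention_map: (H, W) map in [0, 1] produced by the auxiliary network.
    The residual form f * (1 + a) preserves the original features while
    amplifying spatial locations flagged as anomalous.
    """
    assert feature_map.shape[1:] == attention_map.shape
    return feature_map * (1.0 + attention_map[None, :, :])

# Toy check: attention of 1 at one pixel doubles the activations there
# across every channel, leaving all other locations unchanged.
f = np.ones((2, 4, 4))
a = np.zeros((4, 4))
a[1, 1] = 1.0
out = apply_external_attention(f, a)
```

Because the fusion is element-wise over spatial positions, the attention map only has to match the spatial resolution of the chosen layer, which is what makes the attention point a free design choice in the first place.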
In line with this strategy, the effectiveness of layer-wise external attention in image anomaly detection was assessed through comprehensive experiments utilizing publicly accessible, real-world datasets. The findings revealed that layer-wise external attention reliably enhanced the performance of anomaly detectors, even with limited data. Further, the results suggested that the external attention mechanism can synergistically operate with the self-attention mechanism to enhance anomaly detection capabilities.
Although the external attention mechanism holds considerable promise for setting a new paradigm in image anomaly detection, its effectiveness depends on the judicious selection of an intermediate layer equipped with external attention. To clarify how the layer-wise external attention mechanism improves anomaly detection performance, we conducted a series of more in-depth experiments. The principal contributions of this research are stated as follows:
- We introduced an embedding-based approach, the Patch Distribution Modeling framework (PaDiM) [18], to generate anomaly maps alongside the reconstruction-based approaches.
- We comparatively analyzed the performance of LEA-Net with that of baseline models under various conditions to clarify the modes through which external attention improves the detection performance of CNNs.
- We discerned that the presence of well-localized positional features on an anomaly map is instrumental in successfully implementing layer-wise external attention.
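As a minimal sketch of the embedding-based map generation named in the first contribution, the following follows the PaDiM recipe [18]: fit a multivariate Gaussian per patch position on embeddings of normal images, then score each test patch by its Mahalanobis distance. Random features stand in here for the CNN patch embeddings PaDiM actually extracts, and the regularization constant is our choice:

```python
import numpy as np

def fit_patch_gaussians(train_embeds: np.ndarray):
    """Fit a Gaussian (mean, covariance) at every patch position, as in
    PaDiM [18]. train_embeds: (N, H, W, D) embeddings of N normal images."""
    mean = train_embeds.mean(axis=0)                     # (H, W, D)
    n, h, w, d = train_embeds.shape
    cov = np.empty((h, w, d, d))
    for i in range(h):
        for j in range(w):
            c = np.cov(train_embeds[:, i, j, :], rowvar=False)
            cov[i, j] = c + 0.01 * np.eye(d)             # regularization
    return mean, cov

def anomaly_map(embed: np.ndarray, mean: np.ndarray,
                cov: np.ndarray) -> np.ndarray:
    """Mahalanobis distance of each test patch to its fitted Gaussian."""
    h, w, d = embed.shape
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            diff = embed[i, j] - mean[i, j]
            out[i, j] = np.sqrt(diff @ np.linalg.inv(cov[i, j]) @ diff)
    return out

rng = np.random.default_rng(0)
train = rng.normal(size=(50, 4, 4, 3))   # 50 "normal" images
mean, cov = fit_patch_gaussians(train)
test = rng.normal(size=(4, 4, 3))
test[2, 2] += 10.0                       # inject one anomalous patch
amap = anomaly_map(test, mean, cov)      # peaks at position (2, 2)
```

Unlike the reconstruction-based approaches, this map is a statistical distance rather than a pixel difference, which is one reason the resulting maps can differ markedly in localization and intensity.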
Related Work
A more straightforward methodology was employed for the automated detection of thyroid nodule lesions in X-ray computed tomography images [19]. This technique leverages binary segmentation results acquired from a U-Net as input for supervised image classifiers. The authors demonstrated that such preprocessing via binary segmentation significantly enhances anomaly detection accuracy in practical applications. Similarly, the convolutional adversarial variational autoencoder with guided attention (CAVGA) employs anomaly maps in a weakly supervised setting to localize anomalous areas [20]. Through empirical evaluations using the MVTec AD dataset, CAVGA achieved state-of-the-art performance. Both studies substantiate the considerable promise of incorporating visual attention maps in image anomaly detection.
The concept of visual attention pertains to the selective refinement or amplification of image features for recognition tasks. The human perceptual system prioritizes information germane to the task over comprehensive data processing [21, 22]. Visual attention mechanisms emulate this human faculty in the context of image classification.

The anomaly maps were \(28\times 28\) in size, whereas 1000 embedding vectors with randomly selected dimensions were incorporated.
The parameters of LEA-Net, including the AAN (ResNet-based) and ADN (ResNet-18), were optimized using the Adam optimizer with a learning rate of 0.0001. The momentums of Adam were kept constant at \(\beta _1 = 0.9\) and \(\beta _2 = 0.999\). A total of 100 epochs were used to update these parameters, with the batch size maintained at 16. Computational tasks were executed on a system equipped with a GeForce RTX 2080 Ti GPU, running Python 3.10.12 and CUDA 11.8.89.
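For reference, a single Adam update with the hyperparameters stated above (learning rate 0.0001, \(\beta_1 = 0.9\), \(\beta_2 = 0.999\)) can be written out explicitly. This is the standard Adam algorithm, not the authors' training code:

```python
import numpy as np

def adam_step(param, grad, m, v, t,
              lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with the paper's hyperparameters as defaults.
    m, v are the running moment estimates; t is the 1-based step index."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# First step on a scalar parameter: the update magnitude is ~lr,
# since the bias-corrected ratio m_hat / sqrt(v_hat) starts near 1.
p = np.array([1.0])
m = np.zeros(1)
v = np.zeros(1)
p, m, v = adam_step(p, np.array([2.0]), m, v, t=1)
```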
Comparison of Supervised Networks and LEA-Net
The primary objective of LEA-Net is to augment the performance of the baseline network, which is trained in a purely supervised fashion in the realm of anomaly detection. To assess the efficacy of the external attention mechanism, we conducted comparative analyses on image-level anomaly detection performance across several models: (i) ResNet-18 as the baseline, (ii) ResNet-50, (iii) LEA-Net informed by anomaly maps generated through a color reconstruction task, denoted as LEA-Net (Color Reconstruction), (iv) LEA-Net guided by anomaly maps generated through auto-encoding, referred to as LEA-Net (Auto-Encoding), and (v) LEA-Net shaped by anomaly maps generated based on PaDiM, labelled as LEA-Net (PaDiM).
For each of these models (i)–(v), the network output threshold was fixed at 0.5 to facilitate the computation of \(F_1\) scores. Figure 5 reveals an enhancement in \(F_1\) scores for the baseline model (ResNet-18) due to the implementation of the external attention mechanism. The horizontal axis demarcates the categories of datasets employed in the experiments, while the bars signify the average \(F_1\) scores ascertained through cross-validation. We report only the maximal \(F_1\) score among the five selected attention points. Additionally, error bars represent the standard deviation, and bars corresponding to the highest average \(F_1\) score in each category are marked with a black inverted triangle.
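Computing the \(F_1\) score at a fixed output threshold of 0.5, as done for models (i)–(v), amounts to the following; the helper name is ours:

```python
import numpy as np

def f1_at_threshold(probs: np.ndarray, labels: np.ndarray,
                    threshold: float = 0.5) -> float:
    """F1 score when the network output is binarized at a fixed
    threshold (label 1 = anomalous)."""
    preds = probs >= threshold
    tp = np.sum(preds & (labels == 1))
    fp = np.sum(preds & (labels == 0))
    fn = np.sum(~preds & (labels == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: one true positive, one false positive, one false negative.
probs = np.array([0.9, 0.6, 0.4, 0.2])
labels = np.array([1, 0, 1, 0])
score = f1_at_threshold(probs, labels)  # precision = recall = 0.5
```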
As indicated in Fig. 5, the external attention mechanism substantially elevates the baseline model’s performance across all datasets. Most notably, the \(F_1\) scores in MVTec AD’s carpet, tile, and wood categories witnessed an average improvement of approximately \(14.3\%\). Interestingly, ResNet-50 underperformed compared to ResNet-18 in specific instances, such as the carpet category. Furthermore, the parameter counts for ResNet-18, ResNet-50, and LEA-Net are \(11.2\textrm{M}\), \(23.5\textrm{M}\), and \(15.6\textrm{M}\), respectively. This observation substantiates that the sheer number of model parameters is not pivotal in achieving the superior performance of LEA-Net.
Comparison of Unsupervised Networks and LEA-Net
To rigorously evaluate the efficacy of LEA-Net, we juxtaposed its performance with that of a straightforward thresholding method applied to anomaly maps. We assessed the image-level anomaly detection capability by computing \(F_1\) scores in the following settings: (i) LEA-Net employing anomaly maps generated through color reconstruction, denoted as LEA-Net (Color Reconstruction), (ii) LEA-Net utilizing anomaly maps formed via auto-encoding, termed LEA-Net (Auto-Encoding), (iii) LEA-Net with anomaly maps generated based on PaDiM, labelled as LEA-Net (PaDiM), (iv) Direct thresholding of anomaly maps originated from color reconstruction, identified as Color Reconstruction, (v) Direct thresholding of anomaly maps produced through auto-encoding, referred to as Auto-Encoding, and (vi) Direct thresholding of anomaly maps emanating from PaDiM, designated as PaDiM. For configurations (i)–(iii), the threshold for calculating \(F_1\) scores is set at 0.5.
As depicted in Fig. 6, LEA-Net consistently outperforms the straightforward thresholding approach in the contexts of both Color Reconstruction and Auto-Encoding across all datasets. Specifically, it is noteworthy that LEA-Net considerably enhances performance across most categories of the PlantVillage dataset. Moreover, PaDiM yields superior results compared to LEA-Net on the MVTec AD dataset, except for the hazelnuts category.
Dependence on the Selection of the Attention Points
In this section, we focused on evaluating the influence of attention point selection on the efficacy of anomaly detection. As depicted in Fig. 7, we contrast the detection performance of LEA-Net when configured with different attention points. The horizontal axis portrays the generative methods employed for the anomaly maps of the LEA-Net, whereas the vertical axis represents the \(F_1\) score. Each bar signifies the average \(F_1\) score, and an error bar indicates the standard deviation. The five bars arrayed along the horizontal axis illustrate the performance of LEA-Net at each attention point. The results in Fig. 7 indicate that the anomaly detection performance depends on the attention points, especially for PaDiM. However, in the cases of Color Reconstruction and Auto-Encoding, we did not observe such dependencies except for the carpet category.
Discussion
Specifically, Fig. 7 demonstrates that the choice of attention points significantly influences anomaly detection performance, an influence that concurrently depends on the type of anomaly map in use. To elucidate this, we conducted a comparative study of attention maps at various points, as presented in Fig. 8. These maps are accompanied by their corresponding \(F_1\) scores for the MVTec AD tile category. In the figure, columns (a)–(c) correspond to distinct anomaly maps derived from separate tasks: (a) corresponds to Color Reconstruction, (b) to Auto-Encoding, and (c) to PaDiM. Well-localized anomaly maps are observed to substantially enhance detection efficacy when external attention is applied at the first through fourth attention points. Conversely, poorly localized, excessive attention maps tend to compromise performance, except when external attention is deployed at the final attention point. Emphasizing the positional information of an anomaly is essential at shallow attention points, whereas emphasizing the degree of abnormality is critical at deep attention points. As positional information is beneficial for detecting anomalies, we can expect that the hierarchical representation from position to abnormality is vital for external attention to promote anomaly detection performance.
Conclusion
In this study, we have scrutinized the role of the external attention mechanism in enhancing the detection performance of CNNs. Using the MVTec AD and PlantVillage datasets for empirical analysis, we ascertained that layer-wise external attention effectively augments the performance of the baseline model in anomaly detection. The present findings indicate that the effectiveness of external attention is contingent upon the compatibility between the dataset and the anomaly map. Moreover, the data suggest that the focus on positional information is pivotal at shallower attention points, whereas the emphasis on abnormality becomes crucial at deeper attention points. Intriguingly, we also observed that the overall intensity of the attention map can be appreciably amplified by external attention, even when dealing with low-intensity anomaly maps. In conclusion, the positional features within anomalies assume greater importance than the overall intensity and appearance of the anomaly map. Therefore, a well-localized positional feature within an anomaly map serves as a key determinant of the effectiveness of layer-wise external attention for anomaly detection.
Data Availability Statement
The data used in this study will be made available by contacting the authors directly.
References
Hayakawa T, Nakanishi K, Katafuchi R, Tokunaga T. Layer-wise external attention for efficient deep anomaly detection. In: IMPROVE. 2023. p. 100–110.
Rezvantalab A, Safigholi H, Karimijeshni S. Dermatologist level dermoscopy skin cancer classification using different deep learning convolutional neural networks algorithms. 2018. arXiv:1810.10348.
Cao C, Liu F, Tan H, Song D, Shu W, Li W, Zhou Y, Bo X, Xie Z. Deep learning and its applications in biomedicine. Genom Proteom Bioinform. 2018;16(1):17–32.
Ferentinos KP. Deep learning models for plant disease detection and diagnosis. Comput Electron Agric. 2018;145:311–8.
Roka S, Diwakar M. Cvit: a convolution vision transformer for video abnormal behavior detection and localization. SN Comput Sci. 2023;4(6):829.
Minhas MS, Zelek J. Anomaly detection in images. 2019. arXiv:1905.13147.
Natarajan V, Mao S, Chia L-T. Salient textural anomaly proposals and classification for metal surface anomalies. In: 2019 IEEE 31st international conference on tools with artificial intelligence (ICTAI). 2019. p. 621–28. https://doi.org/10.1109/ICTAI.2019.00092.
Hughes DP, Salathe M. An open access repository of images on plant health to enable the development of mobile disease diagnostics. 2016. arXiv:1511.08060.
Görnitz N, Kloft M, Rieck K, Brefeld U. Toward supervised anomaly detection. J Artif Intell Res (JAIR). 2013;46:235–62.
Haselmann M, Gruber DP, Tabatabai P. Anomaly detection using deep learning based image completion. 2018.
Schlegl T, Seebock P, Waldstein SM, Schmidt-Erfurth U, Langs G. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In: International conference on information processing in medical imaging. 2017. p. 146–57.
Katafuchi R, Tokunaga T. Image-based plant disease diagnosis with unsupervised anomaly detection based on reconstructability of colors. 2021. p. 112–20. https://doi.org/10.5220/0010463201120120.
Zhao H, Jia J, Koltun V. Exploring self-attention for image recognition. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition. 2020. p. 10073–82. https://doi.org/10.1109/CVPR42600.2020.01009. arXiv:2004.13621.
Fukui H, Hirakawa T, Yamashita T, Fujiyoshi H. Attention branch network: learning of attention mechanism for visual explanation. 2019.
Takimoto H, Seki J, Situju SF, Kanagawa A. Anomaly detection using siamese network with attention mechanism for few-shot learning. Appl Artif Intell. 2022;36(1):2094885.
Hu J, Shen L, Albanie S, Sun G, Wu E. Squeeze-and-excitation networks. IEEE Trans Pattern Anal Mach Intell. 2020;42(8):2011–23. https://doi.org/10.1109/TPAMI.2019.2913372. arXiv:1709.01507.
Woo S, Park J, Lee JY, Kweon IS. CBAM: convolutional block attention module. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) LNCS, vol. 11211. 2018. p. 3–19. https://doi.org/10.1007/978-3-030-01234-2_1. arXiv:1807.06521.
Defard T, Setkov A, Loesch A, Audigier R. Padim: a patch distribution modeling framework for anomaly detection and localization. In: International conference on pattern recognition. Springer. 2021. p. 475–89.
Li W, Cheng S, Qian K, Yue K, Liu H. Automatic recognition and classification system of thyroid nodules in CT images based on CNN. Comput Intell Neurosci. 2021. https://doi.org/10.1155/2021/5540186.
Venkataramanan S, Peng K-C, Singh RV, Mahalanobis A. Attention guided anomaly localization in images. In: European conference on computer vision. Springer. 2020. p. 485–503.
Reynolds JH, Chelazzi L. Attentional modulation of visual processing. Annu Rev Neurosci. 2004;27:611–47.
Chun MM, Golomb JD, Turk-Browne NB. A taxonomy of external and internal attention. Annu Rev Psychol. 2011;62:73–101.
Wang F, Jiang M, Qian C, Yang S, Li C, Zhang H, Wang X, Tang X. Residual attention network for image classification. In: Proceedings of the 30th IEEE conference on computer vision and pattern recognition (CVPR 2017). 2017. p. 6450–58. https://doi.org/10.1109/CVPR.2017.683. arXiv:1704.06904.
Lee H, Kim HE, Nam H. SRM: a style-based recalibration module for convolutional neural networks. In: Proceedings of the IEEE international conference on computer vision 2019. 2019. p. 1854–62. https://doi.org/10.1109/ICCV.2019.00194.
Wang Q, Wu B, Zhu P, Li P, Zuo W, Hu Q. ECA-Net: efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition. 2020. p. 11531–39. https://doi.org/10.1109/CVPR42600.2020.01155. arXiv:1910.03151.
Yang L, Zhang R-Y, Li L, Xie X. Simam: a simple, parameter-free attention module for convolutional neural networks. In: International conference on machine learning. PMLR. 2021. p. 11863–74.
Zagoruyko S, Komodakis N. Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In: 5th international conference on learning representations, ICLR 2017—conference track proceedings. 2017. p. 1–13. arXiv:1612.03928.
Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network. 2015.
Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer; 2015. p. 234–41.
Sharma G, Wu W, Dalal EN. The CIEDE2000 color-difference formula: implementation notes, supplementary test data, and mathematical observations. Color Res Appl. 2005;30(1):21–30.
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. p. 770–78.
Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L. Imagenet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE; 2009. p. 248–55.
Lu L. Dying relu and initialization: theory and numerical examples. Commun Comput Phys. 2020;28(5):1671–706. https://doi.org/10.4208/cicp.oa-2020-0165.
Bergmann P, Fauser M, Sattlegger D, Steger C. MVTec AD—a comprehensive real-world dataset for unsupervised anomaly detection. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR). 2019.
Mohanty SP. PlantVillage-Dataset. GitHub. 2023.
Acknowledgements
The authors thank the ENAGO Group (English Editing Company) for editing a draft of this paper.
Funding
This work was supported by JSPS KAKENHI Grant Number 22K12169, JST PREST Grant Number JPMJPR1875, and NEDO Intensive Support Program for Young Promising Researchers Grant Number 21W2K034.
Author information
Contributions
Authors Mr. Keiichi Nakanishi, Mr. Ryo Shiroma, and Dr. Terumasa Tokunaga conducted the experiments, analyzed the results, and wrote the manuscript. Author Mr. Tokihisa Hayakawa also contributed to writing the manuscript. Author Mr. Ryoya Katafuchi provided significant intellectual content in this study.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Research involving human and/or animals
Not applicable.
Informed consent
Not applicable.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Nakanishi, K., Shiroma, R., Hayakawa, T. et al. Layer-Wise External Attention by Well-Localized Attention Map for Efficient Deep Anomaly Detection. SN COMPUT. SCI. 5, 592 (2024). https://doi.org/10.1007/s42979-024-02912-3