Abstract
Transformer networks perform well across a wide range of vision tasks, especially object detection. In practical applications, however, Transformers are difficult to deploy on board vehicles because of their high computational cost. In this paper, we propose conv-attention, a new approach for reducing the computation of self-attention. Unlike self-attention, conv-attention is inspired by NetVLAD and replaces the similarity calculation between query and key with the probability obtained by soft classification. Moreover, we combine the three convolution operations used to compute the attention matrix in order to reduce the computational effort. Compared with the Swin Transformer, experiments show that the parameters and FLOPs are reduced by 15% and 16%, respectively. Meanwhile, mAP is improved in vehicle object detection on both the FLIR and nuScenes datasets.
1 Introduction
People acquire a large amount of information in daily life, such as visual, auditory, and tactile information, and must process it quickly and efficiently while picking out the information they need. The human visual processing system tends to focus on the main part of an image and ignore irrelevant information, which is the attention mechanism of the human biological system [4, 5, 28]. The attention mechanism was first introduced by Bahdanau et al. [2] to calculate weights over the input data, highlighting how strongly a given input influences the output (Fig. 1). Today, attention mechanisms are widely used in various areas of artificial intelligence [9, 20, 33], such as Natural Language Processing (NLP), Speech Recognition, and Computer Vision (CV).
In computer vision, earlier attention mechanisms were mainly based on convolutional neural networks, such as SENet [12], Coordinate Attention [11], and CBAM [32]. The Transformer, however, introduced self-attention, a more efficient, scalable, and domain-independent architecture. ViT [9], a large Transformer-based model that applies self-attention directly to image patches, is the first major work to outperform traditional CNN models. Most Transformers [30] are structured as an encoder and a decoder: the input data are converted into vectors, followed by operations such as multi-head self-attention, normalization, and fully connected layers. Such a structure gives the Transformer great flexibility in vision tasks.
Self-attention is the most important part of the Transformer, and its computational cost is proportional to the square of the sequence length. When the sequence is long, the attention matrix becomes expensive to compute. Hence, improvements have been made in the field of NLP to reduce the computational effort, such as local attention, which computes attention only between adjacent elements. Reformer [16] and Routing Transformer [27] use hashing and clustering, respectively, to reduce the computational effort. FNet [17] replaces the computation of self-attention with the Fourier transform. In computer vision, since images contain far more pixels than a sentence contains words, longer sequences are needed to represent them [14]. Swin Transformer [20] addresses this with shifted windows, so that attention is computed only within each window, which is less computationally intensive. Methods for reducing the amount of computation are therefore also needed in computer vision.
In practical engineering applications, the YOLO family of algorithms is often used for vehicle object detection. Although Transformers perform better, they are difficult to deploy directly on vehicles because of their large number of parameters and amount of computation. NetVLAD [1], however, is widely used in computer vision because of its excellent computing speed. In this paper, we borrow ideas from NetVLAD to make the Transformer more suitable for in-vehicle deployment.
During our research, we found that the main role of NetVLAD is clustering and dimensionality reduction, while the main purpose of the attention mechanism is to reduce the interference of useless information. Both NetVLAD and self-attention therefore serve to extract the main components of the input. In self-attention, the attention matrix is generated by computing similarity, which resembles classification-based aggregation: given two vectors, one computes the probability that each element of one vector belongs to the other. Based on the similarity of the two computational approaches and principles, we propose conv-attention, which combines the efficiency of NetVLAD with the completeness of self-attention. By using conv-attention in Swin Transformer, we found that the number of parameters can be effectively reduced while the accuracy of vehicle detection also improves. The main contributions of this paper are as follows:
1. We reveal the principle of attention matrix generation in self-attention by comparing it with NetVLAD, and we optimize the calculation of self-attention to avoid excessive matrix operations.
2. We propose a new method of attention calculation, called conv-attention, and apply it to Swin Transformer. Params are reduced by an average of 15.8%, and FLOPs by an average of 16%.
3. The extension experiment shows that when YOLOX is used as the detection head, the Swin Transformer with conv-attention performs better in vehicle target detection. On the infrared FLIR dataset, mAP is improved by 2.46%; on the autonomous-driving nuScenes dataset, mAP is improved by 9.26%.
2 Related Work
NetVLAD. In computer vision, visual place recognition is a challenging task. In early work, bag-of-words (BoW [26]) models were widely used; they rely on hand-crafted features (SIFT [22], SURF [3], HoG [8], Gist [23]). As deep learning developed, Fisher Vector (FV [25]) and VLAD became powerful alternatives to hand-crafted features.
VLAD [13], the “Vector of Locally Aggregated Descriptors”, is a simplified version of FV that computes the residuals between descriptors and cluster centers; the residuals are summed with weights to obtain a feature vector for the whole image. NetVLAD [1] was proposed by Arandjelović et al.; its most important feature is that it can be trained by backpropagation. It can also be inserted into any feature extraction network as a module, like a convolutional layer. A growing number of algorithms [15] based on NetVLAD have been proposed.
Vision Transformer. The Transformer was first used for machine translation and has since been widely used in NLP; multi-head self-attention is at its core. In 2020, the Transformer [30] was introduced to computer vision with ViT [9]. Twins [7] revisits the design of spatial attention and proposes spatially separable self-attention: attention is grouped by spatial dimensions, and local self-attention is calculated separately and then fused. None of these methods attempts to change self-attention itself; they simply optimize the computation in a way similar to employing local attention.
Attention Mechanism. In computer vision, attention mechanisms provide great benefits; they mainly include channel attention, spatial attention, temporal attention, branch attention, and combinations of these. Self-attention originated in NLP for sequence-to-sequence tasks, where it allows machines to better comprehend the context and meaning of a given sentence or document. The attention matrix (or alignment function) of the attention mechanism has evolved into many variants along the way.
At the same time, because computing the attention matrix is expensive, many works have sought to optimize the computation. Reformer [16] replaces the dot-product attention in the Transformer with locality-sensitive hashing attention. Routing Transformer [27] computes a sparse attention matrix by clustering the keys and queries with mini-batch k-means. FNet [17] uses the Fourier transform instead of attention and suggests that attention may not be the principal component driving the performance of Transformers.
All of these methods reduce the computational effort by optimizing the computation, but none of them avoids the complexity caused by the matrix operations. In this paper, the proposed approach regards attention as a classification problem and uses convolution to obtain the attention matrix directly, which effectively reduces the computational effort.
3 Method
To effectively reduce the FLOPs and Params of a Transformer (e.g., Swin Transformer), we first discuss the improvements and advantages of moving from VLAD to NetVLAD. Then, an analysis of its structure and computational approach leads us to revisit the principle of self-attention, where we view the computation of the attention matrix as a classification problem. To make the classification more efficient, we implement it with a \(1*1\) convolution. Finally, we introduce the structure of conv-attention and theoretically analyze its computational cost.
3.1 Classification Problems in NetVLAD
VLAD is a feature aggregation and dimensionality reduction method. It can be expressed by the following Eq. [13]:
where \({x_i}(j)\) is the \({j^{th}}\) dimension of the \({i^{th}}\) local descriptor, \({c_k}\) is the N-dimensional anchor point of cluster k, and \({\alpha _k}({x_i})\) is a symbolic function: if feature \({x_i}\) is closest to cluster center \({c_k}\), then \({\alpha _k}({x_i}) = 1\); otherwise \({\alpha _k}({x_i}) = 0\).
As a symbolic function, \({\alpha _k}({x_i})\) is not differentiable. NetVLAD was proposed to enable backpropagation for end-to-end training; its main improvement is a smooth weight function that replaces this symbolic function [1]:
The change from Eq. 1 to Eq. 2 makes the parameters of VLAD learnable and also turns VLAD into a classification problem: assuming there are k classes, the distribution of the residuals of the local features over these k classes is computed to obtain the global feature V(j, k).
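The difference between hard and soft assignment can be illustrated with a small NumPy sketch. It assumes the standard VLAD aggregation \(V(j,k)=\sum _i {\alpha _k}({x_i})({x_i}(j)-{c_k}(j))\) from [13] and a NetVLAD-style soft assignment proportional to \(e^{-\lambda \Vert {x_i}-{c_k}\Vert ^2}\) from [1]. The function names, the decay parameter `lam`, and the toy data are illustrative assumptions, not the authors' code.

```python
import numpy as np

def hard_assign(x, c):
    """One-hot weights: 1 for the nearest cluster center, 0 otherwise."""
    d2 = ((x[:, None, :] - c[None, :, :]) ** 2).sum(-1)   # (n, K) squared distances
    a = np.zeros_like(d2)
    a[np.arange(len(x)), d2.argmin(axis=1)] = 1.0
    return a

def soft_assign(x, c, lam=1.0):
    """Softmax over negative squared distances; smooth in x and c."""
    d2 = ((x[:, None, :] - c[None, :, :]) ** 2).sum(-1)
    logits = -lam * d2
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def vlad(x, c, a):
    """V[j, k] = sum_i a[i, k] * (x[i, j] - c[k, j])."""
    resid = x[:, None, :] - c[None, :, :]                  # (n, K, D) residuals
    return np.einsum("ik,ikj->jk", a, resid)               # (D, K)

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 8))    # 50 local descriptors with 8 dimensions (toy data)
c = rng.normal(size=(4, 8))     # 4 cluster centers (toy data)

V_hard = vlad(x, c, hard_assign(x, c))
V_soft = vlad(x, c, soft_assign(x, c, lam=1.0))
```

Each row of the soft assignment is a probability distribution over the K clusters, which is what allows the aggregation to be read as a classification step. A larger `lam` pushes the soft weights toward the one-hot case.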
3.2 Conv-Attention
For self-attention, the input consists of n d-dimensional vectors, \(X = ({x_1},{x_2},\ldots ,{x_n})\), and the operation can be written as \(Q = X{W_Q}\), \(K = X{W_K}\), \(V = X{W_V}\), where Q, K, and V denote the query, key, and value, and \({W_Q}\), \({W_K}\), \({W_V}\) are the parameters of the fully connected layers. The attention matrix in self-attention is the \(n \times n\) matrix \(A = \mathrm{softmax}(\frac{{Q{K^T}}}{{\sqrt{{d_k}} }})\), and the output can be represented as \({X_i}= \sum \limits _{j = 1}^n {{A_{ij}}{V_j}} \).
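As a point of reference for the sections that follow, the self-attention computation just described can be sketched in a few lines of NumPy. This is a minimal single-head illustration with random toy weights; the function and variable names are assumptions made for the example.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention on n tokens."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)   # (n, n) attention matrix
    return A @ V, A                                 # output row i = sum_j A[i, j] * V[j]

rng = np.random.default_rng(0)
n, d = 6, 4                                         # toy sizes
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out, A = self_attention(X, W_q, W_k, W_v)
```

The \(n \times n\) shape of `A` is the source of the quadratic cost in the sequence length.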
Compared with Eq. 1, the attention matrix \({A_{ij}}\) is obtained in a way similar to the symbolic function \({\alpha _k}({x_i})\) in VLAD. In VLAD, the symbolic function \({\alpha _k}({x_i})\) is determined by the relationship between descriptors and cluster centers. In self-attention, the attention matrix \({A_{ij}}\) is determined by the relationship between key and query. Thus, we can regard the query as a descriptor of the vector to be queried and the key as a location to be queried, and treat the relationship between the two like that between descriptors and cluster centers. In subsequent experiments, we also tried to compute the symbolic function in VLAD in the same way as the attention matrix, but the results (see Table 2) were not satisfactory.
The calculation of the attention matrix can likewise be viewed as a classification problem based on the distance between key and query: the smaller the distance, the greater the correlation between the two. The softmax function converts the distance into a probability value, i.e.
where \({k_i}\) and \({q_j}\) denote the key and query, and \(\lambda \) is a parameter that scales the distance between them. Expanding Eq. 3 and eliminating the common factor \({e^{ - \lambda k_i^T{k_i}}}\) gives Eq. 4:
Letting \({w_j} = 2\lambda {q_j}\) and \({b_j} = - \lambda q_j^T{q_j}\), we obtain the following equation:
Thus, the attention matrix can be obtained by computing the distance between \({k_i}\) and \({q_j}\), i.e., it can be viewed as a classification problem implemented with a fully connected layer. Since \({k_i}\) and \({q_j}\) both come from the input, the two can be interchanged.
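The derivation described above can be written out as follows. This is a reconstruction from the definitions given in the text (a softmax over distances, \({w_j} = 2\lambda {q_j}\), \({b_j} = - \lambda q_j^T{q_j}\)), not a verbatim copy of the published equations. The index convention (key \({k_i}\) fixed, normalization over the queries \({q_j}\)) is an assumption chosen to match the subscripts of \({w_j}\) and \({b_j}\).

```latex
\begin{align}
A_{ij} &= \frac{e^{-\lambda \lVert k_i - q_j \rVert^2}}
               {\sum_{j'} e^{-\lambda \lVert k_i - q_{j'} \rVert^2}} \tag{3} \\
\lVert k_i - q_j \rVert^2 &= k_i^T k_i - 2\, q_j^T k_i + q_j^T q_j \notag \\
A_{ij} &= \frac{e^{2\lambda q_j^T k_i - \lambda q_j^T q_j}}
               {\sum_{j'} e^{2\lambda q_{j'}^T k_i - \lambda q_{j'}^T q_{j'}}} \tag{4} \\
A_{ij} &= \frac{e^{w_j^T k_i + b_j}}
               {\sum_{j'} e^{w_{j'}^T k_i + b_{j'}}} \tag{5}
\end{align}
```

The factor \({e^{ - \lambda k_i^T{k_i}}}\) appears in the numerator and in every term of the denominator of Eq. 3, which is why it cancels when passing to Eq. 4.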
Indeed, this form of attention resembles additive attention, which is given by the formula [18]:
The difference between additive attention and conv-attention (ours) lies in how the weights of the fully connected layer are generated. In additive attention, the weight matrix applied to the key is generated for the key alone, while in conv-attention the weight matrix applied to the key is related to the query. Given the excellent performance of softmax in self-attention, the softmax function is retained.
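The equivalence between the distance-based softmax and a linear layer followed by softmax, as derived in this section, can be checked numerically. The sketch below assumes that the normalization runs over the queries \({q_j}\) for a fixed key \({k_i}\) and uses random toy data; the variable names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d, lam = 5, 3, 0.7             # toy sizes and an arbitrary scale parameter
K = rng.normal(size=(n, d))       # row i is the key k_i
Q = rng.normal(size=(n, d))       # row j is the query q_j

# Distance form: A[i, j] proportional to exp(-lam * ||k_i - q_j||^2)
d2 = ((K[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
A_dist = softmax(-lam * d2, axis=1)

# Linear form: A[i, j] proportional to exp(w_j . k_i + b_j)
W = 2.0 * lam * Q                 # row j is w_j = 2 * lam * q_j
b = -lam * (Q * Q).sum(axis=1)    # b_j = -lam * q_j . q_j
A_lin = softmax(K @ W.T + b, axis=1)

same = np.allclose(A_dist, A_lin)
```

The two matrices agree because the term that depends only on \({k_i}\) is constant along each row and is removed by the softmax normalization.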
3.3 Model Structure
In practice, a linear structure in a network (such as \(W_j^Tk + {b_j}\)) is usually implemented with a fully connected layer, which flattens the feature map into a one-dimensional vector and multiplies it by a weight vector. A fully connected layer can be replaced by the \(1*1\) convolution proposed in NIN [19]; following this principle, the linear projections of self-attention have been replaced by three \(1*1\) convolutions [24]. However, such a replacement does not effectively reduce FLOPs, because the computational effort of self-attention is concentrated in the matrix operations. This paper finds that the similarity calculation is itself a classification process and can also be implemented with a \(1*1\) convolution, so the three convolutions (those of the key, the query, and the classification) can be merged into one \(1*1\) convolution. In self-attention, the semantic content is carried mainly by the value, so the value needs to preserve its semantic features as much as possible. Our improvement retains the semantic features in the value while de-emphasizing the positional information of the attention. At the same time, the operation effectively reduces the computation of the attention matrix.
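One possible reading of the merged \(1*1\) convolution is sketched below in NumPy. It is not the authors' implementation. The assumptions are: a single window of \(M \times M\) tokens with C channels, attention logits for token i against position j produced directly by one \(1*1\) convolution with \(M^2\) output channels, and the value produced by a separate \(1*1\) convolution. All names (`conv1x1`, `conv_attention_window`, `w_attn`, `w_val`) are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def conv1x1(x, weight, bias):
    """1x1 convolution on a (C_in, H, W) map: the same linear layer at every pixel."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def conv_attention_window(x, w_attn, b_attn, w_val, b_val):
    """Hypothetical conv-attention on one window x of shape (C, M, M)."""
    C, M, _ = x.shape
    n = M * M
    # logits[i, j] = w_j . x_i + b_j, produced by a single 1x1 convolution
    logits = conv1x1(x, w_attn, b_attn).reshape(n, n).T
    A = softmax(logits, axis=1)                          # (n, n) attention matrix
    V = conv1x1(x, w_val, b_val).reshape(C, n).T         # (n, C) values
    out = (A @ V).T.reshape(C, M, M)
    return out, A

rng = np.random.default_rng(0)
C, M = 8, 3                                              # toy sizes
x = rng.normal(size=(C, M, M))
w_attn, b_attn = rng.normal(size=(M * M, C)), rng.normal(size=M * M)
w_val, b_val = rng.normal(size=(C, C)), rng.normal(size=C)
out, A = conv_attention_window(x, w_attn, b_attn, w_val, b_val)
```

In this reading no \(Q{K^T}\) product is formed: the attention logits come straight from the convolution, and the only remaining matrix product is the attention matrix times the value.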
To better verify the efficiency of our substitution, we summarize the theoretical computational complexity of convolution and self-attention (see Table 1), while the floating-point operations (FLOPs) and parameters of the models in the experiments can be found in Table 3.
Comparing the theory with the experimental results, we find that the theoretical reduction is larger than the measured one. Analysis shows that this is because the models used in the experiments are based on Swin Transformer. In its calculation, the cost of the second part (computing the attention matrix and applying it to the value) grows mainly with the square of the window partition size M. This part cannot be reduced, so the reduction measured in the experiments is smaller than the theoretical value.
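The argument that a window-dependent term limits the achievable saving can be illustrated with the complexity formula for window attention given in the Swin Transformer paper [20], \(\Omega (\text{W-MSA}) = 4hw{C^2} + 2{M^2}hwC\). The sketch below only splits that formula into its two terms. It does not reproduce Table 1, and the feature-map size, channel width, and window size are illustrative values.

```python
def wmsa_flops(h, w, C, M):
    """Split 4*h*w*C^2 + 2*M^2*h*w*C into (projection_term, window_term)."""
    projection = 4 * h * w * C * C     # query, key, value, and output projections
    window = 2 * M * M * h * w * C     # attention matrix and attention-times-value
    return projection, window

proj, win = wmsa_flops(h=56, w=56, C=96, M=7)   # illustrative first-stage sizes
share = win / (proj + win)                      # fraction tied to the window size
print(f"projection term: {proj:,}")
print(f"window term:     {win:,} ({share:.1%} of the total)")
```

Only the projection term can be shrunk by changing how the attention matrix is generated. The window term stays, which is consistent with the measured reduction being smaller than the theoretical one.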
4 Experiments
We first try replacing the convolution branch in NetVLAD with attention-style generation (similar to cross-attention [6]). The effect of our object detection model is then tested on the FLIR and nuScenes datasets. The experiments were run on an Intel(R) Xeon(R) W-2150B CPU @ 3.00 GHz with two GeForce RTX 3070 GPUs and 16 GB of memory; all experiments are based on Linux, Python 3.10, and PyTorch 1.10.
Swin Transformer was tested for object detection with four typical frameworks: Cascade Mask R-CNN, ATSS, RepPoints v2, and Sparse R-CNN. Because we focus on vehicle object detection, we adopt YOLOX [10] and Swin-T as our object detection framework, with appropriate adaptations that are described in our other work [21]. The backbone we use is basically the same as Swin Transformer, and the detection head is basically the same as YOLOX.
4.1 NetVLAD Related Experiments
We followed NetVLAD’s method on the Pittsburgh (Pitts250k) dataset. Pitts250k contains 250k database images downloaded from Google Street View and 24k test queries generated from Street View but taken at different times, years apart.
The standard place recognition evaluation procedure is used. A query image is considered correctly localized if at least one of the top N retrieved images is within 25 m of the true location. The percentage of correctly localized queries (recall) is then reported for different values of N. The underlying architecture is VGG-16, cropped at the final convolutional layer after ReLU. The number of cluster centers is K = 64, and training runs for at most 15 epochs. The remaining parameters are the same as in the original NetVLAD paper [1].
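The Recall@N criterion described above can be sketched as follows. The input layout (one row per query, one column per ranked retrieval, each entry the distance in metres between the retrieved image and the query's true location) and the toy distances are assumptions made for illustration.

```python
import numpy as np

def recall_at_n(dist_to_truth, ns=(1, 5, 10), threshold_m=25.0):
    """Fraction of queries with at least one of the top-N retrievals within the threshold.

    dist_to_truth[q, r] is the distance in metres between the true location of
    query q and its r-th ranked retrieved image.
    """
    dist_to_truth = np.asarray(dist_to_truth, dtype=float)
    hits = dist_to_truth <= threshold_m
    return {n: float(hits[:, :n].any(axis=1).mean()) for n in ns}

# Toy example: 4 queries, 3 ranked retrievals each
dists = [
    [40.0, 12.0, 90.0],   # correct at rank 2
    [ 5.0, 60.0, 70.0],   # correct at rank 1
    [80.0, 95.0, 30.0],   # never within 25 m
    [55.0, 45.0, 20.0],   # correct at rank 3
]
recalls = recall_at_n(dists, ns=(1, 2, 3))
```

Recall@N is non-decreasing in N by construction, since a query that is correct within the top N retrievals is also correct within the top N + 1.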
Table 2 shows the Recall@N comparison of the NetVLAD layer with attention, the NetVLAD layer with a fully connected layer, and the original NetVLAD layer. The results with attention are about the same as those obtained directly with a fully connected layer, indicating that attention generation is still fully connected in nature, that is, it performs classification, although less effectively. A \(1*1\) convolution can also be seen as a fully connected layer, but it preserves the spatial structure of the feature map and retains features better. We believe that attention works better when stacked, and the results of attention-based NetVLAD might improve if the backbone network were also replaced with attention, but this is not the focus of this paper. In addition, we find that NetVLAD with attention is more robust.
4.2 Conv-Attention Related Experiments
We tested object detection on both the FLIR dataset and the nuScenes dataset. The FLIR dataset was acquired with the FLIR Tau2, a vehicle-mounted thermal imaging camera, while driving in the daytime (60%) and at night (40%) between November and May on the streets and highways of Santa Barbara, California.
The nuScenes dataset is a large public dataset for autonomous driving developed by the Motional team. It is derived from 1000 driving scenes collected in Boston and Singapore. nuScenes covers both daytime and nighttime and contains data from multiple sensors; from it we select a subset of the images captured by the vision cameras. A total of 23,220 images are selected, of which 18,810 are used as the training set and 2088 as the validation set. Specifically, we select front-view, back-view, and side-view images from the keyframes of the nuScenes dataset, which prevents too many similar images from influencing the training.
Based on Table 3, we find that the overall number of parameters is reduced by about 15.8% and the GFLOPs by about 16.0% compared with the corresponding Swin Transformer. The larger the model, the greater the reduction in the number of parameters and the greater the percentage reduction in GFLOPs, which favors deploying Transformer-type models on board. In the subsequent experiments, we use the Swin-T (conv-attention) model.
Based on Table 4, mAP improves by 2.46%; the performance is basically on par with Swin-YOLOX (self-attention) and significantly better than some previous methods. We find that the main improvement from replacing the backbone network lies in the bicycle class: although bicycles are the least numerous class in the dataset and the network must learn to distinguish between person and bicycle, the AP for bicycle of Swin-YOLOX (conv-attention) reaches 62.96%. This shows that conv-attention can effectively improve the feature extraction ability of the backbone network.
Based on Table 5 and Fig. 4, we find that mAP improves by 9.26%, with the main gains in the car and bicycle classes, while AP for the bus class decreases by 3.27%. This result suggests that attention computed with the convolutional approach no longer focuses on one part of the object and instead pays more attention to the object as a whole. The improvement on the nuScenes dataset is mainly due to an increase in recall, in contrast to the more even improvement on the FLIR dataset. The main reason for the difference is the poor recall of Swin-YOLOX (self-attention) on nuScenes, especially for the cycle class, where the increase is large.
5 Conclusion
In this paper, we construct a new attention method in which the attention matrix is calculated by classification, and we apply it to vehicle object detection. Experiments show that it can replace the original self-attention generation method and streamline the structure of the Transformer to facilitate object detection.
References
Arandjelovic R, Gronat P, Torii A et al (2016) Netvlad: CNN architecture for weakly supervised place recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, pp 5297–5307
Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473
Bay H, Tuytelaars T, Van Gool L (2006) Surf: speeded up robust features. Lect Notes Comput Sci 3951:404–417
Chan W, Jaitly N, Le Q et al (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In: 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 4960–4964
Chaudhari S, Mithal V, Polatkan G et al (2021) An attentive survey of attention models. ACM Trans Intell Syst Technol 12(5):1–32
Chen CFR, Fan Q, Panda R (2021) Crossvit: cross-attention multi-scale vision transformer for image classification. In: Proceedings of the IEEE/CVF international conference on computer vision. IEEE, pp 357–366
Chu X, Tian Z, Wang Y et al (2021) Twins: revisiting the design of spatial attention in vision transformers. Adv Neural Inf Process Syst 34:9355–9366
Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05). IEEE, pp 886–893
Dosovitskiy A, Beyer L, Kolesnikov A et al (2020) An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
Ge Z, Liu S, Wang F et al (2021) Yolox: exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430
Hou Q, Zhou D, Feng J (2021) Coordinate attention for efficient mobile network design. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. IEEE, pp 13713–13722
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, pp 7132–7141
Jégou H, Douze M, Schmid C et al (2010) Aggregating local descriptors into a compact image representation. In: 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, pp 3304–3311
Keles FD, Wijewardena PM, Hegde C (2023) On the computational complexity of self-attention. In: International conference on algorithmic learning theory. PMLR, pp 597–619
Khaliq A, Milford M, Garg S (2022) Multires-netvlad: augmenting place recognition training with low-resolution imagery. IEEE Robot Autom Lett 7(2):3882–3889
Kitaev N, Kaiser Ł, Levskaya A (2020) Reformer: the efficient transformer. arXiv preprint arXiv:2001.04451
Lee-Thorp J, Ainslie J, Eckstein I et al (2021) Fnet: mixing tokens with fourier transforms. arXiv preprint arXiv:2105.03824
Li Y, Kaiser L, Bengio S et al (2019) Area attention. In: International conference on machine learning. PMLR, pp 3846–3855
Lin M, Chen Q, Yan S (2013) Network in network. arXiv preprint arXiv:1312.4400
Liu Z, Lin Y, Cao Y et al (2021) Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision. IEEE, pp 10012–10022
Lou Z, Luo S (2022) Vehicle infrared target detection based on Yolox and Swin transformer. Infrared Technol 44(11):9
Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vision 60:91–110
Oliva A, Torralba A (2006) Building the gist of a scene: the role of global image features in recognition. Prog Brain Res 155:23–36
Pan X, Ge C, Lu R et al (2022) On the integration of self-attention and convolution. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. IEEE, pp 815–825
Perronnin F, Dance C (2007) Fisher kernels on visual vocabularies for image categorization. In: 2007 IEEE conference on computer vision and pattern recognition. IEEE, pp 1–8
Philbin J, Chum O, Isard M et al (2007) Object retrieval with large vocabularies and fast spatial matching. In: 2007 IEEE conference on computer vision and pattern recognition. IEEE, pp 1–8
Roy A, Saffar M, Vaswani A et al (2021) Efficient content-based sparse attention with routing transformers. Trans Assoc Comput Linguist 9:53–68
Shen T, Zhou T, Long G et al (2017) Disan: directional self-attention network for rnn/cnn-free language understanding. arXiv preprint arXiv:1709.04696
Touvron H, Cord M, Douze M et al (2021) Training data-efficient image transformers & distillation through attention. In: International conference on machine learning. PMLR, pp 10347–10357
Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. Adv Neural Inf Process Syst 30
Wang W, Xie E, Li X et al (2021) Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF international conference on computer vision. IEEE, pp 568–578
Woo S, Park J, Lee JY et al (2018) Cbam: Convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV). IEEE, pp 3–19
Xie E, Wang W, Yu Z et al (2021) Segformer: simple and efficient design for semantic segmentation with transformers. Adv Neural Inf Process Syst 34:12077–12090
Zhu L, Wang X, Ke Z et al (2023) Biformer: vision transformer with bi-level routing attention. In: 2023 IEEE/CVF conference on computer vision and pattern recognition (CVPR). IEEE, pp 10323–10333
Contributions
Lou Zhehang wrote the manuscript, Luo Suyun reviewed and revised the manuscript, Huang Xiaoci performed project administration, and Wei Dan performed funding acquisition. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
This work was supported by National Science Foundation of China under Grant 62101314.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Lou, Z., Luo, S., Huang, X. et al. Conv-Attention: A Low Computation Attention Calculation Method for Swin Transformer. Neural Process Lett 56, 71 (2024). https://doi.org/10.1007/s11063-024-11483-6