1 Introduction

People acquire a large amount of information in daily life, such as visual, auditory, and tactile information, and they process it quickly and efficiently while filtering out what is needed. The human visual system tends to focus on the main part of an image and ignore irrelevant information; this is the attention mechanism of the human biological system [4, 5, 28]. The attention mechanism was first introduced by Bahdanau et al. [2] to weight the input data so as to highlight how strongly a given input influences the output (Fig. 1). Today, attention mechanisms are widely used in many areas of artificial intelligence [9, 20, 33], such as Natural Language Processing (NLP), Speech Recognition and Computer Vision (CV).

Fig. 1 Different forms of attention, including dot product (left), add (middle), and conv (right). Our proposed conv-attention mainly pairs key and query one-to-one for classification

In computer vision, earlier attention mechanisms were mainly built on convolutional neural networks, such as SENet [12], Coordinate Attention [11], and CBAM [32]. The Transformer, however, introduced self-attention, a more efficient, scalable and domain-independent architecture. ViT [9], a large Transformer-based model that applies self-attention directly to image patches, is the first major work to outperform traditional CNN models. Most Transformers [30] are structured into an encoder and a decoder: the input data is converted into vectors, which then pass through operations such as multi-head self-attention, normalization, and fully connected layers. Such a structure gives the Transformer great flexibility in vision tasks.

Self-attention is the most important part of the Transformer, and its cost grows with the square of the sequence length. When the sequence is too long, the attention matrix becomes expensive to compute. Hence, improvements have been made in NLP to reduce the computation, such as local attention, which computes attention only between adjacent elements. Reformer [16] and Routing Transformer [27] use clustering to reduce the computation, and FNet [17] replaces the computation of self-attention with the Fourier Transform. In computer vision, since images have a higher pixel resolution than text, longer sequences are needed to represent them [14]. The Swin Transformer's [20] solution is to use shifted windows so that attention is computed only within each window, which is far less computationally intensive. Methods that reduce the amount of computation are therefore needed in computer vision as well.
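To make the scaling argument concrete, the following back-of-the-envelope sketch (our illustration with example sizes; the 4*4 patch size, 7*7 window and 96-dimensional embedding are assumptions borrowed from typical Swin-T settings, not values from this paper) contrasts the quadratic cost of global attention with the linear cost of window attention.

```python
# Back-of-the-envelope comparison (our illustration, not the paper's cost model):
# global self-attention scales with n^2 * d, window attention with n * M^2 * d.

def global_attention_ops(n, d):
    # QK^T and A @ V each cost roughly n * n * d multiply-adds
    return 2 * n * n * d

def window_attention_ops(n, d, window=49):
    # attention restricted to windows of `window` tokens (e.g. 7 x 7 in Swin)
    return 2 * n * window * d

n = 56 * 56   # tokens of a 224 x 224 image split into 4 x 4 patches
d = 96        # embedding dimension, e.g. the first stage of Swin-T
print(global_attention_ops(n, d) / window_attention_ops(n, d))   # 64.0, i.e. ~64x fewer ops
```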

In practical engineering applications, the YOLO family of algorithms is often used for vehicle object detection. Although Transformers perform better, they are difficult to deploy directly on vehicles because of their large number of parameters and heavy computation. NetVLAD [1], in contrast, is widely used in computer vision because of its excellent computing speed. In this paper, we borrow ideas from NetVLAD to improve the Transformer for in-vehicle deployment.

During our research, we found that the main role of NetVLAD is clustering and dimensionality reduction, while the main purpose of the attention mechanism is to reduce the interference of useless information; both essentially extract the principal components of their input. In self-attention, the attention matrix is generated by computing similarity, which resembles classification by aggregation: given two vectors, the probability that an element of one vector belongs to the other is computed. Based on the similarity of these two computational approaches and principles, we propose conv-attention, which combines the efficiency of NetVLAD with the completeness of self-attention. By using conv-attention in the Swin Transformer, we found that the number of parameters can be effectively reduced while the accuracy of vehicle detection is also improved. The main contributions of this paper are as follows:

  1. We reveal the principle behind attention-matrix generation in self-attention by comparing it with NetVLAD, and we optimize the calculation of self-attention to avoid excessive matrix operations.

  2. We propose a new attention calculation method, called conv-attention, and apply it in the Swin Transformer. FLOPs are reduced by an average of 15.8% and Params by an average of 16%.

  3. An extension experiment shows that, with YOLOX as the detection head, the Swin Transformer with conv-attention performs better in vehicle object detection. On the infrared FILR dataset, mAP improves by 2.46%; on the autonomous-driving Nuscenes dataset, mAP improves by 9.26%.

2 Related Work

NetVLAD In computer vision, visual place recognition is a challenging task. Early work widely used bag-of-words (BoW [26]) models, which rely on hand-crafted features (SIFT [22], SURF [3], HoG [8], Gist [23]). With the development of deep learning, Fisher Vectors (FV [25]) and VLAD became powerful alternatives to hand-crafted features.

VLAD [13], the “Vector of Locally Aggregated Descriptors”, is a simplified version of FV that computes the residuals between descriptors and cluster centers; the feature vector of the whole image is obtained by weighted summation. NetVLAD [1] was proposed by Relja et al.; its most important property is that it can be trained by backpropagation, and it can be inserted as a convolutional layer into any feature extraction network as a module. More and more algorithms based on NetVLAD have since been proposed [15].

Vision Transformer The Transformer was first used for machine translation and has since been used throughout NLP; multi-head self-attention is at its core. In 2020, the Transformer [30] was introduced to computer vision with ViT [7]. Twins revisits the design of spatial attention and proposes spatially separable self-attention, in which attention is grouped by spatial dimensions and local self-attention is computed separately and then fused. None of these methods attempts to modify self-attention itself; they simply optimize the computation in a way similar to local attention.

Attentional Mechanism In computer vision, attention mechanisms provide great benefits; they mainly include channel attention, spatial attention, temporal attention, and branch attention, as well as combinations of them. Self-attention originated in NLP for sequence-to-sequence tasks, where it allows machines to better comprehend the context and meaning of a given sentence or document. The attention matrix (or alignment function) of the attention mechanism has evolved into various variants along the way.

At the same time, because the attention matrix is computationally heavy, much work has been done to optimize its computation. Reformer [16] replaced the dot-product attention in the Transformer with locality-sensitive-hashing attention. Routing Transformer [27] computes a sparse attention matrix by clustering keys and queries with mini-batch K-means, relying on low-rank sparse patterns. FNet [17] uses the Fourier Transform instead of attention generation and suggests that attention may not be the principal component driving the performance of Transformers.

All of these methods reduce the computational effort by optimizing the computation, but none of them avoids the complexity caused by matrix operations. In this paper, the proposed approach regards attention as a classification problem and uses convolution to obtain the attention matrix directly, which effectively reduces the computational effort.

3 Method

To effectively reduce the FLOPs and Params of the Transformer (e.g. the Swin Transformer), we first discuss the improvements and advantages in moving from VLAD to NetVLAD. Then, analyzing its structure and computational approach leads us to revisit the principle of self-attention, where we view the attention matrix as a classification problem. To make the classification more efficient, we implement it with a \(1*1\) convolution. Finally, we introduce the structure of conv-attention and theoretically analyze its computational cost.

3.1 Classification Problems in NetVLAD

VLAD is a feature aggregation and dimensionality reduction method. It can be expressed by the following Eq. [13]:

$$\begin{aligned} V(j,k) = \sum \limits _{i = 1}^N {{\alpha _k}({x_i})({x_i}(j) - {c_k}(j))} \end{aligned}$$
(1)

where \({x_i}(j)\) is the \({j^{th}}\) dimension of the \({i^{th}}\) local descriptor, \({c_k}\) is the anchor point (cluster center) of cluster k, and \({\alpha _k}({x_i})\) is an indicator function: if feature \({x_i}\) is closest to cluster center \({c_k}\), then \({\alpha _k}({x_i}) = 1\), otherwise \({\alpha _k}({x_i}) = 0\).

As an indicator function, \({\alpha _k}({x_i})\) is not differentiable. The NetVLAD algorithm was proposed to enable backpropagation for end-to-end training; its main improvement is a smooth soft-assignment weight that replaces the indicator function [1]:

$$\begin{aligned} V(j, k)=\sum _{i=1}^{N} \frac{e^{w_{k}^{T} x_{i}+b_{k}}}{\sum _{k^{\prime }} e^{w_{k^{\prime }}^{T} x_{i}+b_{k^{\prime }}}}\left( x_{i}(j)-c_{k}(j)\right) \end{aligned}$$
(2)

The change from Eq. 1 to Eq. 2 makes the parameters of VLAD learnable and trainable, and it also turns VLAD into a classification problem: given k clusters, the distribution of local-feature residuals over these k clusters is computed to obtain the global feature V(j, k).
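As a concrete illustration of Eq. 2, the following PyTorch sketch implements the soft assignment as a \(1*1\) convolution followed by a softmax and then aggregates the weighted residuals. It is a minimal reading of the NetVLAD layer: module and parameter names are ours, and the per-cluster intra-normalization used in the original layer is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftVLAD(nn.Module):
    """Minimal NetVLAD-style layer: soft assignment (Eq. 2) via 1x1 conv + softmax."""
    def __init__(self, dim=512, num_clusters=64):
        super().__init__()
        self.assign = nn.Conv2d(dim, num_clusters, kernel_size=1)    # w_k, b_k
        self.centers = nn.Parameter(torch.randn(num_clusters, dim))  # c_k

    def forward(self, x):                      # x: (B, D, H, W) local descriptors
        a = F.softmax(self.assign(x), dim=1)   # (B, K, H, W) soft assignment alpha_k(x_i)
        x = x.flatten(2)                       # (B, D, N) with N = H*W descriptors
        a = a.flatten(2)                       # (B, K, N)
        # residuals x_i - c_k, weighted by the assignment and summed over the N descriptors
        v = torch.einsum('bkn,bdn->bkd', a, x) - a.sum(-1, keepdim=True) * self.centers
        return F.normalize(v.flatten(1), dim=1)  # flattened, L2-normalised VLAD vector
```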

3.2 Conv-Attention

For self-attention, the input is represented as n d-dimensional vectors x, which constitute \(X = ({x_1},{x_2}\ldots ,{x_n})\). The specific operations can be written as \(Q = X{W_Q}\), \(K = X{W_K}\), \(V = X{W_V}\), where Q, K, V denote the query, key and value, and \({W_Q}\), \({W_K}\), \({W_V}\) are the fully connected layer parameters. The attention matrix in self-attention is the \(n \times n\) matrix \(A = soft\max (\frac{{Q{K^T}}}{{\sqrt{{d_k}} }})\), and the output can be written as \({X_i}= \sum \limits _{j = 1}^n {{A_{ij}}{V_j}} \).
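For later comparison, a minimal single-head version of these operations can be written as follows (a sketch in PyTorch with our own naming; the multi-head split and any masking are omitted):

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Plain single-head self-attention as written above: an n x n attention matrix A
    followed by a weighted sum over the values (class and attribute names are ours)."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim)   # W_Q
        self.w_k = nn.Linear(dim, dim)   # W_K
        self.w_v = nn.Linear(dim, dim)   # W_V

    def forward(self, x):                # x: (B, n, d)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        a = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(x.size(-1)), dim=-1)  # (B, n, n)
        return a @ v                     # X_i = sum_j A_ij V_j
```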

Comparing this with Eq. 1, the attention matrix \({A_{ij}}\) is obtained in much the same way as the indicator function \({\alpha _k}({x_i})\) in VLAD. In VLAD, the indicator function \({\alpha _k}({x_i})\) is determined by the relationship between descriptors and cluster centers; in self-attention, the attention matrix \({A_{ij}}\) is determined by the relationship between key and query. Thus, we can view the query as a descriptor of the vector to be queried and the key as the place being queried, with the relationship between the two playing the role of descriptors and cluster centers. In subsequent experiments, we also tried to compute the indicator function in VLAD in the same way as the attention matrix, but the results (see Table 2) were not satisfactory.

The calculation of the attention matrix can therefore be viewed as a classification problem based on the distance between key and query: the smaller the distance, the greater the correlation between the two. The softmax function converts this distance into a probability value, i.e.

$$\begin{aligned} {A_{ij}} = \frac{{{e^{ - \lambda {{\left\| {{k_i} - {q_j}} \right\| }^2}}}}}{{\sum \limits _{j'} {{e^{ - \lambda {{\left\| {{k_i} - {q_{j'}}} \right\| }^2}}}} }} \end{aligned}$$
(3)

where \({k_i}\), \({q_j}\) denote the key and query, and \(\lambda \) is a parameter that controls the scale of the distance between them. Expanding Eq. 3 and cancelling the common factor \({e^{ - \lambda k_i^T{k_i}}}\) gives Eq. 4:

$$\begin{aligned} {A_{ij}} = \frac{{{e^{2\lambda q_j^T{k_i} - \lambda q_j^T{q_j}}}}}{{\sum \limits _{j'} {{e^{2\lambda q_{j'}^T{k_i} - \lambda q_{j'}^T{q_{j'}}}}} }} \end{aligned}$$
(4)

Letting \({w_j} = 2\lambda {q_j}\) and \({b_j} = - \lambda q_j^T{q_j}\), we obtain the following equations:

$$\begin{aligned} {A_{ij}} = \frac{{{e^{w_j^T{k_i} + {b_j}}}}}{{\sum \limits _{j'} {{e^{w_{j'}^T{k_i} + {b_{j'}}}}} }} \end{aligned}$$
(5)
$$\begin{aligned} A = soft\max (W_j^Tk + {b_j}) \end{aligned}$$
(6)

Thus, the attention matrix can be obtained from the distance between \({k_i}\) and \({q_j}\); that is, it can be viewed as a classification problem implemented with a fully connected layer. Since \({k_i}\) and \({q_j}\) both come from the input, the two can be interchanged.
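The equivalence between Eq. 3 and Eq. 5 can be checked numerically. The short sketch below (our illustration with random data and arbitrary sizes) builds one attention row both ways and confirms they match, since the \(k_i^T k_i\) term cancels inside the softmax.

```python
import torch

# Numerical check that Eq. 3 and Eq. 5 give the same attention row:
# softmax_j(-lambda * ||k_i - q_j||^2)  ==  softmax_j(w_j^T k_i + b_j)
torch.manual_seed(0)
lam, d, n = 0.5, 8, 5
k_i = torch.randn(d)
q = torch.randn(n, d)                                             # rows are the q_j

a_dist = torch.softmax(-lam * ((k_i - q) ** 2).sum(-1), dim=0)    # Eq. 3

w = 2 * lam * q                                                   # w_j = 2*lambda*q_j
b = -lam * (q * q).sum(-1)                                        # b_j = -lambda*q_j^T q_j
a_linear = torch.softmax(w @ k_i + b, dim=0)                      # Eq. 5 (k_i^T k_i cancels)

print(torch.allclose(a_dist, a_linear, atol=1e-6))                # True
```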

Indeed, this form of attention resembles additive attention, which is given by the formula [18]:

$$\begin{aligned} A = \tanh ({W_k}k + {W_q}q + b) \end{aligned}$$
(7)

The difference between additive attention and conv-attention (ours) lies in how the weights of the fully connected matrix are generated. In additive attention, the weight matrix applied to the key is learned independently of the query, while in conv-attention the weight matrix applied to the key is derived from the query. Given the excellent performance of softmax in self-attention, the softmax function is retained.

3.3 Model Structure

In practice, a linear structure (such as \(W_j^Tk + {b_j}\)) is usually realized in a network with a fully connected layer, which expands the feature map into a one-dimensional vector and multiplies it by a weight matrix. A \(1*1\) convolution, following the principle proposed by NIN [3], can be used instead of a fully connected layer, and some works [24] introduce \(1*1\) convolutions into self-attention in exactly this way, replacing its linear projections with three \(1*1\) convolutions (Fig. 3). However, such a replacement does not effectively reduce FLOPs, because the computational cost of self-attention is concentrated in the matrix operations. This paper observes that the similarity computation is itself a classification process and can also be realized by a \(1*1\) convolution, so the three convolutions (for key, query and classification) can be merged into a single \(1*1\) convolution. The main correlation in self-attention lies in the value, so the value must preserve its semantic features as far as possible. Our modification still retains the semantic features in the value, while relaxing the explicit positional information in the attention. At the same time, this operation effectively reduces the computation of the attention matrix.
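The following PyTorch sketch shows one way to realize this merged design under our reading of the text (class and parameter names are ours, not the authors' code): a single \(1*1\) convolution produces the attention logits directly over a fixed set of token positions, so the separate query/key projections and the \(QK^T\) product disappear, while the value keeps its own projection.

```python
import torch
import torch.nn as nn

class ConvAttention(nn.Module):
    """Sketch of the merged 1x1-conv attention described above (our illustration):
    one 1x1 convolution classifies each token over the token positions, giving the
    attention matrix directly; the value keeps a separate projection."""
    def __init__(self, dim, num_tokens):
        super().__init__()
        # one 1x1 conv plays the role of the merged key/query/classification step
        self.classify = nn.Conv1d(dim, num_tokens, kernel_size=1)
        self.w_v = nn.Linear(dim, dim)             # semantic content preserved in the value

    def forward(self, x):                          # x: (B, n, d), n must equal num_tokens
        v = self.w_v(x)
        logits = self.classify(x.transpose(1, 2))  # (B, n_classes j, n_tokens i)
        a = torch.softmax(logits.transpose(1, 2), dim=-1)   # A_ij, softmax over classes j
        return a @ v                               # X_i = sum_j A_ij V_j

# toy usage inside a 7x7 Swin-style window: 49 tokens of dimension 96
out = ConvAttention(dim=96, num_tokens=49)(torch.randn(2, 49, 96))   # (2, 49, 96)
```

Because the "classes" correspond to token positions, the token count must be fixed, as it is inside a Swin window (\(M^2\) tokens per window).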

Fig. 3 Other attention using \(1*1\) convolution: the \(1*1\) convolutions only replace the process of generating query, key, and value

To better verify the efficiency of our substitution, we summarize the theoretical computational complexity of the convolution and of self-attention (see Table 1), while the floating-point operations (FLOPs) and parameters of the models measured in the experiments can be found in Table 3.

Compared with the experimental results, the theoretical reduction is larger. Analysis shows that the models used in the experiments are based on the Swin Transformer; in its computation, the cost of the second part (computing the attention matrix and multiplying it with the value) scales quadratically with the window size M. This part cannot be reduced, so the reduction measured in the experiments is smaller than the theoretical value.
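To see why this term persists, the small sketch below (our illustration, using the standard W-MSA complexity split \(4hwC^2 + 2M^2hwC\) with Swin-T stage-1 sizes as assumed example values) separates the projection part, which the merged convolution can shrink, from the attention-times-value part, which it cannot.

```python
# Illustrative split of window-attention cost in Swin (standard W-MSA formula):
# 4*h*w*C^2 for the linear projections, 2*M^2*h*w*C for the attention-times-value part.
# The second term is the part that the conv-attention replacement cannot shrink.

def wmsa_cost(h, w, C, M=7):
    projections = 4 * h * w * C * C       # Q/K/V and output projections
    attention   = 2 * M * M * h * w * C   # A = QK^T and A @ V inside each window
    return projections, attention

proj, attn = wmsa_cost(h=56, w=56, C=96, M=7)
print(proj, attn, attn / (proj + attn))   # share of the irreducible attention part (~20%)
```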

Table 1 Theoretical FLOPs and Params of the different modules

4 Experiments

We first try to replace the convolution branch in NetVLAD with attention-based generation (similar to cross-attention [6]). The effect of our object detection model is then tested on the FILR dataset and the Nuscenes dataset. The hardware is an Intel(R) Xeon(R) W-2150B CPU @ 3.00 GHz and two GeForce RTX 3070 GPUs with 16 GB of memory; all experiments are run on Linux with Python 3.10 and PyTorch 1.10.

Four typical object detection frameworks, Cascade Mask R-CNN, ATSS, RepPoints v2, and Sparse R-CNN, have been used with the Swin Transformer for object detection testing. We focus on vehicle object detection and adopt YOLOX [10] and Swin-T as our object detection framework with appropriate adaptations, which can be found in our other work [21]. The framework we use is basically the same as the Swin Transformer, and the detection head is basically the same as that of YOLOX.

4.1 NetVLAD Related Experiments

We follow NetVLAD's protocol on the Pittsburgh (Pitts250k) dataset. Pitts250k contains 250k database images downloaded from Google Street View and 24k test queries generated from Street View but taken at different times, years apart.

We use the standard place recognition evaluation procedure: a query image is considered correctly localized if at least one of the top N retrieved images lies within 25 m of its true location, and the percentage of correctly localized queries (recall) is then computed for different values of N. The base architecture is VGG-16, cropped at the final convolutional layer after the ReLU. The number of cluster centers is K = 64, and training runs for at most 15 epochs. The remaining parameters are the same as in the original NetVLAD [1] paper.
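A minimal sketch of this evaluation protocol, assuming the retrieved candidates for each query are already ranked and converted to geographic distances in metres (function and array names are ours):

```python
import numpy as np

def recall_at_n(ranked_dists_m, n_values=(1, 5, 10), threshold_m=25.0):
    """Fraction of queries with at least one of the top-N retrieved images within
    `threshold_m` metres of the query's true position (sketch of the standard protocol)."""
    # ranked_dists_m: (num_queries, num_retrieved) distances in metres,
    # already sorted by retrieval score for each query
    return {n: float(np.mean((ranked_dists_m[:, :n] <= threshold_m).any(axis=1)))
            for n in n_values}

# toy usage: 3 queries, top-5 retrieved candidates each
dists = np.array([[10, 40, 60, 80, 90],
                  [30, 30, 26, 24, 70],
                  [50, 55, 60, 65, 70]], dtype=float)
print(recall_at_n(dists))   # recall@1 = 1/3, recall@5 = recall@10 = 2/3
```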

Table 2 Results of attempt to employ self-attention in NetVLAD

Table 2 shows the Recall@N comparison of the NetVLAD layer with attention, the NetVLAD layer with a fully connected layer, and the original NetVLAD layer. From the results, using attention has roughly the same effect as directly using a fully connected layer, indicating that what attention produces is still fully connected in nature, i.e., it performs classification, but the effect is not stronger. A \(1*1\) convolution can also be seen as fully connected, but it preserves the spatial structure of the feature map and retains features better. We believe attention is superior when stacked, and the results of attention-based NetVLAD might be better if the backbone network were also replaced with attention, but that is not the focus of this paper. In addition, we find that the robustness of NetVLAD with attention is even better.

4.2 Conv-Attention Related Experiments

We test object detection on both the FILR dataset and the Nuscenes dataset. The FILR dataset was acquired by a FLIR Tau2, a vehicular thermal imaging camera, in daytime (60%) and nighttime (40%) driving environments between November and May on the streets and highways of Santa Barbara, California.

The Nuscenes dataset is a large public dataset for autonomous driving developed by the Motional team, derived from 1000 driving scenes collected in Boston and Singapore. Nuscenes covers both daytime and nighttime and contains multiple sensors; we select part of the data captured by the vision cameras, 23,220 images in total, of which 18,810 are used as the training set and 2088 as the validation set. Specifically, we select front-view, back-view, and side-view images through the keyframes of the Nuscenes dataset, which effectively prevents too many similar images from influencing the training.

Table 3 FLOPs and params for networks with different attention approaches (BiFormer from [34])

Based on Table 3, the overall number of parameters of the model is reduced by about 15.8% and the GFLOPs by about 16.0% compared with the corresponding Swin Transformer; the larger the model, the greater the overall reduction in parameters and the greater the percentage reduction in GFLOPs, which facilitates deploying Transformer-type models on vehicles. In the subsequent experiments, we use the Swin-T (conv-attention) model.

Table 4 Performance of our model and the original Swin on the FILR dataset

Based on Table 4, mAP improves by 2.46%, which is basically on par with the performance of Swin-YOLOX (self-attention) and significantly better than some previous methods. We find that the main improvement from replacing the backbone lies in the bicycle class: bicycles are the least frequent class in the dataset, and the network must learn to distinguish between person and bicycle, yet the bicycle AP of Swin-YOLOX (conv-attention) reaches 62.96%. This shows that conv-attention can effectively improve the feature extraction ability of the backbone network.

Table 5 Performance of our model and the original Swin on the Nuscenes dataset
Fig. 4 Precision and Recall for each category on the Nuscenes dataset

Based on Table 5 and Fig. 4, mAP improves by 9.26%, with the main gains in the car and bicycle classes, while the AP of the bus class drops by 3.27%. This suggests that attention computed with the convolutional approach no longer focuses on one part of the object and pays more attention to the object as a whole. The improvement on the Nuscenes dataset comes mainly from the increase in Recall, compared with the more evenly distributed improvement on the FILR dataset. The main reason for the different effects is the poor Recall of Swin-YOLOX (self-attention) on Nuscenes, especially on the bicycle class, which shows a huge increase.

5 Conclusion

In this paper, we construct a new attention method, proposing an attention matrix computed by classification, and apply it to vehicle object detection. Experiments show that it can replace the original self-attention generation method and optimize the structure of the Transformer to facilitate object detection.