Introduction

The process of recognizing and obtaining the geographic location of a given query image in a pre-built image database is known as visual place recognition (VPR), or visual geo-localization (VG). Large-scale image geo-localization is often regarded as an image retrieval task1,2. VPR is crucial in many robotics and computer vision tasks, such as autonomous driving3, 3D reconstruction4, and unmanned aerial vehicle (UAV) localization in Global Navigation Satellite System (GNSS)-denied environments5. The challenges of VPR mainly arise from changing external environments, such as different seasons, different illumination, occlusion, and moving objects6; environments with high appearance similarities, such as trees and buildings7; and differences in camera viewpoints1. Therefore, researchers are interested in obtaining feature descriptors that are robust and generalizable to image changes.

Traditional methods use the scale-invariant feature transform (SIFT)8 and histogram of oriented gradients (HOG)9 to obtain local feature descriptors, whereas most current studies10,11,12,13,14 insert end-to-end trainable layers into pre-trained feature extraction networks to obtain robust global descriptor representations. Subsequent studies have enhanced the robustness and generalization of image descriptors for better localization by attaching semantic and contextual information15,16, using attention mechanisms10,17, and exploiting multiscale features13,18. Some studies have also achieved more efficient image retrieval by combining deep features with handcrafted features19.

Recent research has concentrated on improving the robustness of image representations by introducing multiscale information in the global image descriptors20,21,22. However, these convolutional neural network (CNN)-based methods either learn multiscale information by constructing image pyramids18,21 or extract multiscale features using convolutional kernels of different sizes and dilation coefficients at the last convolutional layer of the model10,15,20. These methods ignore the problem of information loss caused by constant downsampling during multiscale feature extraction and the problem of information redundancy when fusing multiscale features.

In this study, we propose a novel feature extraction architecture, the convolutional multilayer perceptron (MLP) with orthogonal fusion of multiscale features (ConvMLP-OFMS), to address these problems. The method uses a convolutional architecture to form a multilayer perceptron that extracts more discriminative and robust feature representations, makes full use of the scale information generated during feature extraction, and eliminates information in the scale features that is redundant relative to the global descriptor through feature projection decomposition. Our approach achieved optimal performance on several VPR benchmark datasets, such as Pittsburgh, Nordland, and MSLS. Our contributions can be summarized as follows:

  • We propose a novel MLP-based encoding strategy called ConvMLP, which achieves efficient feature aggregation using a simple architecture and exhibits excellent VPR performance.

  • We propose a VPR architecture for orthogonal fusion of multiscale features and demonstrate the effectiveness of multiscale features generated during feature extraction.

  • We eliminate noise and redundant scale information by enhancing spatial attention and orthogonal projection decomposition to utilize multiscale information more efficiently.

Related work

Visual place recognition

VPR has always been considered an image-retrieval problem2, in which the location of a query image is determined according to the known geographical labels of the images in a reference database. Traditional VPR methods use handcrafted feature extraction operators, such as SIFT8 and HOG9, to obtain the local features of an image. The bag of words (BoW)23 and vector of locally aggregated descriptors (VLAD)24 are then used to aggregate them into a global descriptor representing the entire image, reducing the computation and storage overhead caused by descriptor dimensions. With the rapid development of computer technology, deep learning methods such as CNNs and transformers have achieved excellent results in many computer vision tasks, such as image classification25, object detection26, and image segmentation27. Several researchers have used CNNs and vision transformers to extract image features for VPR. Other work38 proposed an image classification method that improves accuracy by utilizing MLPs to unify the advantages of CNNs with different architectures. These MLP-based models demonstrated that a simple feedforward neural network can achieve performance similar to that of convolutional operations and self-attention in image classification.

Encoding spatial information using an MLP requires that the input dimensions be fixed. However, the resolution of the input image is typically not fixed. Although this problem can be solved by resizing the input image, this method inevitably results in the loss of image information, which affects the final retrieval and localization results.

In this study, owing to the similarities between VPR and image classification39, we propose a novel and efficient feature-aggregation technique, ConvMLP, based on the aforementioned MLP-related studies. We use a 1 × 1 convolution to form an MLP that highlights channel information and adaptive average pooling to strengthen spatial information when obtaining global features, while avoiding the information loss caused by image resizing.

Leveraging multiscale information

Several studies have shown that multiscale information can be utilized to effectively improve the performance of a model in the VPR task.

ConvMLP

A set of 1 × 1 convolutions was used to form a multilayer perceptron. For the feature map \(F \in R^{c \times h \times w}\) acquired by the backbone network, where \(h \times w\) is the feature map size and \(c\) is the number of feature map channels, feature aggregation was achieved by stacking multiple layers of the ConvMLP. The expression for ConvMLP is as follows:

$$\begin{array}{*{20}c} {ConvMLP\left( F \right) = F + W_{2} \left( {\sigma \left( {BN\left( {W_{1} \left( F \right)} \right)} \right)} \right)} \\ \end{array}$$
(1)

where \(W_{1}\) and \(W_{2}\) represent 1 × 1 convolutions, \(\sigma\) is the ReLU nonlinear activation, and \(BN\) represents batch normalization; no inductive bias was used.

Thereafter, the spatial dimension of the feature map was transformed from \(h \times w\) to 1 × 1 using adaptive average pooling. Finally, the dimensionality-reduced feature map was flattened and \(L_{2}\) normalized to obtain the global descriptor \(f_{g} \in R^{c \times 1}\) used to represent the entire image. In this study, the actual dimension of \(f_{g}\) was 1024 because the feature map from ResNet50 was used as input. Figure 2 illustrates the basic flow. The proposed method can be expressed as follows:

$$\begin{array}{*{20}c} {f_{g} = AAP\left( {ConvMLP_{D} \left( {ConvMLP_{D - 1} \left( { \cdot \cdot \cdot ConvMLP_{1} \left( F \right)} \right)} \right)} \right)} \\ \end{array}$$
(2)

where \(AAP\) represents adaptive average pooling, \(ConvMLP\) represents the convolutional multilayer perceptron, \(F \in R^{c \times h \times w}\) represents the feature map obtained using ResNet50, and \(D\) represents the depth of the ConvMLP.
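
As a concrete illustration, the following is a minimal PyTorch sketch of the ConvMLP aggregation in Eqs. (1) and (2); the channel width of 1024, the use of bias-free convolutions, and the class interface are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvMLPBlock(nn.Module):
    """One ConvMLP block (Eq. 1): F + W2(ReLU(BN(W1(F)))), built from 1 x 1 convolutions."""

    def __init__(self, channels: int):
        super().__init__()
        self.w1 = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.w2 = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.w2(F.relu(self.bn(self.w1(x))))


class ConvMLPAggregator(nn.Module):
    """Stacks D ConvMLP blocks, then adaptive average pooling and L2 normalization (Eq. 2)."""

    def __init__(self, channels: int = 1024, depth: int = 1):
        super().__init__()
        self.blocks = nn.Sequential(*[ConvMLPBlock(channels) for _ in range(depth)])
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:  # feat: (B, c, h, w) from the backbone
        g = self.pool(self.blocks(feat)).flatten(1)          # (B, c)
        return F.normalize(g, p=2, dim=1)                    # global descriptor f_g


# Example with a stage-4 ResNet50 feature map of shape (B, 1024, h, w).
f_g = ConvMLPAggregator(channels=1024, depth=1)(torch.randn(2, 1024, 20, 20))  # -> (2, 1024)
```

Stacking more blocks corresponds to increasing the depth \(D\) examined in the ablation experiments below.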

Figure 2. Schematic of the ConvMLP structure.

In summary, instead of focusing on local features or using an attention mechanism, 1 × 1 convolutions were used to aggregate channel features, and adaptive average pooling was used to aggregate spatial features. Building on MLP research, we propose a feature-aggregation method called ConvMLP, which performs full feature aggregation while avoiding an excessive parameter count and computation.

Multiscale feature orthogonal fusion

A common approach used in VPR is spatial pyramid pooling or similar structures to obtain spatial features at different scales. However, as stated in section “Leveraging multiscale information”, this method is only applied to the feature map extracted from the last convolutional layer of the model and does not consider the multiscale information generated during feature extraction; therefore, we use the features at different scales generated during feature extraction for feature fusion.

Considering that feature extraction often produces shallow features that contain more noise, we propose an enhanced spatial attention (ESA) module based on the method proposed by Woo et al.45, as shown in Fig. 3. This module embeds spatial layout information into the feature representation, such that the network focuses on regions that are valuable for VPR and suppresses irrelevant objects and noise.

Figure 3. ESA module.

The ESA module can be represented as follows:

$$\begin{aligned} S\left( F \right) & = f_{max}^{c} \left( F \right) \\ M\left( F \right) & = \delta \left( {f^{1 \times 1} \left( {f^{3 \times 3} \left( {S\left( F \right)} \right) \cup f^{5 \times 5} \left( {S\left( F \right)} \right) \cup f^{7 \times 7} \left( {S\left( F \right)} \right)} \right)} \right) \\ F^{\prime} & = M\left( F \right) \otimes F \\ \end{aligned}$$
(3)

For a feature map \(F \in R^{c \times h \times w}\), we first used max pooling along the channel dimension to highlight the more valuable features. Thereafter, convolution kernels with different receptive fields were used to capture information at more scales. The three feature maps containing different scales were concatenated along the channel dimension, and the attention scores of the input features were obtained using a 1 × 1 convolution and sigmoid function. Finally, the attention scores were broadcast onto the input feature map to obtain the spatial attention-weighted features. The feature maps obtained from the first three convolutional stages of ResNet50 were spatially attention-weighted and then fused in the channel direction to obtain the multiscale features \(f_{s} \in R^{c \times h \times w}\). To facilitate the orthogonal decomposition in the next step, we performed a dimensional transformation of the feature map obtained from the first convolutional layer of ResNet50 using a normal convolution, thus keeping the dimensions of \(f_{s}\) and \(f_{g}\) equal. Thus, the channel dimension of \(f_{s}\) was also 1024, but the spatial information was preserved, and its actual dimensions were 1024 × h × w.
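
The ESA computation in Eq. (3) can be sketched in PyTorch as follows; the assumption that each convolution branch keeps a single channel (the paper specifies only the kernel sizes) and the module interface are illustrative.

```python
import torch
import torch.nn as nn


class ESA(nn.Module):
    """Enhanced spatial attention (Eq. 3): channel max pooling, multi-kernel convolutions,
    1 x 1 fusion with sigmoid, and broadcasting of the attention map onto the input."""

    def __init__(self):
        super().__init__()
        self.c3 = nn.Conv2d(1, 1, kernel_size=3, padding=1)
        self.c5 = nn.Conv2d(1, 1, kernel_size=5, padding=2)
        self.c7 = nn.Conv2d(1, 1, kernel_size=7, padding=3)
        self.fuse = nn.Conv2d(3, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, c, h, w)
        s = x.max(dim=1, keepdim=True).values                 # channel-wise max pooling S(F)
        branches = torch.cat([self.c3(s), self.c5(s), self.c7(s)], dim=1)
        m = torch.sigmoid(self.fuse(branches))                # attention map M(F)
        return m * x                                          # attention-weighted features F'
```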

Traditional multiscale feature fusion is usually realized by tensor concatenation; however, this method does not consider the repetition and redundancy between different descriptors, which affects retrieval accuracy. Therefore, we adopted feature orthogonal projection fusion46,47, in which orthogonal projection decomposition eliminates redundant scale information. The scale and global information obtained can thus be enhanced to generate more compact image descriptors.

The basic principle of feature orthogonal projection fusion is shown in Fig. 4. It requires the global feature \(f_{g}\) and multiscale feature \(f_{s}\) as inputs and calculates the projection \(f_{s,proj}^{{\left( {i,j} \right)}}\) of each multiscale feature \(f_{s}^{{\left( {i,j} \right)}}\) on the global feature \(f_{g}\) pixel-by-pixel, as follows:

$$\begin{array}{*{20}c} {f_{s,proj}^{{\left( {i,j} \right)}} = \frac{{f_{s}^{{\left( {i,j} \right)}} \cdot f_{g} }}{{\left| {f_{g} } \right|^{2} }}f_{g} } \\ \end{array}$$
(4)
Figure 4. Demonstration of the projection of a multiscale feature onto a global feature.

The corresponding orthogonal component \(f_{s,orth}^{{\left( {i,j} \right)}}\) was then obtained by computing the difference between the multiscale feature \(f_{s}^{{\left( {i,j} \right)}}\) and its projection vector \(f_{s,proj}^{{\left( {i,j} \right)}}\), as follows:

$$\begin{array}{*{20}c} {f_{s,orth}^{{\left( {i,j} \right)}} = f_{s}^{{\left( {i,j} \right)}} - f_{s,proj}^{{\left( {i,j} \right)}} } \\ \end{array}$$
(5)

Next, for the orthogonal component map \(f_{s,orth} \in R^{c \times h \times w}\), following the previous method for obtaining global features, the orthogonal components were aggregated into a \(c \times 1\) orthogonal descriptor \(f_{orth}\) using adaptive average pooling and \(L_{2}\) normalization. Finally, \(f_{orth}\) and \(f_{g}\), which had identical dimensions, were concatenated in the channel dimension to obtain the mixed descriptor \(f_{m}\) used for image retrieval and localization.
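
A minimal PyTorch sketch of this orthogonal projection fusion (Eqs. 4 and 5) is given below, assuming \(f_{s}\) and \(f_{g}\) have already been dimension-matched as described above; the small epsilon added for numerical stability and the function name are illustrative.

```python
import torch
import torch.nn.functional as F


def orthogonal_fusion(f_s: torch.Tensor, f_g: torch.Tensor) -> torch.Tensor:
    """f_s: multiscale features (B, c, h, w); f_g: global descriptor (B, c). Returns f_m (B, 2c)."""
    # Pixel-wise projection of f_s onto f_g (Eq. 4): coefficient <f_s, f_g> / |f_g|^2.
    coeff = torch.einsum("bchw,bc->bhw", f_s, f_g) / (f_g.pow(2).sum(dim=1).view(-1, 1, 1) + 1e-12)
    f_proj = coeff.unsqueeze(1) * f_g.view(*f_g.shape, 1, 1)
    f_orth_map = f_s - f_proj                                       # orthogonal component (Eq. 5)
    f_orth = F.normalize(f_orth_map.mean(dim=(2, 3)), p=2, dim=1)   # adaptive avg pooling + L2
    return torch.cat([f_orth, f_g], dim=1)                          # mixed descriptor f_m
```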

Multi-similarity loss

Existing VPR methods11,13,18 typically use triplet loss48 as the loss function of the model, which achieves weakly supervised training by mining the intraclass positive samples \(p\) corresponding to the anchor samples \(a\) and interclass negative samples \(n\). For VPR, \(a\) is typically a single-query image, and \(p\) and \(n\) are typically determined from the ground truth of the images in the reference image database.

$$\begin{aligned} {\mathscr{L}}_{{triplet{ - }loss}} & = \mathop \sum \limits_{i}^{N} \left[ {\left\| {f\left( {x_{i}^{a} } \right) - f\left( {x_{i}^{p} } \right)} \right\|_{2}^{2} - \left\| {f\left( {x_{i}^{a} } \right) - f\left( {x_{i}^{n} } \right)} \right\|_{2}^{2} + \alpha } \right]_{ + } \\ & = \max \left( {D\left( {a,p} \right) - D\left( {a,n} \right) + \alpha ,0} \right) \\ \end{aligned}$$
(6)

Triplet loss uses the Euclidean distance as its metric; the operator \(\left[ \cdot \right]_{ + }\) indicates that the bracketed value is taken as the loss when it is greater than zero and that the loss is zero otherwise. \(\alpha\) is a margin that helps the model learn; its value is determined empirically and is usually set to 0.1. To improve the generalization ability of the model, it is common to select negative samples \(n\) that satisfy \(D\left( {a,n} \right) < D\left( {a,p} \right)\) with respect to the positive samples \(p\), which is known as the hard negative-sample selection strategy.
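
For reference, the following is a minimal sketch of this triplet loss and the hard-negative criterion; the margin value and tensor shapes are illustrative, and the (unsquared) Euclidean distance is used for brevity.

```python
import torch
import torch.nn.functional as F


def triplet_loss(f_a: torch.Tensor, f_p: torch.Tensor, f_n: torch.Tensor, margin: float = 0.1):
    """Hinge-based triplet loss (Eq. 6) over batches of anchor/positive/negative descriptors."""
    d_ap = F.pairwise_distance(f_a, f_p)           # D(a, p)
    d_an = F.pairwise_distance(f_a, f_n)           # D(a, n)
    return F.relu(d_ap - d_an + margin).mean()     # [.]_+ hinge, averaged over the batch


def is_hard_negative(f_a: torch.Tensor, f_p: torch.Tensor, f_n: torch.Tensor) -> torch.Tensor:
    # Hard negatives are closer to the anchor than the positive is: D(a, n) < D(a, p).
    return F.pairwise_distance(f_a, f_n) < F.pairwise_distance(f_a, f_p)
```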

We chose the multi-similarity loss function49 for training, which has been shown to exhibit the best performance in VPR. Multi-similarity loss mitigates the problem of excessively large interclass distances and excessively small intraclass distances in metric learning by considering multiple similarities. Instead of using absolute spatial distances as the only metric, it uses the overall distance distribution of the other pairs of samples in the batch to weight the loss, as follows:

$$\begin{array}{*{20}c} {{\mathscr{L}}_{MS} = \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \left\{ {\frac{1}{\alpha }\log \left[ {1 + \mathop \sum \limits_{{j \in {\mathcal{P}}_{i} }} e^{{ - \alpha \left( {S_{ij} - m} \right)}} } \right] + \frac{1}{\beta }\log \left[ {1 + \mathop \sum \limits_{{k \in {\mathcal{N}}_{i} }} e^{{\beta \left( {S_{ik} - m} \right)}} } \right]} \right\}} \\ \end{array}$$
(7)

where \({\mathcal{P}}_{i}\) represents the set of positive sample pairs for each instance in a batch; \({\mathcal{N}}_{i}\) represents the set of negative-sample pairs for each instance in a batch; \(S_{ij}\) and \(S_{ik}\) denote the similarities between sample \(i\) and samples \(j\) and \(k\), respectively; and \(\alpha\), \(\beta\), and \(m\) are fixed hyperparameters.
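
The following is a minimal sketch of Eq. (7), assuming L2-normalized descriptors and cosine similarity; the hyperparameter values are placeholders, and the pair-mining step used in the original multi-similarity loss is omitted for brevity.

```python
import torch


def multi_similarity_loss(desc: torch.Tensor, labels: torch.Tensor,
                          alpha: float = 2.0, beta: float = 50.0, m: float = 0.5) -> torch.Tensor:
    """desc: L2-normalized descriptors (N, d); labels: place IDs (N,)."""
    sim = desc @ desc.t()                                   # pairwise similarities S
    idx = torch.arange(len(labels), device=labels.device)
    losses = []
    for i in range(desc.size(0)):
        pos = (labels == labels[i]) & (idx != i)            # positive pairs P_i
        neg = labels != labels[i]                           # negative pairs N_i
        pos_term = torch.log1p(torch.exp(-alpha * (sim[i][pos] - m)).sum()) / alpha
        neg_term = torch.log1p(torch.exp(beta * (sim[i][neg] - m)).sum()) / beta
        losses.append(pos_term + neg_term)
    return torch.stack(losses).mean()
```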

Experimental results

In this section, we validate the proposed method on several VPR benchmark datasets and compare it with several state-of-the-art VPR methods to demonstrate its superiority. We describe the experimental details, including the hyperparameters, datasets, and evaluation metric (section “Implementation details”); compare and analyze the proposed ConvMLP-OFMS method against several other VPR methods (section “Comparison with existing methods”); and demonstrate the effectiveness of each component of our architecture through ablation experiments (section “Ablation studies”).

Implementation details

Parameters

We used ResNet50, pre-trained on ImageNet with the last convolutional and classification layers trimmed off, as the feature extraction backbone and trained it on the GSV-Cities50 dataset, a large-scale dataset consisting of more than 560,000 images covering more than 67,000 places. Following the approach of Ali-Bey et al.50, we used multi-similarity loss as the loss function of the model, where each batch contained 120 places with 4 images randomly selected per place; thus, the batch size was 120 × 4 = 480. We used SGD for optimization, with an initial learning rate of 0.05, momentum of 0.9, and weight decay of 0.001; additionally, we used MultiStepLR to decay the learning rate by a factor of 0.3 every five epochs. We trained for a maximum of 30 epochs using images resized to 320 × 320 pixels.
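
The optimizer and scheduler configuration described above can be sketched as follows; the placeholder model and the interpretation of the schedule as a multiplicative decay of 0.3 every five epochs are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1024, kernel_size=1)  # placeholder standing in for the full ConvMLP-OFMS network
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9, weight_decay=0.001)
# Multiply the learning rate by 0.3 every five epochs, for at most 30 epochs of training.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5, 10, 15, 20, 25], gamma=0.3)

for epoch in range(30):
    # ... one training epoch over GSV-Cities batches of 120 places x 4 images (480 images, 320 x 320) ...
    scheduler.step()
```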

Datasets

We used four datasets—Pittsburgh7, MSLS51, SPED52, and Nordland52—to evaluate the proposed architecture. We used two subsets of Pittsburgh: Pitts250k, which contained 8280 query images and 83,952 reference images, and Pitts30k, which contained 7608 query images and 10,000 reference images; all images were collected from Google Street View and mainly exhibit viewpoint changes. MSLS contained 11,120 query images and 18,916 reference images collected from dashcams, with significant viewpoint and lighting variations. SPED contained 607 query images and 607 reference images collected from surveillance cameras, primarily exhibiting intense illumination and seasonal changes. Nordland contained 2760 query images and 27,592 reference images, including extreme illumination and appearance changes, making it a challenging dataset.

Evaluation metric

We followed the same evaluation metrics as in previous studies11,18,43 and used Recall@N as a metric to evaluate the model capability. A query image was considered to be successfully retrieved if at least one of the first N retrieved reference images was located within 25 m of the query image.
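
A minimal sketch of this Recall@N computation is shown below, assuming retrieval results have already been ranked by descriptor similarity and that metric (e.g., UTM) coordinates are available; the names and shapes are illustrative.

```python
import numpy as np


def recall_at_n(ranked_ref_ids: np.ndarray, query_xy: np.ndarray,
                ref_xy: np.ndarray, n: int, threshold_m: float = 25.0) -> float:
    """ranked_ref_ids: (num_queries, K) reference indices sorted by descriptor similarity;
    query_xy / ref_xy: metric coordinates of queries and references."""
    hits = 0
    for q, candidates in enumerate(ranked_ref_ids[:, :n]):
        dists = np.linalg.norm(ref_xy[candidates] - query_xy[q], axis=1)
        hits += bool((dists <= threshold_m).any())   # success if any top-N result is within 25 m
    return hits / len(ranked_ref_ids)
```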

Comparison with existing methods

Comparing with single-stage framework

In this section, we compare several single-stage VPR methods based on global descriptors, namely AVG11, GeM29, NetVLAD11, SPE-NetVLAD13, GatedNetVLAD53, CosPlace39, and ConvAP50, with the proposed ConvMLP-OFMS architecture. All methods used the same feature extraction network and were trained on GSV-Cities. We also refer to some of the results reported in50; the final results are listed in Table 1.

Table 1 Comparison of different techniques for popular benchmarks. The baseline represents the global feature descriptor obtained using adaptive average pooling. Significant values are in bold.

As shown in Table 1, the proposed method outperforms the other methods on several VPR benchmark datasets. On the Pitts250k dataset, we achieved 92.5% Recall@1, a slight improvement over previous methods. On the MSLS dataset, we obtained 86.5% Recall@1, representing improvements of 2% and 3.1% over CosPlace and ConvAP, respectively. This demonstrates the ability of the proposed architecture to cope effectively with viewpoint and illumination variations in VPR. On the SPED and Nordland datasets, which exhibit extreme illumination and appearance variations, we achieved optimal performances of 80.6% and 43.2%, respectively. In addition, as shown in Table 1, after the orthogonal fusion of multiscale features based on ConvMLP, Recall@1 improves on all datasets by up to 3%, indicating the effectiveness of the adopted orthogonal fusion encoding strategy for multiscale information. Figure 5 shows the top five retrieval results of our method under difficult conditions; the proposed method localizes successfully even under extreme environmental changes.

Figure 5. Top five retrieval results of our method; green indicates correct retrieval results.

We also compared the computational cost of our proposed method in terms of floating point operations (FLOPs), number of parameters, and inference time for a single image; the results are shown in Table 1. Our method has a higher computational cost than the AVG and NetVLAD methods but is far superior in terms of Recall@1. Although the FLOPs and parameter count of the proposed ConvMLP are slightly higher than those of ConvAP, its inference is faster, its recall performance is superior, and its overall performance is better. After the orthogonal fusion of multiscale features on top of ConvMLP, the recall performance improves further despite the increased computational cost; the improvement is particularly clear on the MSLS and SPED datasets. Moreover, compared with other methods that utilize multiscale information, our method incurs a smaller computational cost and performs better, which fully demonstrates its superiority.

Comparing against two-stage methods

As mentioned in section “Visual place recognition”, our proposed method belongs to the single-stage framework; another class of methods belongs to the two-stage framework, which primarily uses local features to rerank the retrieval results of a single-stage framework. This reranking can significantly improve performance, but at the cost of more computation time and memory. We compared our method with SuperGlue54, Patch-NetVLAD12, TransVPR17, and R2Former31, all of which are advanced two-stage techniques. Table 2 shows the results, from which it can be seen that our method outperforms most two-stage techniques in terms of Recall@N while significantly outperforming existing two-stage methods in terms of latency. Although our method does not perform reranking, its performance is only slightly worse than that of the state-of-the-art R2Former on the MSLS dataset, and it even outperforms R2Former on the Pitts30k dataset, with improvements of 0.6% in Recall@1 and 1.8% in Recall@5. Meanwhile, our method takes only 5.1 ms to complete the feature extraction of an image, which is faster than all existing methods, and it does not require additional time for reranking.

Table 2 Comparison against two-stage methods in terms of Recall@N. Extraction and reranking latency per query are measured on MSLS using an NVIDIA RTX A5000. Reranking is performed for the top 100 candidates. Significant values are in bold.

Comparison with methods utilizing multiscale information

As shown in Fig. 6, compared with other VPR methods utilizing multiscale information, our proposed architecture not only achieves high retrieval precision, with 86.5% Recall@1 on the MSLS dataset (8.3% and 3.7% higher than SPE-NetVLAD13 and MultiRes-NetVLAD18, respectively), but also requires only 50.6% and 77.4% of their FLOPs, respectively.

Figure 6. Comparison of FLOPs and Recall@1 on MSLS.

Ablation studies

Importance of ConvMLP

To reflect the role of ConvMLP, in this section the global descriptors for retrieval were obtained by varying the number of stacked ConvMLP blocks \(D\). Table 3 presents the results. We set \(D \in \left\{ {0,\;1,\;2,\;4} \right\}\) to perform four sets of experiments; \(D = 0\) is the baseline model, in which the global descriptors were obtained by applying adaptive average pooling to the feature maps from the backbone network. When \(D = 1\), the Recall@1 performance on Pitts30k increased from 83.94% to 91.67%, an improvement of 7.73%, and the performance on MSLS increased from 71.49% to 84.05%, an improvement of 12.56%. Further increases in \(D\) produced little improvement on the Pitts30k and MSLS datasets, and in some cases the accuracy deteriorated. Considering the increase in the number of parameters and FLOPs caused by stacking ConvMLP blocks, we chose \(D = 1\) as the default setting.

Table 3 Ablation of ConvMLP blocks. Significant values are in bold.

To illustrate the feature extraction and expression ability of the ConvMLP more intuitively, heat maps produced by several methods on the input images are presented in Fig. 7; a darker color indicates that the model pays more attention to the region. Compared with the ResNet50, CosPlace, and ConvAP methods, the proposed ConvMLP more accurately highlights the content of the query image, indicating that it has a stronger ability to express key features and can efficiently extract the more critical semantic information in the query image, thus achieving better performance.

Figure 7. Heat maps of input image feature extraction using different methods.

Effects of enhanced spatial attention

This section demonstrates the effectiveness of the proposed ESA module through experimental comparisons. The results are listed in Table 4 and show that adding the ESA improves Recall@1 on the Pitts30k and MSLS datasets by 1.6% and 3.65%, respectively. This indicates that the ESA can effectively remove the noise of the shallow layers and make the model focus on information that is more valuable for VPR.

Table 4 Effectiveness of ESA. “Multiscale” denotes retrieval using only the multiscale features \(f_{s}\) generated during feature extraction, and “MS + ESA” denotes the results of adding ESA to the multiscale features. Significant values are in bold.

This experiment also shows that if the multiscale information generated during feature extraction is used alone to construct image descriptors for retrieval and recognition, the Recall@1 on Pitts30k and MSLS is 8.31% and 13.78% lower, respectively, than that of the global image descriptors obtained by ConvMLP. This indicates that constructing image descriptors from the multiscale information alone is not suitable for solving the VPR problem, because this information contains more shallow features and the deep semantic information is under-represented. Combined with the experimental results in Table 1, this illustrates that using multiscale information to enhance the global descriptors obtained by ConvMLP can effectively increase the robustness and generalization of the descriptors, which again proves the effectiveness of the adopted orthogonal fusion of multiscale features strategy.

Validation of the orthogonal fusion module

To demonstrate the effectiveness of the orthogonal fusion, we conducted a comparison experiment in which the orthogonal fusion module shown in Fig. 1 was removed and the multiscale feature \(f_{s}\) was directly concatenated with the global feature \(f_{g}\). We also explored fusing the two vectors using the Hadamard product, a common method for fusing two descriptors. Table 5 lists the experimental results. Compared with the common fusion method of tensor concatenation, our proposed method improves Recall@1 by 2.31% and 5.54% on Pitts30k and MSLS, respectively. This shows that orthogonal projection can eliminate redundant information in the multiscale features, so that the output multiscale information is richer and more informative. Thus, the large number of shallow features in the multiscale information does not degrade the performance of the global descriptors, thereby achieving complementary enhancement.

Table 5 Comparison with other fusion strategies. Significant values are in bold.

Conclusion

In this study, we proposed ConvMLP, a new feature-aggregation method for VPR that aggregates channel information through convolution and spatial information through adaptive average pooling. Experiments showed that this method can effectively deal with viewpoint changes, illumination changes, and appearance differences in VPR. Second, to address the facts that traditional methods do not fully utilize the multiscale information generated during feature extraction and that traditional feature fusion methods introduce redundant information, we proposed an orthogonal projection fusion strategy for multiscale features. Our framework eliminates as much redundant information as possible in the multiscale features through spatial attention and orthogonal projection. The proposed architecture achieved the best Recall@1 of 91.65% and 86.49% on Pitts30k and MSLS, respectively, indicating that it can effectively avoid the problems of information underutilization and redundant feature fusion. Our method achieved good performance on several publicly available VPR benchmark datasets, with improvements ranging from 0.1 to 5% over existing VPR methods, and outperformed the best existing methods by 5% on the most challenging Nordland dataset. The proposed method can also be generalized to other image-retrieval tasks beyond VPR.

However, this study has some limitations and areas for improvement. Performance saturated prematurely in the ablation experiments as the number of stacked ConvMLP blocks increased. In addition, the performance of our method can still be improved on some datasets, and the method is difficult to apply in regions without a reference image database. In the future, we will incorporate local feature matching to refine the global retrieval results and further improve VPR performance.