Introduction

In recent years, robots have become essential assistants in many fields, including 3D scene reconstruction, target detection, and autonomous driving. The pervasive application of robotics technology across industries has made it an integral part of modern life. Computer vision, a technology that emulates the human visual system and converts collected image information into target disparity information, plays a crucial role in helping robots accomplish their tasks. Currently, most robots rely on costly lidar equipment to obtain high-precision disparity information. However, the principle of binocular vision, which closely replicates the way humans observe objects, is widely utilized in numerous visual tasks. The binocular stereo matching algorithm, a fundamental component of binocular vision theory, directly impacts the accuracy of a robot's target detection. By employing binocular vision theory, a robot can convert two-dimensional information into three-dimensional information of the target scene, thereby obtaining precise scene information.

Stereo matching algorithms are crucial for 3D scene understanding and reconstruction, and have been widely used in various fields, including robot navigation1, autonomous driving2, virtual reality3, and many others. These algorithms aim to calculate disparities, which represent the horizontal displacement of corresponding pixels in a rectified stereo pair. Traditional methods often rely on prior knowledge of the image to construct a stereo matching function that enables the generation of a dense disparity map4.

Currently, convolutional neural networks (CNNs) are widely used in various vision tasks, including object detection5 and image classification6, due to their powerful feature representation capabilities. In recent years, supervised CNN-based stereo matching algorithms have significantly improved stereo matching performance and have become the mainstream research direction. The primary steps of a supervised CNN-based stereo matching algorithm are feature extraction, cost construction, and cost optimization.

However, existing CNN-based stereo matching algorithms are primarily designed as fixed-structure models for specific datasets, while domain adaptive stereo matching has received limited attention from researchers. Moreover, previous studies have typically obtained network parameters through extensive large-batch training, disregarding alternative training strategies. Kendall et al. were the first to propose obtaining features through the ResNet7 structure and producing disparity maps in an end-to-end manner. The domain adaptation module designed in DANet8 helps to reduce domain shift. To enhance stereo matching performance, SegStereo9 incorporates a separately trainable semantic branch that provides disparity edge information for stereo matching. The optimization branch proposed in10 employs a two-stage training process to eliminate redundant information and amplify matching-related information in concatenated volumes. NLCA-Net11 provides a bootstrap branch for optimizing disparity results. A semantic segmentation branch is proposed in12 to incorporate additional semantic information into stereo matching tasks. PGNet13 proposed a panoptic parsing guided deep network for the stereo matching task. A cascading fusion cost volume is proposed in14 to optimize the cost distribution. Rao et al.15 enhanced the stereo matching performance of an existing model by applying a new training strategy during retraining. Sang et al.16 proposed a spatial pyramid pooling attention module to address ill-posed areas and enhance the details of disparity maps by capturing multi-scale context information. The above methods enhance stereo matching performance by optimizing the model's structure and training strategy.

We present a novel stereo matching network that utilizes transfer learning and a customized training strategy to optimize the model. First, we select a prototype network to provide improved parameter initialization for the stereo matching task. Next, to address the issue of inadequate feature learning, we employ a model pre-trained on large-scale datasets to extract general features. These features are then filtered to construct cost volumes that capture the similarity between stereo pairs. Furthermore, we train a feature adapter to enhance the screening capability of the features for stereo matching, thereby minimizing interference from parameters not learned for stereo matching. In contrast to existing algorithms that rely on single-scale features for cost construction, our approach incorporates a domain adaptive cost optimization module that replaces the original module in the prototype. Additionally, to further refine the cost volumes, we adjust the disparity range. Finally, we obtain the final disparity map through a regression method. In summary, our paper makes three contributions:

  • A domain adaptive stereo matching model for robots is proposed, which optimizes the stereo matching performance by grafting general features. Experiments conducted on multiple datasets and real-world scenes demonstrate that the model exhibits remarkable effectiveness across different domains.

  • To capture general feature information, a grafted feature extractor is introduced and adapted to the network using a feature adapter. Additionally, an adaptive cost optimization module is introduced, and a disparity score prediction module is designed to adaptively adjust the disparity search range to optimize the cost distribution.

  • A training strategy is proposed to train the prototype, feature adapter, and domain adaptive cost optimization module; it provides better stage-wise parameter initialization and updates network parameters stage by stage. In addition, the training strategy of stereo matching is studied in this paper.

The paper is organized as follows. Section "Related works" presents the relevant background of stereo matching and introduces related work on traditional and deep learning-based algorithms for stereo matching. The implementation details of the proposed model (Ct-Net) are presented in Sect. "Proposed method". Section "Experimental results and discussions" provides details on the datasets used, experimental results, and discussions. Finally, the paper concludes with a summary in Sect. "Conclusion".

Related works

To date, robots have been widely applied in various fields and have played an undeniable role. Shankar et al.19 presented the Swarm-SLAM system for collaborative simultaneous localization and mapping, which can be effectively applied to swarm robotics. Yang et al.20 proposed a CNN-based binocular vision self-inpainting network for real-time stereo image inpainting on autonomous robots, achieving state-of-the-art performance on image inpainting. Shim et al.21 proposed an inspection robot and management system that utilizes stereo vision to inspect damage on concrete surfaces. Obasekore et al.22 developed a recognition algorithm that utilizes a CNN-based binocular vision system in their agricultural robot to detect early-developmental pest stages in agriculture.

The group-wise correlation is calculated group by group. The feature channel is represented as \({N_{c}}\). All features are divided into \({N_{g}}\) groups along the channel dimension. The calculation formula of group correlation can be expressed as follows,

$$\begin{aligned} C_{g w c}(d, x, y, g)=\frac{1}{N_{c} / N_{g}}<f_{l}^{g}(x, y), f_{r}^{g}(x-d, y)> \end{aligned}$$
(1)

where \(<,>\) represents the inner product operation, and the correlation of features is calculated for the feature group g and all disparity levels d.
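To make the cost construction concrete, the following is a minimal PyTorch sketch of the group-wise correlation volume in Eq. (1). The function name, tensor layout, and loop-based shifting are our own illustrative assumptions, not the authors' implementation:

```python
import torch

def groupwise_correlation(f_left, f_right, max_disp, num_groups):
    # f_left, f_right: [B, C, H, W] feature maps; C must divide evenly into num_groups.
    b, c, h, w = f_left.shape
    channels_per_group = c // num_groups
    cost = f_left.new_zeros(b, num_groups, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            prod = f_left * f_right
        else:
            # Compare each left pixel (x, y) with the right pixel (x - d, y).
            prod = f_left.new_zeros(b, c, h, w)
            prod[:, :, :, d:] = f_left[:, :, :, d:] * f_right[:, :, :, :-d]
        # Mean over each channel group = (1 / (Nc / Ng)) * inner product in Eq. (1).
        cost[:, :, d] = prod.view(b, num_groups, channels_per_group, h, w).mean(dim=2)
    return cost  # [B, num_groups, max_disp, H, W]
```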

Due to the influence of ill-posed regions, the initial cost contains massive noise. This noise in the multi-scale costs is further filtered out by the 3D codec, which mainly consists of 3D convolution layers and 3D deconvolution layers; Fig. 2 shows its main structure. Additionally, we cascade the filtered multi-scale costs to increase the interaction of multi-scale information. Specifically, the high-scale cost is merged with the up-sampled low-scale cost using an addition operation, which increases semantic information acquisition and reduces the loss of detail.
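As a rough sketch of this fusion step, assuming cost volumes shaped [B, G, D, H, W] with the low-scale volume at half resolution, the addition-based cascade might look like the following; the shapes and the trilinear interpolation mode are assumptions on our part:

```python
import torch.nn.functional as F

def fuse_costs(cost_high, cost_low):
    # Upsample the low-scale cost volume (disparity and spatial dims) to the
    # high-scale size, then merge by element-wise addition as described above.
    up = F.interpolate(cost_low, size=cost_high.shape[2:],
                       mode="trilinear", align_corners=False)
    return cost_high + up
```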

Figure 2
figure 2

3D codec structure.

The cost reflects the matching similarity between candidate pixels. However, the cost distribution of a pixel is often multimodal, as shown by the low-scale cost in Fig. 5. This can result in a high disparity error. To alleviate this problem, after fusing the three matching costs from low to high, we adjust the next cost distribution by predicting disparity samples. First, we predict the disparity score for each spatial point, which is then used as input for constructing the last two matching costs. The formula of the disparity score prediction is as follows:

$$\begin{aligned} \left\{ \begin{array}{l} {\hat{d}}=\sum _{\forall d} d \times \sigma \left( -c_{d}\right) \\ F_{\textrm{score}}=\sum _{\forall d}|d-\hat{d}| \times \sigma \left( -c_{d}\right) \end{array} \right. \end{aligned}$$
(2)

where \(\hat{d}\) represents the predicted disparity, d represents the candidate disparity, \(\sigma \) represents the softmax operation, and \(c_{d}\) represents the matching cost. The disparity search range of the next stage can be adjusted according to the disparity score. The disparity search range of each point (i, j) in the next stage can be expressed as:

$$\begin{aligned} \left\{ \begin{array}{l} d_{\min }(i, j)=\hat{d}(i, j)-\alpha F_{\text{ score } }(i, j) \\ d_{\max }(i, j)=\hat{d}(i, j)+\alpha F_{\text{ score } }(i, j) \end{array}\right. \end{aligned}$$
(3)

\(\alpha \) is initialized to 1, which can be learned by the network.
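Eqs. (2) and (3) amount to a soft-argmin disparity estimate, a spread-like score, and a score-scaled search window. A minimal PyTorch sketch, with tensor shapes assumed by us, could read:

```python
import torch
import torch.nn.functional as F

def disparity_score_and_range(cost, disp_candidates, alpha):
    # cost: [B, D, H, W] matching cost; disp_candidates: [D] candidate disparities.
    prob = F.softmax(-cost, dim=1)                                  # sigma(-c_d)
    d = disp_candidates.view(1, -1, 1, 1)
    d_hat = (d * prob).sum(dim=1, keepdim=True)                     # Eq. (2): predicted disparity
    f_score = ((d - d_hat).abs() * prob).sum(dim=1, keepdim=True)   # Eq. (2): disparity score
    d_min = d_hat - alpha * f_score                                 # Eq. (3)
    d_max = d_hat + alpha * f_score
    return d_hat, f_score, d_min, d_max
```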

Because the predicted disparity score maps have different scales, the obtained disparity range maps are up-sampled by bilinear interpolation. After that, we obtain the disparity samples of each point, used as input to the next step, by uniform sampling between \(d_{\min }\) and \(d_{\max }\); the disparity samples can be expressed as:

$$\begin{aligned} d_{\text{sample}}=d_{\min }(i, j)+\frac{s}{S-1}\left( d_{\max }(i, j)-d_{\min }(i, j)\right) \end{aligned}$$
(4)

where S represents the disparity sample size of point (i, j) and \(s \in \{0,1,2, \ldots , S-1\}\). We fuse the disparity samples with the right feature map using a warping operation49, and then construct the matching cost using the group correlation method. This cost is optimized using the 3D codec.
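A small sketch of the uniform sampling in Eq. (4), again with assumed tensor shapes:

```python
import torch

def uniform_disparity_samples(d_min, d_max, S):
    # d_min, d_max: [B, 1, H, W] per-pixel search-range maps.
    # Returns [B, S, H, W]: S evenly spaced samples, s = 0, ..., S-1 (Eq. (4)).
    steps = torch.arange(S, dtype=d_min.dtype, device=d_min.device)
    steps = steps.view(1, -1, 1, 1) / (S - 1)
    return d_min + steps * (d_max - d_min)
```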

Finally, we use the last disparity sample prediction module to obtain the final disparity image.

Evaluation metrics

To quantitatively evaluate the performance of our algorithm, we use xPE and EPE, where xPE denotes the percentage of pixels whose predicted disparity is off by more than x pixels, and EPE denotes the average absolute difference between the predicted disparity and the ground truth.

The evaluation metrics can be expressed as follows:

$$\begin{aligned} x P E&=\frac{1}{N} \sum _{(i, j)}\left( \left| \hat{d}(i, j)-d^{*}(i, j)\right| _{1}>x\right) \times 100 \% \end{aligned}$$
(5)
$$\begin{aligned} E P E&=\frac{1}{N} \sum _{(i, j)}\left( \left| \hat{d}(i, j)-d^{*}(i, j)\right| _{1}\right) \end{aligned}$$
(6)

where N represents the total number of pixels, and \(\hat{d}\) and \(d^{*}\) represent the predicted disparity and the ground truth of pixel (i, j), respectively.
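Both metrics are straightforward to compute; the following sketch evaluates them over all pixels (handling of invalid ground-truth pixels is omitted):

```python
import torch

def epe_and_xpe(d_pred, d_gt, x=3.0):
    # d_pred, d_gt: disparity tensors of identical shape.
    abs_err = (d_pred - d_gt).abs()
    epe = abs_err.mean()                         # Eq. (6): mean absolute error in px
    xpe = (abs_err > x).float().mean() * 100.0   # Eq. (5): % of pixels off by > x px
    return epe.item(), xpe.item()
```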

Experimental results and discussions

In this study, the proposed algorithm is implemented using the PyTorch framework and is trained and tested on a single NVIDIA Tesla V100 GPU with a batch size of 2. The Adam optimizer is used, with parameters \(\beta _1\)=0.9 and \(\beta _2\)=0.999. Scene Flow50 is used as the pre-training dataset, and KITTI51, Middlebury52, and Apollo53 are used to verify the performance of the algorithm.
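For reference, the reported optimizer settings correspond to the following PyTorch setup; the model handle is a placeholder, and the learning rate shown is the pre-training value given below:

```python
import torch

def make_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    # Adam with beta1 = 0.9, beta2 = 0.999, as reported; lr = 0.001 for pre-training.
    return torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
```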

Table 1 Model ablation experiments with different training strategies.

Dataset description

In the experimental part, we use the Scene Flow, KITTI, Middlebury, and Apollo datasets to train and test the model.

Scene Flow50: A large synthetic dataset with an image size of 960\(\times \)540 px, including 35,454 training image pairs and 4370 test image pairs. It provides disparity ground truth, and the maximum disparity is 192. Network training takes about 50 hours for 10 epochs, with the learning rate set to 0.001.

KITTI51: Comprising KITTI 2012 and KITTI 2015, this is a challenging and diverse road-scene dataset with an image size of 1236\(\times \)376 px; only sparse disparity maps are provided as ground truth. We fine-tuned the model on these two datasets. It takes about 48 hours to train the network for 300 epochs, with the learning rate set to 0.001 for the first 200 epochs and 0.0001 for the last 100 epochs.

Middlebury52: A small indoor dataset used to verify the generalization ability of the model in real scenes. The images are provided at three scales: F, H, and Q. The Q-scale data are used for verification, and the maximum disparity is 256.

Apollo53: The Apollo dataset consists of 5165 image pairs and corresponding disparity maps, of which 3324 image pairs are used for training, 832 for validation, and 1009 for testing. Ground truth was obtained by accumulating 3D point clouds from lidar and separately acquiring a dataset of 3D car instances. The dataset contains varied traffic situations with severe occlusion, which makes it challenging.

For each stage, the Scene Flow dataset is used for pre-training because it contains many images and scenes; the relatively small Middlebury, KITTI, and Apollo datasets are used to test the model's performance after fine-tuning.

Analysis of experimental results

We conduct ablation studies on the training strategy and the algorithm modules using the datasets described above.

First, we use the Scene Flow dataset to verify the impact of the training strategy on the model. The results of the ablation experiments are shown in Table 1. Compared with the model trained directly in stage 2, 3PE and EPE decrease when stage 2 training is preceded by stage 1 pre-training. Likewise, compared with the model trained directly in stage 3, the model trained in stage 3 after pre-training in stages 1 and 2 reduces the 3PE and EPE metrics by 0.20% and 0.17 px, respectively. These ablation experiments show that a staged training strategy helps improve model performance. Figure 3 shows the convergence process under different training strategies. Compared with the end-to-end model trained solely in stage 3, the model trained with the stage 3 (stage 1, stage 2) strategy was more accurate at every epoch. Furthermore, compared with the prototype in stage 1, the model in stage 2 showed decreases in the 3PE and EPE metrics, verifying that the general feature extractor can improve stereo matching performance. These experiments show that the training strategy affects the performance of the final model. Ablation studies were then conducted on the individual modules, with the following results.
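Although the paper does not publish training code, the staged strategy suggests initializing each stage from the previous stage's checkpoint and freezing already-trained sub-modules. The sketch below is our reading of that procedure, with module names and checkpoint layout assumed:

```python
import torch

def load_stage_and_freeze(model, ckpt_path, freeze_prefixes=()):
    # Initialize the current stage from the previous stage's checkpoint;
    # newly added modules (absent from the checkpoint) keep their fresh init.
    state = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state, strict=False)
    # Freeze sub-modules already trained in earlier stages,
    # e.g. the grafted ResNet backbone inside the GFE.
    for name, param in model.named_parameters():
        if any(name.startswith(p) for p in freeze_prefixes):
            param.requires_grad = False
    return model
```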

Figure 3
figure 3

Convergence process of models with different training strategies. (stage x) denotes the pre-training model of stage x. The staged training strategy decreases the matching error rate compared with the end-to-end training strategy, and reasonable pre-trained models raise the upper limit of the final results.

Table 2 Experimental results of different network settings on multiple datasets.
Figure 4
figure 4

Comparison of feature visualization samples obtained by the prototype's feature extractor and the GFE. From left to right: the left image, the features acquired by the prototype's feature extractor, and the features acquired by the GFE.

We compare the prototype with the grafted ResNet model described in this paper. As Table 2 shows, on the KITTI dataset the 3PE and EPE metrics of the model with the General Feature Extractor (GFE) drop from 4.6% and 0.89 px to 3.9% and 0.83 px, respectively. For the Middlebury dataset, the accuracy with the GFE also improves slightly. Moreover, since the parameters of the ResNet module in the GFE were pre-trained on the ImageNet dataset and are kept fixed, they need not be updated during model training, which improves training efficiency.

We also conducted a qualitative analysis of the features acquired by the different feature extractors. Feature visualization samples are shown in Fig. 4. There are obvious differences between the features obtained by the prototype's feature extractor and those obtained by the GFE: the latter contain more semantic and texture information, which is key to reducing the high matching error rate in ill-posed regions. Both the quantitative and qualitative results show that the GFE is beneficial for stereo matching tasks.

The ablation results for the domain adaptive cost optimization module (DACOM) are shown in Table 2; the DACOM achieves better performance than the stacked hourglass structure of the prototype. Specifically, on the KITTI dataset, compared with the stacked hourglass structure, the 3PE and EPE of the model with the DACOM decrease from 5.3% and 0.94 px to 3.5% and 0.82 px. Meanwhile, on the Middlebury dataset, the 3PE and EPE metrics decrease from 22.63% and 5.85 px to 22.01% and 5.35 px, respectively. These quantitative results show that the adaptive cost optimization strategy achieves better performance.

Additionally, we conducted ablation experiments on the multi-scale cost cascade strategy; the results are shown in Table 3. It can be seen that as more cost scales are added, the 3PE and EPE metrics decrease simultaneously. Specifically, on Scene Flow, compared with using only the high-scale cost, the 3PE and EPE metrics of the (high, medium, and low) cost combination decreased by 6.8% and 0.05 px. To further explore the role of multi-scale costs in stereo matching, we set up a contrast experiment whose results are shown in Fig. 5. The cost distribution tends to be multimodal for a single-scale cost (the solid blue line in Fig. 5), which hinders obtaining optimal disparity results from the matching costs. When we visualize the multi-scale cost, the distribution tends to be unimodal (the solid yellow line in Fig. 5), and the optimal cost value approaches the disparity ground truth (the disparity value corresponding to the yellow dotted line in Fig. 5). From the quantitative and qualitative results we deduce that multi-scale costs can reduce false matches caused by multimodal cost distributions. We hypothesize that, because the input image contains ill-posed regions, an inaccurate initial low-scale matching cost often leads to matching errors and irreversible results, and the supplementary multi-scale information mitigates this.

Table 3 Ablation experiments of the three multi-scale costs of the domain adaptive cost optimization module. We calculated 3PE and EPE on the Scene Flow and KITTI verification sets, respectively.
Figure 5
figure 5

Cost distribution of multi-scale costs. As the cost scale increases, the cost distribution gradually tends to a unimodal distribution whose peak is near the ground truth.

Figure 6
figure 6

Comparison of initial disparity, initial error, disparity score, optimized disparity, and optimized error. The binocular images are from https://vision.middlebury.edu/stereo/. Warmer colors in the error map indicate higher error rates. False disparities are mostly present on the ground or at object edges, and the corresponding disparity scores are relatively high in these areas. After adjustment with the disparity score, warm colors are significantly reduced in the error map and the disparity edges become smoother.

As discussed above, matching costs are closely related to the disparity results, so further optimizing the matching cost becomes a key step. We use the disparity sample prediction module to adaptively adjust the candidate disparity range before constructing the matching cost. The ablation experiment results are shown in Table 2. From the results, it is evident that fusing the disparity samples before cost construction decreases both the 3PE and EPE metrics on the Scene Flow and KITTI datasets, which suggests that adding disparity samples can improve stereo matching performance. Furthermore, since the disparity search range of each spatial point must be predicted before generating the disparity samples, and this range is derived from the predicted disparity score, we visualize both the disparity score map and the error map. The visualization results are presented in Fig. 6. As can be observed, regions with high disparity scores consistently exhibit higher errors, suggesting a close relationship between the disparity score and the regions that require optimization. The disparity map and error map optimized using the disparity score are superior to the initial ones, highlighting the disparity adjustment capability of the domain adaptive cost optimization module.

Table 4 Ablation experiments on the disparity sample size S and stereo matching performance.

Furthermore, we set up ablation experiments to verify the relationship between the disparity sample size S and stereo matching performance. The results are shown in Table 4: as S increases, stereo matching performance gradually improves. This accords with the intuition that more disparity samples yield higher disparity accuracy. Balancing network runtime against accuracy, we set S to 30 in this paper. In summary, the domain adaptive cost optimization module can optimize the cost distribution and further improve stereo matching performance.

Finally, we conducted an ablation experiment on the loss function; the results are shown in Table 2. Mixing in the MAE loss yields better results than using the Smooth L1 loss alone.

Table 5 Comparing experimental results of our method with other methods on the KITTI 2012 benchmark.

Based on the above discussions, we can conclude that the proposed modules and training strategy are effective in improving the performance of stereo matching.

Cross-domain generalization performance

One of the primary challenges in cross-domain stereo matching is the domain shift problem. This issue arises when a model trained on one domain (or dataset) performs poorly when applied to a different domain due to variations in image characteristics, such as lighting conditions, camera parameters, and scene compositions.

In this section, to verify the cross-domain generalization performance of the algorithm, we selected the Middlebury, KITTI, and Apollo datasets as the test set and the Scene Flow dataset as the training set.

Experiments on the KITTI 2012 and KITTI 2015 datasets

The comparison results are presented in Tables 5, 6, 7, and 8. The final submission results on the KITTI benchmark are shown in Tables 5 and 6, and the evaluation metrics are xPE percentage for all regions and non-occluded (Noc) regions. In the KITTI 2012 benchmark, the proposed algorithm showed a significant improvement in xPE percentage compared to the traditional algorithm SGM27. Additionally, when compared to the high-precision deep learning algorithm AANet+54, which efficiently performs cost aggregation using sparse point-based feature representation, the proposed algorithm demonstrated lower xPE in all regions. Compared to other deep learning-based stereo matching algorithms, such as PVStereo55, PDSNet34, SegStereo9, and HSM56, the proposed algorithm achieved the lowest xPE percentage. However, when compared to the state-of-the-art methods CFNet and LEAStereo43, the proposed algorithm still performed relatively poorly.

Table 6 Comparing experimental results of our method with other methods on the KITTI 2015 benchmark.
Table 7 Comparing experimental results with other methods on the Middlebury dataset.
Table 8 Experimental results of our method compared with the prototype on the Apollo dataset.

In addition, as shown in the black boxes in Fig. 7, our method achieves better disparity predictions of image details and overall target structure, and produces smoother disparity maps than SGM27. Although PSMNet32 produces better disparity maps than the traditional SGM algorithm, it cannot produce correct disparity results in areas such as car windows, whereas the proposed algorithm achieves better results there. SegStereo9 introduces image edge information to improve disparity edges; compared with SegStereo, the proposed algorithm achieves better results on fence railings and vehicle chassis. CFNet14 uses multi-scale cost optimization to obtain better disparity results; compared with CFNet, the proposed algorithm achieves a comparable effect in disparity detail regions. In addition, compared with LEAStereo43, which has performed well in recent years, the disparity results produced by our algorithm also perform well on the road. The KITTI 2012 benchmark results show that the performance of our algorithm is comparable to that of existing advanced algorithms.

Figure 7
figure 7

Qualitative results on the KITTI benchmark. We compare the disparity maps of our method with those of other algorithms. The left two columns are KITTI 2012 samples, and the right two columns are KITTI 2015 samples. Black boxes mark areas with obvious differences.

Figure 8
figure 8

Qualitative results on the Middlebury dataset. The binocular images are from https://vision.middlebury.edu/stereo/. From top to bottom: the left image, the ground truth (GT), and the disparity maps of Census, FADNet, iResNet, and Ct-Net (ours).

Figure 9
figure 9

Qualitative results on the Apollo test dataset. The binocular images are from https://apolloscape.auto/stereo.html. The first row shows the left images, the second row the PSMNet disparity maps, and the third row the disparity maps predicted by our network. Black boxes mark areas with obvious differences.

In the KITTI 2015 benchmark, when compared to the prototype PSMNet32, Ct-Net showed a significant improvement, with decreases of 19.3% and 21.1% in \(3PE-fg\) for all regions and non-occluded regions, respectively. Furthermore, the proposed algorithm achieved a lower xPE percentage than competing methods. Compared with the high-precision deep learning algorithm AANet+54, the proposed algorithm improves the \(3PE-fg\) metric on all regions and non-occluded regions. In addition, compared with other deep learning-based stereo matching algorithms, such as PVStereo55, PDSNet34, SegStereo9, and HSM56, the proposed algorithm obtains the lowest xPE percentage. The qualitative results of the KITTI 2015 benchmark are shown in Fig. 7. Our algorithm achieves more detailed and accurate predictions than SGM. Compared with PDSNet34, SegStereo9, and HSM56, the proposed algorithm achieves better results on fence railings and highway street signs. Similarly, compared to the state-of-the-art algorithms CFNet and LEAStereo, the proposed algorithm still has room for improvement. However, the qualitative and quantitative results on the KITTI 2015 benchmark demonstrate that our algorithm is well-suited for stereo matching tasks in road scenes.

Experiments on the Middlebury dataset

The test results on the Middlebury benchmark are shown in Table 7. Compared with deep learning-based methods such as FADNet57, PSMNet32, and AANet54, the proposed method has a lower error rate on all samples. Compared with the high-precision deep learning algorithm iResNet58, the proposed algorithm performs better on the Bicycle2, Crusade, DjembeL, Livingroom, and Staircase samples; on the remaining samples, the differences in error rate are slight. The qualitative results are shown in Fig. 8, where we compare different methods on six Middlebury samples. Compared with the traditional Census method24, the proposed algorithm attains better disparity edges and improves performance in ill-posed regions such as thin structures and texture-less regions. Compared with the deep learning-based PSMNet32, the proposed algorithm performs better on details.

Experiments on the Apollo dataset

Finally, we compared the proposed algorithm with PSMNet32 on the Apollo dataset. As shown in Table 8, our algorithm outperforms PSMNet32 on all metrics. The qualitative results are shown in Fig. 9; compared with PSMNet32, our algorithm performs better in detail areas such as bicycles and pedestrians.

The qualitative and quantitative analysis results show that the proposed algorithm achieves promising results on multiple datasets.

Experiments for real-world scenes

This section verifies the performance of the proposed algorithm in multiple real-world scenes. The experimental platform, shown in Fig. 10, consists of a binocular vision system and a mobile base; images are collected at 1280\(\times \)1024 px.

The hardware configuration of the car is as follows: a pair of CMOS cameras forms the binocular camera system, which acquires images from the left and right viewpoints at 10 frames per second. The car uses an embedded processor, an NVIDIA Jetson Nano running Ubuntu 18.04. For the runtime environment, the algorithm is invoked through OpenCV's C++ interface to perform binocular stereo matching, and the car is independently powered by a lithium battery. The distance measurement pipeline comprises the stereo matching algorithm, distance measurement, and map modeling. We captured images and generated disparity maps in multiple indoor and outdoor scenes using this device. Since indoor disparity is usually higher than outdoor disparity, we set the maximum indoor disparity to 256 and the maximum outdoor disparity to 192. The results of the outdoor experiments are presented in Fig. 11, and the indoor results in Fig. 12.
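On such a platform, a disparity map is converted to distance with the standard pinhole-stereo relation Z = f·B/d. Since the focal length and baseline of this rig are not reported, the arguments below are placeholders:

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    # Z = f * B / d; zero-disparity pixels map to infinite distance.
    d = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full(d.shape, np.inf)
    valid = d > 0
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth  # distance in meters per pixel
```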

Figure 10
figure 10

Experimental platform of binocular vision robot.

Figure 11
figure 11

Results in outdoor real-world scenes.

Figure 12
figure 12

Results in indoor real-world scenes.

It is worth noting that our cross-domain stereo matching model predicts disparity directly in these real scenes without retraining, which tests its cross-domain capability. The disparity maps generated in various real-world indoor and outdoor scenes demonstrate that the proposed stereo matching algorithm exhibits valuable cross-domain generalization ability and can satisfy the requirements of various robot vision tasks.

Conclusion

Computer vision plays a crucial role in enabling robots to acquire depth information of objects and accomplish tasks by simulating the human visual system. This paper proposed a stereo matching network based on transfer learning for domain adaptive stereo matching tasks in robotics. The model is specifically designed to meet the requirements of robots in multiple scenes, and a comprehensive training strategy is formulated to train the network effectively. Furthermore, a general feature extractor is introduced to obtain general feature information, and an adapter is designed to adapt the general features to the network's cost optimization model. To reduce the domain shift problem, an adaptive disparity optimization module is proposed to update disparity in stages. Compared with the prototype PSMNet, on the KITTI 2015 benchmark the \(3PE-fg\) of Ct-Net in all regions and non-occluded regions decreased by 19.3% and 21.1%, respectively, and on the Middlebury dataset the proposed algorithm reduces the per-sample error rate by at least 28.4% (on the Staircase sample). Experiments on multiple datasets show that the proposed algorithm and training strategy can improve the cross-domain performance of stereo matching.

Our future research will focus on improving the generalization ability of the algorithm and conducting experiments in various domains. Specifically, we plan to integrate the attention mechanism of the Transformer to enhance the matching accuracy and explore the potential of the segmentation task to optimize the matching result of ill-posed regions. Ultimately, we aim to apply the proposed algorithm to an even wider range of real-world scenes.