Abstract
This paper tackles the challenge of accurate depth estimation from monocular laparoscopic images in dynamic surgical environments. The lack of reliable ground truth, due to inconsistencies within these images, makes this a complex task. Further complicating the learning process is the presence of noise such as bleeding and smoke. We propose a model learning framework that uses a generic laparoscopic surgery video dataset for training, aimed at achieving precise monocular depth estimation in dynamic surgical settings. The architecture employs binocular disparity confidence information as a self-supervisory signal, along with the disparity information from a stereo laparoscope. Our method ensures robust learning amidst outliers caused by tissue deformation, smoke, and surgical instruments by utilizing a unique loss function, which adjusts the selection and weighting of depth data for learning based on their given confidence. We trained the model using the Hamlyn Dataset and verified it with Hamlyn test data and a static dataset. The results show excellent generalization performance and efficacy across various scene dynamics, laparoscope types, and surgical sites.
Introduction
The increasing demand for minimally invasive surgery in recent years has led to the growing popularity of endoscopic surgery, including laparoscopic operations1. One of the reasons for the widespread use of laparoscopic surgery is the increased sophistication of information provided by laparoscopes. In particular, with the advent of improved stereo laparoscopes and 4K/8K laparoscopes, it has become possible to acquire spatial information from laparoscopic images during surgery and examination. As a result, early detection of minute lesions, more accurate measurement of tumor size, and a more precise grasp of the positional relationship between the organ and surgical instruments during surgery are now feasible. Furthermore, the acquired 3D information is expected to be applied to computer-assisted surgery, as typified by robot control support and surgical navigation2,3,4,5.
Laparoscopes are classified into stereo laparoscopes and monocular laparoscopes according to the number of optical systems at the tip of the scope. The mainstream stereoscopic laparoscope is a stereo laparoscope that incorporates two optical systems at the tip. With a stereo laparoscope, depth can be estimated by stereo vision from the disparity between the left and right images. Semi-Global Matching (SGM)6 and Efficient LArge-scale Stereo (ELAS)7 estimate depth from binocular disparity by performing matching on a single image pair, with the latter method offering real-time capability. These methods can estimate depth from a pair of images acquired simultaneously from a stereo camera and reconstruct dense depth information even when the images vary over time with deformation due to organ excision or grasping. Particularly in the in vivo depth estimation task, ELAS is a typical method that has served as a baseline and as a source of ground-truth depth. In contrast, multi-viewpoint methods such as Structure from Motion (SfM) and Visual SLAM18 estimate depth with monocular images obtained from multiple viewpoints. However, such pseudo-stereo depth estimation methods have a limited imaging speed because the physical acquisition of two images is inherently required. This limitation reduces the achievable frames per second (fps), which may make their application difficult during operations where dynamic changes are large. In addition, while multi-viewpoint depth estimation methods such as SfM and Visual SLAM estimate depth by reconstructing the camera pose and the three-dimensional structure of feature points extracted from the images, dynamic image scenes obtained within the living body often provide only scarce and uniform texture compared to general natural images, resulting in a very non-uniform and sparse reconstructed 3D structure.
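As a concrete illustration of the stereo principle above, the following minimal sketch converts a disparity map to metric depth via triangulation; the focal length and baseline values are hypothetical examples, not calibration data from any particular laparoscope.

```python
import numpy as np

# Minimal sketch of stereo triangulation: depth = f * B / disparity.
# The focal length (pixels) and baseline (meters) below are hypothetical
# example values, not calibration data from a real stereo laparoscope.
def disparity_to_depth(disparity_px, focal_px=700.0, baseline_m=0.004):
    """Convert a disparity map (pixels) to metric depth, marking invalid pixels as NaN."""
    depth = np.full(disparity_px.shape, np.nan, dtype=float)
    valid = disparity_px > 0              # zero/negative disparity means no match
    depth[valid] = focal_px * baseline_m / disparity_px[valid]
    return depth

disp = np.array([[14.0, 28.0],
                 [0.0,  7.0]])            # one invalid pixel (disparity 0)
depth = disparity_to_depth(disp)          # e.g. 700 * 0.004 / 14 = 0.2 m
```

Note that depth is inversely proportional to disparity, which is why nearby tissue (large disparity) demands matching over a wide disparity range.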
Moreover, in actual surgery, the accuracy of reconstruction may deteriorate significantly because the entire surgical field is constantly changing due to the autonomous movement of organs and their deformation upon contact with surgical instruments such as forceps. Therefore, to establish a monocular depth estimation method that remains effective in intraoperative settings with large dynamic changes, a learning-based method using a single monocular laparoscopic image from one viewpoint is advantageous. However, although monocular depth estimation for urban and natural images has been actively studied, realizing monocular depth estimation for clinical applications in computer-assisted surgery, such as surgical navigation in actual surgical environments, remains a challenging and unsolved problem. In particular, there is a significant problem in learning-based monocular depth estimation using a laparoscope for surgical applications. Although it is desirable to train monocular depth estimation models for surgical applications on actual surgical scenes, such scenes contain various deformations, smoke, and other artifacts, which may generate unreliable supervisory signals and fail to yield desirable models. To the best of our knowledge, no learning architecture has yet been proposed for monocular depth estimation of dynamic laparoscopic scenes that takes the reliability of the supervisory signal into account.
In light of the existing challenges, this study aims to enhance learning and estimation accuracy within dynamic laparoscopic settings, with a particular emphasis on surgical navigation applications. One of the primary obstacles is the acquisition of a training dataset that accurately represents the real surgical environment. However, genuine surgical images contain substantial noise, making it difficult to establish an accurate ground truth.
To address this issue, we introduce a confidence level for the supervisory signal and propose a confidence-aware learning architecture that allows real laparoscopic surgical images to be used as the training dataset. This approach aims to adapt to dynamic laparoscopic scenarios characterized by continuous changes in organ deformation, bleeding, and laparoscope positioning.
This paper presents depth estimation for dynamic monocular laparoscopic scenes using the Hamlyn Dataset8 and evaluates its performance, resulting in an improved overall depth estimation accuracy independent of target motion. The primary contributions of the proposed method include:
-
A platform that enables depth estimation in the actual surgical field, where obtaining datasets with accurate ground truth is challenging, by acquiring indirect depth information from stereo endoscopic images during real surgery as a training dataset, without requiring precise depth information from LiDAR or other sources.
-
A mechanism that handles irregularities such as noise in the training dataset by taking not only the depth information but also its confidence as input, enabling the selection and weighting of the supervisory signal.
-
A monocular depth estimation model learning architecture that employs a loss function together with a stereo-vision-based self-supervisory signal module, which outputs disparity and its confidence level. The monocular depth estimation model trained in this manner facilitates learning in dynamic laparoscopic scenes and single-frame monocular depth estimation. This capability is crucial in laparoscopic surgery, where the endoscopic camera is frequently repositioned based on the operating field conditions.
In the field of computer vision, the recent development of DNNs has promoted research on learning-based monocular depth estimation methods, with many applications to urban scenes. In the laparoscopic scene as well, there have been several studies on depth estimation from monocular images, such as Ye et al.22, and SfMLearner-style self-supervised methods have been refined by introducing a per-pixel minimum reprojection loss and an auto-masking process that removes stationary pixels23. However, according to recent research by Shao et al.24, the light source often moves in the endoscopic scene, which is very unlikely in the urban scene, so the illumination conditions between images do not match. Moreover, the surface of a living body has few distinctive features, and strong non-Lambertian reflections may also occur. It was confirmed that the performance of the conventional SfMLearner, which was designed for urban scenes, deteriorates under these adverse conditions. To improve the performance of SfMLearner for endoscopic scenes, Shao et al.24 extended it by introducing appearance flow to align the brightness conditions between frames and adopting a feature scaling module to refine the feature representation, successfully adapting the method to the endoscopic scene.
These methods are highly effective for applications such as endoscopic diagnosis of static sites. However, problems arise in their application to dynamic surgical scenes. The conventional methods were designed on the premise that most images show static tissues without deformation from interaction with forceps, so learning with a stationary and clean dataset may be possible. Meanwhile, laparoscopic images obtained during actual surgery contain deformation of the organs themselves, spatial changes due to bleeding and excision, and dynamic deformation and artifacts due to interaction with forceps. If we consider applications in actual dynamic surgical scenes, image sequences obtained from real surgery may be desirable for developing an estimation model. However, data available from surgery are often noisy and less accurate than those obtained in a static environment. Therefore, to realize learning using data from surgical scenes, a novel architecture is required in which learning is not impaired by dynamic deformation of objects in the dataset and can be controlled by selecting, removing, and weighting the information based on its accuracy.
Meanwhile, in the field of computer vision, where CNNs have long been the method of choice, a technique called Vision Transformer (ViT)25 has recently been introduced, showing performance similar to or better than state-of-the-art methods. ViT was derived from the Transformer26, which was highly successful in natural language processing. In particular, the Dense Prediction Transformer (DPT)27 was proposed as a dense monocular depth estimation architecture that uses ViT as its backbone and exhibits performance equal to or better than conventional monocular depth estimation models that employ CNNs as their backbone. Since ViT captures global context better than CNNs, it is expected to show superior performance in monocular depth estimation in the laparoscopic scene, where detailed and consistent estimation of the entire image is required. To the best of our knowledge, ViT has not yet been adopted as the backbone of a monocular depth estimation model for the laparoscopic scene.
Since application in surgical scenes is assumed, where the estimation model has to cope with dynamic laparoscopic scenes distorted by various deformations and artifacts such as smoke, it is practical to provide supervisory signals via stereo vision for training a monocular depth estimation model that does not depend on the dynamics of an object. However, depth estimation using conventional stereo vision only provides a disparity prediction without any consideration of its confidence. Therefore, it was difficult to flexibly handle outliers and select measurement results depending on their confidence level. Consequently, outliers and inaccurate signals were undesirably included in supervisory signals, hampering learning and leading to a poor estimation model in some cases. However, stereo matching methods that provide confidence information as well as disparity have recently been proposed28,29. In particular, the STereo TRansformer (STTR)29 provides the reliability of its predictions while improving accuracy by employing ViT for matching. It has shown highly accurate results on SCARED30, a medical dataset from laparoscopic surgery. Such advancement in stereo vision has made it possible to measure ground truth with higher accuracy than before, even in dynamic laparoscopic scenes, by appropriately filtering out outliers according to confidence information. It has also enabled the generation of supervisory signals that incorporate not only depth information but also its confidence. This suggests that not only high-precision disparity but also the confidence of the prediction can be used as a learning resource for the monocular depth estimation model.
As a result, even in the learning of dynamic laparoscopic scenes where outliers are likely to be included in depth supervisory signals, robust and efficient learning can be performed to provide high accuracy estimation if signals are processed appropriately according to the confidence information.
However, conventional architectures that utilize only depth information as a supervisory signal still do not consider its uncertainty, failing to make the best use of the confidence information. Moreover, a learning architecture for a monocular depth estimation model that takes advantage of the confidence information of the supervisory signal is yet to be proposed in the monocular depth estimation task for the laparoscopic scene.
Methods
In learning methods for monocular depth estimation models that simultaneously learn scale and shift parameters, such as those used in previous studies focusing on endoscopy, overfitting occurs due to biases in the training data caused by camera-specific characteristics and the shooting environment31. Therefore, various scale- and shift-invariant losses have been introduced in recent studies, achieving state-of-the-art performance in monocular depth estimation32. Specifically, acquiring scale- and shift-invariant models is crucial for monocular depth estimation in dynamic settings like real laparoscopic surgery, offering two primary advantages in clinical applications.
Firstly, these models can be trained on diverse datasets featuring various organs, surgical techniques, and endoscopic cameras. The normalization of relative depth features in relation to scale and shift allows for training on a combination of in vivo datasets from different environments, resulting in a versatile monocular depth estimation model. Furthermore, the capacity to utilize a mix of datasets as training data enables the effective use of in vivo datasets, which are typically scarce and challenging to obtain, thus expanding the dataset.
Secondly, the distance between the laparoscope and the subject constantly changes in actual laparoscopic surgical settings, causing significant variations in scale and shift parameters. The scale and shift values of objects in the operating field differ between surgeries and over time, leading to substantial deviations from the training data. For instance, when suturing blood vessels, the suture area is the focal point, whereas when handling organs or navigating the surgical field, the entire abdominal cavity is viewed from a withdrawn perspective to avoid instrument contact with other organs and potential damage. Therefore, learning scale and shift parameters considerably reduces generalizability, even more so than in depth estimation for landscapes and buildings as used in autonomous driving. Consequently, it is vital to learn relative depth features independent of scale and shift parameters and to develop scale- and shift-invariant monocular depth estimation models for various surgical procedures.
Therefore, we propose a scale- and shift-invariant loss function that takes into account the confidence of self-supervisory signals to appropriately learn relative depth information independently of the scale and shift parameters, under the assumption that a flow to restore the estimated relative depth to absolute depth is applied afterwards. This method enables us to obtain a monocular depth estimation model suitable for application in surgical environments.
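The restoration flow assumed above, mapping a scale- and shift-invariant prediction back to absolute depth, can be sketched as a least-squares fit against a few absolute anchor depths (e.g. from a calibrated stereo frame); the anchor values below are synthetic placeholders, not data from the paper.

```python
import numpy as np

# Sketch of the post-hoc restoration step: fit a scale s and shift t that map
# the model's relative prediction onto a few pixels with known absolute depth.
# The anchors here are synthetic placeholders consistent with s=0.05, t=0.02.
def restore_absolute(relative, anchor_idx, anchor_depth):
    r = relative.ravel()[anchor_idx]
    A = np.stack([r, np.ones_like(r)], axis=1)       # columns: [relative, 1]
    (s, t), *_ = np.linalg.lstsq(A, anchor_depth, rcond=None)
    return s * relative + t

relative = np.linspace(0.0, 1.0, 100).reshape(10, 10)
idx = np.array([0, 50, 99])                          # hypothetical anchor pixels
absolute = restore_absolute(relative, idx, 0.05 * relative.ravel()[idx] + 0.02)
```

With exact anchors the fit recovers the true scale and shift, so the restored map matches the underlying absolute depth.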
In the following, we describe a method for training monocular depth estimation models using the disparity and its confidence estimated from stereo laparoscope images as self-supervisory signals, in order to achieve effective learning in dynamic laparoscopic scenes and dense monocular depth estimation from a single laparoscope image. The overall training architecture is shown in Fig. 1. It consists of a monocular depth estimation model that actually undergoes training, a self-supervisory signal module composed of stereo vision that outputs disparity and its confidence, and a confidence-aware loss module that includes a scale- and shift-invariant loss taking disparity and its confidence as input.
The proposed network architecture. The network in the training phase (top) consists of a stereo vision-based self-supervisory signal module that provides disparity and its confidence as self-supervisory signals, a monocular depth estimation module that actually performs the learning, and a newly designed scale and shift-invariant loss that takes disparity and its confidence as input. In the application phase (bottom), monocular depth estimation is performed using only the learned monocular depth estimation module with a single image as input.
Confidence-aware loss module
Confidence
The self-supervisory signal module outputs the estimated disparity and a corresponding confidence, expressed as a real number between 0 and 1, from a pair of stereo laparoscope images. In the following, the confidence is denoted by \(q\in \left[ 0,1\right] \ \). Here, disparity is calculated for every pixel in the training data image, ultimately yielding a disparity and confidence pair for each pixel.
Confidence-weighted scale and shift-invariant loss
In this section, we extend the scale and shift-invariant loss proposed in Ranftl et al.33 and propose a loss weighted according to the confidence of the self-supervisory signal. First, let M be the number of pixels in the image that have valid ground truth, and let \(\theta \) be a parameter of the prediction model. Let \(d=d(\theta )\in {\mathbb {R}}^M\) be the estimated disparity and \(d^*\in {\mathbb {R}}^M\) be the ground truth of the corresponding estimation. To define a scale and shift-invariant loss, we first need to properly fit the scale and shift of the predictions and ground truth. The alignment of the scale and shift between the predictions and ground truth is then performed based on a least-squares criterion:
\[ (s, t) = \mathop {\mathrm {arg\,min}}\limits _{s,t} \sum _{i=1}^{M} \left( s d_i + t - d_i^*\right) ^2, \qquad {\hat{d}}_i = s d_i + t, \qquad {\hat{d}}_i^* = d_i^*, \]
where \(s\in {\mathbb {R}}_+\) is the scale parameter, \(t\in {\mathbb {R}}\) is the shift parameter, and \({\hat{d}}_i, {\hat{d}}_i^*\in {\mathbb {R}}\) are the aligned prediction and ground truth. The value corresponding to each pixel i is denoted by the subscript i. Then, with \(w\left( q\right) \) as the weight corresponding to the confidence q of the self-supervisory signal of the pixel obtained from the self-supervisory signal module, we define the confidence-weighted scale and shift-invariant loss \(L_{cwssimse}\) for a single image as
\[ L_{cwssimse}\left( \hat{{\textbf{d}}}, \hat{{\textbf{d}}}^*, {\textbf{q}}\right) = \frac{1}{2M} \sum _{i=1}^{M} w\left( q_i\right) \left( {\hat{d}}_i - {\hat{d}}_i^*\right) ^2, \]
where \(\hat{{\textbf{d}}}, \hat{{\textbf{d}}}^*, {\textbf{q}}\) are defined as \(\hat{{\textbf{d}}} = ({\hat{d}}_1,{\hat{d}}_2,\ldots ,{\hat{d}}_M)^{T}, \hat{{\textbf{d}}}^*= ({\hat{d}}_1^*,{\hat{d}}_2^*,\ldots ,{\hat{d}}_M^*)^{T}, {\textbf{q}} = ({q_1},{q_2},\ldots ,{q_M})^{T}\) to represent the inputs collectively.
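As an illustration, the confidence-weighted, scale- and shift-invariant loss can be sketched numerically as follows; the least-squares alignment follows Ranftl et al.33, while the weight function passed in here is a generic placeholder.

```python
import numpy as np

# Sketch of the confidence-weighted scale- and shift-invariant MSE: first fit
# scale s and shift t of the prediction to the supervisory signal by least
# squares, then accumulate confidence-weighted squared residuals.
def align(d, d_star):
    A = np.stack([d, np.ones_like(d)], axis=1)     # columns: [d, 1]
    (s, t), *_ = np.linalg.lstsq(A, d_star, rcond=None)
    return s * d + t

def cw_ssi_mse(d, d_star, q, w):
    res = align(d, d_star) - d_star
    return float(np.sum(w(q) * res ** 2) / (2 * d.size))

d      = np.array([0.1, 0.4, 0.9, 0.2])
d_star = 3.0 * d + 1.0      # signal differs from prediction only by scale/shift
q      = np.array([0.9, 0.8, 0.1, 0.95])
loss = cw_ssi_mse(d, d_star, q, w=lambda q: (q >= 0.5).astype(float))
# the alignment absorbs the scale and shift, so the loss vanishes here
```

This illustrates the key invariance property: predictions that are correct up to an affine transform incur no penalty.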
Furthermore, we weight the multi-scale shift-invariant gradient matching term33 adapted to the disparity space according to the confidence and define a new gradient matching term \(L_{cwreg}\):
\[ L_{cwreg}\left( \hat{{\textbf{d}}}, \hat{{\textbf{d}}}^*, {\textbf{q}}\right) = \frac{1}{M} \sum _{k=1}^{K} \sum _{i=1}^{M} w\left( q_i\right) \left( \left| \nabla _x R_i^k\right| + \left| \nabla _y R_i^k\right| \right), \]
where k denotes the scale level at which the image resolution is halved for each level, and \(R_i^k\) is the difference in disparity at each scale k for pixel i. There are K scale levels, and K is set to \(K=4\) as in33. From the above, the final loss \(L_{cwssi}\) for the training set l is
\[ L_{cwssi} = \frac{1}{N_l} \sum _{n=1}^{N_l} L_{cwssimse}\left( {\hat{{\textbf{d}}}}^{n}, {\hat{{\textbf{d}}}}^{*n}, {\textbf{q}}^{n}\right) + \alpha L_{cwreg}\left( {\hat{{\textbf{d}}}}^{n}, {\hat{{\textbf{d}}}}^{*n}, {\textbf{q}}^{n}\right), \]
where \({\hat{{\textbf{d}}}}^{n}, {\hat{{\textbf{d}}}}^{*n}, {\textbf{q}}^{n}\) denote \({\hat{{\textbf{d}}}}, {\hat{{\textbf{d}}}}^{*}, {\textbf{q}}\) for each image set n, respectively, \(N_l\) is the size of the training set, and \(\alpha \) is set to 0.5.
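The confidence-weighted multi-scale gradient matching term can be sketched as follows; strided subsampling for the halved resolutions and the normalization by the full-resolution pixel count are simplifying assumptions of this sketch rather than the exact formulation of ref.33.

```python
import numpy as np

# Sketch of the confidence-weighted multi-scale gradient matching term: at each
# of K levels the residual image is halved in resolution (here by striding) and
# confidence-weighted absolute forward differences are accumulated.
def cw_gradient_term(d_hat, d_star, q, w, K=4):
    R, W = d_hat - d_star, w(q)
    total = 0.0
    for k in range(K):
        Rk, Wk = R[::2**k, ::2**k], W[::2**k, ::2**k]
        gx = np.abs(np.diff(Rk, axis=1)) * Wk[:, 1:]   # horizontal differences
        gy = np.abs(np.diff(Rk, axis=0)) * Wk[1:, :]   # vertical differences
        total += gx.sum() + gy.sum()
    return total / R.size

d_hat  = np.ones((8, 8))
d_star = np.ones((8, 8)) * 0.5    # constant residual: all gradients vanish
q      = np.full((8, 8), 0.9)
g = cw_gradient_term(d_hat, d_star, q, w=lambda q: q)
```

A constant residual yields zero, as expected: the term penalizes disagreement in disparity gradients, not in absolute offset, which keeps it compatible with the scale- and shift-invariant MSE.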
Weight functions
In this section, we propose two types of weight functions \(w\left( q\right) \).
Hard mask
Under the design concept that self-supervisory signals with confidence below a certain threshold \(\theta \in \left[ 0,1\right] \ \) are ignored as input, while those at or above \(\theta \) are input to the loss function as uniformly weighted information for learning, we define the weight \(w\left( q\right) \) as
\[ w\left( q\right) = {\left\{ \begin{array}{ll} 1 &{} \left( q \ge \theta \right) \\ 0 &{} \left( q < \theta \right) \end{array}\right. } \]
In this paper, this weight function is defined as Hard Mask.
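The Hard Mask can be sketched directly; the threshold value used below is only an example.

```python
import numpy as np

# Hard Mask weight: supervisory signals below the confidence threshold theta
# are dropped entirely, the rest contribute with uniform weight 1.
def hard_mask(q, theta=0.5):
    return (q >= theta).astype(float)

w = hard_mask(np.array([0.2, 0.5, 0.95]), theta=0.5)
```

In the loss, multiplying the per-pixel squared error by this weight is equivalent to excluding low-confidence pixels from the sum.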
Soft mask
Under the design concept that self-supervisory signals with confidence below a certain threshold \(\theta \) are ignored as input, while those at or above \(\theta \) are input to the loss function with importance weighted according to their confidence, we define the weight \(w\left( q\right) \) as
where \(\lambda \in {\mathbb {R}}_+\) is a hyperparameter. The weights are designed so that the higher the confidence of a self-supervisory signal, the larger its weight. Each weight multiplies the squared error between the prediction and the self-supervisory signal, making learning more efficient by emphasizing accurate features. At the same time, the weights improve robustness by including features with low confidence in the self-supervisory signal, such as artifact-ridden areas, with small weights rather than excluding them entirely. In this paper, this weight function is defined as Soft Mask.
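One plausible Soft Mask consistent with the design intent above (zero below the threshold, monotonically increasing with confidence, shaped by a hyperparameter \(\lambda \)) can be sketched as follows; the specific power form is an illustrative assumption, not the paper's published definition.

```python
import numpy as np

# Illustrative Soft Mask: zero below theta, then increasing with confidence q,
# shaped by the hyperparameter lam. The power form q**lam is an assumption
# chosen only to satisfy the stated design goals.
def soft_mask(q, theta=0.5, lam=2.0):
    return np.where(q >= theta, q ** lam, 0.0)

w = soft_mask(np.array([0.2, 0.6, 1.0]))   # low confidence -> small weight
```

Larger \(\lambda \) sharpens the preference for high-confidence signals, while \(\lambda \rightarrow 0\) approaches the Hard Mask behavior above the threshold.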
Monocular depth estimation module
This section describes the details of the monocular depth estimation module. In this module, DPT is used as the monocular depth estimation model and trained with the self-supervisory signal provided by the self-supervisory signal module. By using ViT as its encoder, DPT captures global context better than monocular depth estimation models that use a CNN encoder and achieves detailed and consistent depth estimation across the entire image. Because such detailed and consistent estimation is particularly important for dense depth estimation in laparoscopic scenes, DPT is adopted as the monocular depth estimation model. DPT provides depth predictions in inverse depth space. In this paper, we employ DPT-Hybrid27 as the encoder of DPT. Stereo in vivo images obtained from laparoscopes during surgery and examinations are generally few in number, which limits the training data available for fine-tuning. For this reason, we employ DPT-Hybrid, which has shown strong performance when fine-tuned on small datasets.
Self-supervisory signal module based on stereo vision with confidence
This section describes a self-supervisory signal module based on stereo vision that provides self-supervisory signals with confidence. It is practical to use stereo vision as the supervisory signal provider for training a monocular depth estimation model in a dynamic laparoscopic scene because, in principle, stereo vision is independent of object motion. However, since the object is often close to the laparoscope, the disparity between stereo laparoscope images is large, and matching over a wide disparity range is required for depth estimation by stereo vision. In addition, the accuracy of some stereo depth estimates may deteriorate, or invalid disparity may be output, due to occlusion areas and artifacts such as smoke in intraoperative laparoscopic images. Therefore, to use the disparity obtained from stereo images as a self-supervisory signal, it is important to obtain the confidence corresponding to each estimated disparity so that low-accuracy estimates can be appropriately down-weighted or excluded.
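The interface between such a stereo module and the confidence-aware loss can be sketched as follows; the disparity and confidence arrays are synthetic stand-ins for an STTR-style output, not real data.

```python
import numpy as np

# Sketch of packaging a stereo module's output as a confidence-aware
# supervisory signal: keep each pixel's disparity with its confidence and mark
# occluded/invalid pixels so the loss can exclude or down-weight them.
disparity  = np.array([[12.0, 30.0],
                       [55.0, -1.0]])   # -1 marks an invalid match (e.g. occlusion)
confidence = np.array([[0.95, 0.30],
                       [0.88, 0.00]])

valid = (disparity > 0) & (confidence > 0)
signal = {
    "disparity":  np.where(valid, disparity, np.nan),  # NaN = no supervision
    "confidence": np.where(valid, confidence, 0.0),    # zero weight in the loss
}
```

Downstream, setting the confidence of invalid pixels to zero makes both the Hard Mask and Soft Mask drop them automatically.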
Based on the above, this paper employs STTR as the self-supervisory signal module. Unlike conventional stereo depth estimation methods that use only similarities between pixel intensities as matching criteria, STTR estimates disparity by matching based on attention computed with ViT, which alleviates the limitation of conventional methods to a fixed disparity range. In fact, STTR has shown high accuracy on SCARED30, a dataset of medical scenes from laparoscopic surgery.
Conclusion
In this study, we proposed a method for learning a dense monocular depth estimation model that employs a Vision Transformer as its backbone and utilizes the binocular disparity obtained from stereo images together with its confidence information. The purpose of this method is to perform monocular depth estimation in dynamic laparoscopic scenes involving deformation and movement of the surgical field. By introducing a loss function that enables weighting and exclusion of self-supervisory signals based on confidence information, accurate depth estimation was achieved while utilizing actual surgical images as training datasets. This approach significantly lowers the hurdle of preparing training datasets for depth estimation. The framework presented here is applicable not only to the depth estimation architecture proposed in this paper but also to other methods where real surgical images can be employed as training data. The results of this study indicate the effectiveness and high generalization capability of the proposed method for image inputs with varying scene dynamics and laparoscopic procedures. Currently, the method provides relative depth rather than absolute depth. As a future challenge, it is necessary to integrate into the proposed method a flow that restores the estimated relative depth to real-space scale. This integration is expected to lead to applications in the autonomous control of surgical assist robots and in surgical navigation.
Since the ultimate users of our proposed method are humans, future work will need to consider not just the accuracy of the depth estimates but also factors such as the ease with which clinicians can interpret these estimates during surgery.
Data availability
The datasets during and/or analyzed during the current study are available from the corresponding authors on reasonable request.
References
Higgins, R. M., Frelich, M. J., Bosler, M. E. & Gould, J. C. Cost analysis of robotic versus laparoscopic general surgery procedures. Surg. Endosc. 31, 185–192. https://doi.org/10.1007/s00464-016-4954-2 (2017).
Maier-Hein, L. et al. Optical techniques for 3d surface reconstruction in computer-assisted laparoscopic surgery. Med. Image Anal. 17, 974–996. https://doi.org/10.1016/j.media.2013.04.003 (2013).
Pelanis, E. et al. Evaluation of a novel navigation platform for laparoscopic liver surgery with organ deformation compensation using injected fiducials. Med. Image Anal. 69, 101946. https://doi.org/10.1016/j.media.2020.101946 (2021).
von Atzigen, M. et al. Marker-free surgical navigation of rod bending using a stereo neural network and augmented reality in spinal fusion. Med. Image Anal. 77, 102365. https://doi.org/10.1016/j.media.2022.102365 (2022).
Bernhardt, S., Nicolau, S. A., Soler, L. & Doignon, C. The status of augmented reality in laparoscopic surgery as of 2016. Med. Image Anal. 37, 66–90. https://doi.org/10.1016/j.media.2017.01.007 (2017).
Hirschmuller, H. Accurate and efficient stereo processing by semi-global matching and mutual information. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., vol. 2, 807–814, https://doi.org/10.1109/CVPR.2005.56 (2005).
Geiger, A., Roser, M. & Urtasun, R. Efficient large-scale stereo matching. In Proc. Asian Conf. Comput. Vis., 25–38, https://doi.org/10.1007/978-3-642-19315-6_3 (2010).
Ye, M. et al. Self-supervised siamese learning on stereo image pairs for depth estimation in robotic surgery (2017). Presented at HSMR https://doi.org/10.48550/arXiv.1705.08260.
Recasens, D., Lamarca, J., Fácil, J. M., Montiel, J. & Civera, J. Endo-depth-and-motion: Reconstruction and tracking in endoscopic videos using depth networks and photometric constraints. IEEE Robot. Autom. Lett. 6, 7225–7232. https://doi.org/10.1109/LRA.2021.3095528 (2021).
Song, J., Wang, J., Zhao, L., Huang, S. & Dissanayake, G. Dynamic reconstruction of deformable soft-tissue with stereo scope in minimal invasive surgery. IEEE Robot. Autom. Lett. 3, 155–162. https://doi.org/10.1109/LRA.2017.2735487 (2018).
Song, J., Wang, J., Zhao, L., Huang, S. & Dissanayake, G. Mis-slam: Real-time large-scale dense deformable slam system in minimal invasive surgery based on heterogeneous computing. IEEE Robot. Autom. Lett. 3, 4068–4075. https://doi.org/10.1109/LRA.2018.2856519 (2018).
Zhang, L., Ye, M., Giataganas, P., Hughes, M. & Yang, G.-Z. Autonomous scanning for endomicroscopic mosaicing and 3d fusion. In Proc Int. Conf. Robot. Autom., 3587–3593, https://doi.org/10.1109/ICRA.2017.7989412 (2017).
Zbontar, J. & LeCun, Y. Stereo matching by training a convolutional neural network to compare image patches. J. Mach. Learn Res. 17, 1–32 (2016).
Chang, J.-R. & Chen, Y.-S. Pyramid stereo matching network. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., 5410–5418, https://doi.org/10.1109/CVPR.2018.00567 (2018).
Guo, X., Yang, K., Yang, W., Wang, X. & Li, H. Group-wise correlation stereo network. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., 3268–3277, https://doi.org/10.1109/CVPR.2019.00339 (2019).
Koishi, T., Sasaki, M., Nakaguchi, T., Tsumura, N. & Miyake, Y. Endoscopy system for length measurement by manual pointing with an electromagnetic tracking sensor. Opt. Rev. 17, 54–60. https://doi.org/10.1007/s10043-010-0010-y (2010).
Leonard, S. et al. Evaluation and stability analysis of video-based navigation system for functional endoscopic sinus surgery on in vivo clinical data. IEEE Trans. Med. Imaging 37, 2185–2195. https://doi.org/10.1109/TMI.2018.2833868 (2018).
Grasa, Ó. G., Bernal, E., Casado, S., Gil, I. & Montiel, J. M. M. Visual slam for handheld monocular endoscope. IEEE Trans. Med. Imaging 33, 135–146. https://doi.org/10.1109/TMI.2013.2282997 (2014).
Noh, H., Hong, S. & Han, B. Learning deconvolution network for semantic segmentation. In Proc. Int. Conf. Comput. Vis., 1520–1528, https://doi.org/10.1109/ICCV.2015.178 (2015).
Liu, X. et al. Dense depth estimation in monocular endoscopy with self-supervised learning methods. IEEE Trans. Med. Imaging 39, 1438–1447. https://doi.org/10.1109/TMI.2019.2950936 (2020).
Zhou, T., Brown, M., Snavely, N. & Lowe, D. G. Unsupervised learning of depth and ego-motion from video. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., 6612–6619, https://doi.org/10.1109/CVPR.2017.700 (2017).
Zhan, H. et al. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., 340–349, https://doi.org/10.1109/CVPR.2018.00043 (2018).
Godard, C., Aodha, O. M., Firman, M. & Brostow, G. Digging into self-supervised monocular depth estimation. In Proc. Int. Conf. Comput. Vis., 3827–3837, https://doi.org/10.1109/ICCV.2019.00393 (2019).
Shao, S. et al. Self-supervised monocular depth and ego-motion estimation in endoscopy: Appearance flow to the rescue. Med. Image Anal. 77, 102338. https://doi.org/10.1016/j.media.2021.102338 (2022).
Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale (2021). Presented at ICLR, https://openreview.net/forum?id=YicbFdNTTy.
Vaswani, A. et al. Attention is all you need. In Proc. Adv. Neural Inf. Process Syst., 5998–6008 (2017). https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
Ranftl, R., Bochkovskiy, A. & Koltun, V. Vision transformers for dense prediction. In Proc. Int. Conf. Comput. Vis., 12159–12168, https://doi.org/10.1109/ICCV48922.2021.01196 (2021).
Shaked, A. & Wolf, L. Improved stereo matching with constant highway networks and reflective confidence learning. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., 6901–6910, https://doi.org/10.1109/CVPR.2017.730 (2017).
Li, Z. et al. Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers. In Proc. Int. Conf. Comput. Vis., 6177–6186, https://doi.org/10.1109/ICCV48922.2021.00614 (2021).
Allan, M. et al. Stereo correspondence and reconstruction of endoscopic data challenge. Preprint at https://doi.org/10.48550/arXiv.2101.01133 (2021).
Facil, J. M. et al. Cam-convs: Camera-aware multi-scale convolutions for single-view depth. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., 11818–11827, https://doi.org/10.1109/CVPR.2019.01210 (2019).
Yin, W. et al. Learning to recover 3d scene shape from a single image. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., 204–213, https://doi.org/10.1109/CVPR46437.2021.00027 (2021).
Ranftl, R., Lasinger, K., Hafner, D., Schindler, K. & Koltun, V. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Trans. Pattern Anal. Mach. Intell. 44, 1623–1637. https://doi.org/10.1109/TPAMI.2020.3019967 (2022).
Yin, Z. & Shi, J. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., 1983–1992, https://doi.org/10.1109/CVPR.2018.00212 (2018).
Holland, P. W. & Welsch, R. E. Robust regression using iteratively reweighted least-squares. Commun. Stat. Theory Methods 6, 813–827. https://doi.org/10.1080/03610927708827533 (1977).
Beaton, A. E. & Tukey, J. W. The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics 16, 147–185. https://doi.org/10.1080/00401706.1974.10489171 (1974).
Sharan, L. et al. Domain gap in adapting self-supervised depth estimation methods for stereo-endoscopy. Curr. Dir. Biomed. Eng. 6. https://doi.org/10.1515/cdbme-2020-0004 (2020).
Mayer, N. et al. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., 4040–4048, https://doi.org/10.1109/CVPR.2016.438 (2016).
Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization (2015). Presented at ICLR, https://doi.org/10.48550/arXiv.1412.6980.
Acknowledgements
This work was supported in part by Japan Society for the Promotion of Science (JSPS) Grant number 21K18074.
Author information
Authors and Affiliations
Contributions
Y.H.: conceptualization; data curation; formal analysis; software; methodology; validation; visualization; writing—original draft; writing—review and editing. M.S.: conceptualization; data curation; formal analysis; supervision; validation; funding acquisition; visualization; writing—original draft; writing—review and editing. T.M.: writing—original draft; writing—review and editing. T.K.: formal analysis; writing—original draft; writing—review and editing. K.K.: supervision; project administration; writing—original draft; writing—review and editing.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Hirohata, Y., Sogabe, M., Miyazaki, T. et al. Confidence-aware self-supervised learning for dense monocular depth estimation in dynamic laparoscopic scene. Sci Rep 13, 15380 (2023). https://doi.org/10.1038/s41598-023-42713-x