Introduction

The increasing demand for minimally invasive surgery in recent years has led to the growing popularity of endoscopic surgery, including laparoscopic operations1. One reason for the widespread use of laparoscopic surgery is the increasing sophistication of the information provided by laparoscopes. In particular, with the advent of improved stereo laparoscopes and 4K/8K laparoscopes, it has become possible to acquire spatial information from laparoscopic images during surgery and examination. As a result, early detection of minute lesions, more accurate measurement of tumor size, and a more precise grasp of the positional relationship between organs and surgical instruments during surgery are now feasible. Furthermore, the acquired 3D information is expected to be applied to computer-assisted surgery, as typified by robot control support and surgical navigation2,3,4,5.

Laparoscopes are classified into stereo laparoscopes and monocular laparoscopes according to the number of optical systems at the tip of the scope. The mainstream stereoscopic laparoscope is a stereo laparoscope that incorporates two optical systems at the tip. With a stereo laparoscope, depth can be estimated by stereo vision from the disparity between the left and right images. Semi-Global Matching (SGM)6 and Efficient LArge-scale Stereo (ELAS)7 estimate depth from binocular disparity by performing matching on a single image pair, with the latter offering real-time capability. These methods can estimate depth from a pair of images acquired simultaneously by a stereo camera and reconstruct dense depth information even when the scene varies over time due to organ excision or grasping. In the in vivo depth estimation task in particular, ELAS is a representative method that has often served as a baseline and as a source of depth ground truth18. With a monocular laparoscope, on the other hand, multi-viewpoint methods estimate depth from monocular images obtained from multiple viewpoints. However, such a pseudo-stereo-like depth estimation approach is limited in imaging speed because it inherently requires the physical acquisition of two images. This limitation lowers the achievable frame rate (fps), which may make it difficult to apply during operations in which dynamic changes are large. In addition, while multi-viewpoint depth estimation methods such as SfM and Visual SLAM estimate depth by reconstructing the camera pose and the three-dimensional structure of feature points extracted from the images, dynamic scenes captured inside the living body often provide only scarce and uniform texture compared with general natural images, resulting in very non-uniform and sparse reconstructed 3D structure. Moreover, in actual surgery, reconstruction accuracy may deteriorate significantly because the entire surgical field is constantly changing due to the autonomous movement of organs themselves and their deformation upon contact with surgical instruments such as forceps. Therefore, in order to establish a monocular depth estimation method that remains effective in intraoperative settings with large dynamic changes, a learning-based method operating on a single monocular laparoscopic image from one viewpoint is advantageous. However, although monocular depth estimation for urban and natural images has been actively studied, realizing monocular depth estimation for clinical applications in computer-assisted surgery, such as surgical navigation in actual surgical environments, remains a challenging and unsolved problem. In particular, learning-based monocular depth estimation with a laparoscope for surgical applications faces a significant problem. Although it is desirable to train monocular depth estimation models for surgical applications on actual surgical scenes, such scenes contain various deformations, smoke, and other artifacts, which may generate unreliable supervisory signals and fail to yield desirable models. To the best of our knowledge, no learning architecture has yet been proposed for monocular depth estimation of dynamic laparoscopic scenes that takes the reliability of the supervisory signal into account.

In light of these challenges, this study aims to improve learning and estimation accuracy in dynamic laparoscopic settings, with particular emphasis on surgical navigation applications. One of the primary obstacles is acquiring a training dataset that accurately represents the real surgical environment: genuine surgical images contain substantial noise, making it difficult to establish an accurate ground truth.

To address this issue, we introduce a confidence level for the supervisory signal and propose a confidence-aware learning architecture that allows real laparoscopic surgical images to be used as the training dataset. This approach aims to adapt to dynamic laparoscopic scenarios characterized by continuous changes due to organ deformation, bleeding, and laparoscope repositioning.

This paper presents depth estimation for dynamic monocular laparoscopic scenes using the Hamlyn Dataset8 and evaluates its performance, resulting in an improved overall depth estimation accuracy independent of target motion. The primary contributions of the proposed method include:

  • A platform that enables depth estimation in the actual surgical field, where obtaining datasets with accurate ground truth is challenging, by acquiring indirect depth information from stereo endoscopic images captured during real surgery, without requiring precise depth information from LiDAR or other sources as a training dataset.

  • A mechanism that handles irregularities such as noise in the training dataset by taking not only the depth information but also its confidence as input, enabling selection and weighting of the supervisory signal.

  • A learning architecture for a monocular depth estimation model that employs a confidence-aware loss function and a stereo vision-based self-supervisory signal module, which outputs disparity and its confidence level. The monocular depth estimation model trained in this manner enables learning on dynamic laparoscopic scenes and single-frame monocular depth estimation. This capability is crucial in laparoscopic surgery, where the endoscopic camera is frequently repositioned according to the conditions of the operating field.

In the field of computer vision, the recent development of DNNs has promoted research on learning-based monocular depth estimation methods, with many applications to urban scenes. In the laparoscopic scene as well, there have been several studies on depth estimation from monocular images, including the work of Ye et al.22 and self-supervised approaches introducing a per-pixel minimum reprojection loss and an auto-masking process that removes stationary pixels23. However, according to recent work by Shao et al.24, the light source often moves in endoscopic scenes, which is very unlikely in urban scenes, so that the illumination conditions between images do not match. Moreover, the surface of a living body provides few distinctive features, and strong non-Lambertian reflection may also occur. It has been confirmed that the performance of the conventional SfMLearner, which was designed for urban scenes, deteriorates under these adverse conditions. To improve the performance of SfMLearner for endoscopic scenes, Shao et al.24 extended it by introducing appearance flow to align the brightness conditions between frames and by adopting a feature scaling module to refine the feature representation, and successfully adapted the method to the endoscopic scene.

These methods are highly effective for applications such as endoscopic diagnosis of static sites. However, problems arise when applying them to dynamic surgical scenes. Conventional methods have been designed on the premise that most images show static tissue without deformation from interaction with forceps, so that learning from a stationary and clean dataset is possible. In contrast, laparoscopic images obtained during actual surgery contain deformation of the organs themselves, spatial changes caused by bleeding and excision, and dynamic deformation and artifacts due to interaction with forceps. For applications in actual dynamic surgical scenes, image sequences obtained from real surgery are desirable for developing an estimation model. However, data available from surgery are often noisy and less accurate than data obtained in a static environment. Therefore, to realize learning with data from surgical scenes, a novel architecture is required in which learning is not affected by dynamic deformation of the objects in the dataset and in which supervisory information can be selected, removed, or weighted according to its accuracy.

Meanwhile, in the field of computer vision, where CNNs have long been the method of choice, a newer technique called the Vision Transformer (ViT)25 has been introduced and shows similar or better performance than state-of-the-art methods. ViT was derived from the Transformer26, which has been very successful in natural language processing. In particular, the Dense Prediction Transformer (DPT)27 was proposed as a dense monocular depth estimation architecture that uses ViT as its backbone, and it performs as well as or better than conventional monocular depth estimation models that employ a CNN backbone. Since ViT captures global context better than CNNs, it is expected to show superior performance in monocular depth estimation of laparoscopic scenes, where detailed and consistent estimation over the entire image is required. To the best of our knowledge, ViT has not yet been adopted as the backbone of a monocular depth estimation model for the laparoscopic scene.

Since application in surgical scenes is assumed, where the estimation model must cope with dynamic laparoscopic scenes distorted by various deformations and artifacts such as smoke, it is practical to provide supervisory signals via stereo vision when training a monocular depth estimation model, because stereo vision does not depend on the dynamics of the object. However, conventional stereo vision provides only a disparity prediction without any measure of its confidence. It has therefore been difficult to flexibly handle outliers or to select measurement results according to their confidence level. Consequently, outliers and inaccurate signals were undesirably included among the supervisory signals, hindering learning and in some cases leading to a poor estimation model. Recently, however, stereo matching methods that provide confidence information together with disparity have been proposed28,29. In particular, the STereo TRansformer (STTR)29 provides a confidence for each prediction while improving accuracy by employing ViT for matching, and it has shown highly accurate results on SCARED30, a dataset of laparoscopic surgical scenes. Such advances in stereo vision have made it possible to measure ground truth with higher accuracy than before, even in dynamic laparoscopic scenes, by appropriately filtering out outliers according to the confidence information. They have also enabled the generation of supervisory signals that incorporate not only depth information but also its confidence. This suggests that not only high-precision disparity but also the confidence of each prediction can be used as a learning resource for the monocular depth estimation model. As a result, even when learning from dynamic laparoscopic scenes, where outliers are likely to be included in the depth supervisory signals, robust and efficient learning that yields high estimation accuracy can be performed if the signals are processed appropriately according to the confidence information.

However, conventional architectures that utilize only depth information as a supervisory signal do not consider its uncertainty and thus fail to make the best use of the confidence information. Moreover, a learning architecture for a monocular depth estimation model that takes advantage of the confidence of the supervisory signal has yet to be proposed for the monocular depth estimation task in the laparoscopic scene.

Methods

In learning methods for monocular depth estimation models that simultaneously learn scale and shift parameters, such as those used in previous studies focusing on endoscopy, overfitting occurs due to biases in the training data caused by camera-specific characteristics and the shooting environment used to generate the data31. Therefore, various scale- and shift-invariant losses have been introduced in recent studies, achieving state-of-the-art performance in monocular depth estimation32. Specifically, acquiring scale- and shift-invariant models is crucial for monocular depth estimation in dynamic settings such as real laparoscopic surgery, offering two primary advantages in clinical applications.

Firstly, these models can be trained on diverse datasets featuring various organs, surgical techniques, and endoscopic cameras. The normalization of relative depth features in relation to scale and shift allows for training on a combination of in vivo datasets from different environments, resulting in a versatile monocular depth estimation model. Furthermore, the capacity to utilize a mix of datasets as training data enables the effective use of in vivo datasets, which are typically scarce and challenging to obtain, thus expanding the dataset.

Secondly, the distance between the laparoscope and the subject constantly changes in actual laparoscopic surgical settings, causing significant variations in scale and shift parameters. The scale and shift of objects in the operating field differ between surgeries and over time, leading to substantial deviations from the training data. For instance, when suturing blood vessels, the suture area is the focal point, whereas when handling organs or navigating the surgical field, the entire abdominal cavity is viewed from a pulled-back perspective to avoid instrument contact with other organs and potential damage. Therefore, learning scale and shift parameters considerably reduces generalizability, even more so than in depth estimation for landscapes and buildings as in autonomous driving. Consequently, it is vital to learn relative depth features independent of scale and shift parameters and to develop scale- and shift-invariant monocular depth estimation models applicable to various surgical procedures.

Therefore, we propose a scale- and shift-invariant loss function that takes into account the confidence of self-supervisory signals in order to learn relative depth information appropriately, independently of the scale and shift parameters, on the assumption that a flow restoring the estimated relative depth to absolute depth will be applied later. This method enables us to obtain a monocular depth estimation model suitable for application in surgical environments.
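As a minimal illustration of that assumed downstream flow (not part of the training architecture itself), absolute depth could be recovered at application time by fitting a scale and shift against a few absolute reference measurements, for example from a calibrated stereo frame; the function below and its interface are hypothetical.

```python
import torch

def to_absolute(pred_inv_depth: torch.Tensor,
                ref_inv_depth: torch.Tensor,
                ref_mask: torch.Tensor) -> torch.Tensor:
    """Restore absolute (inverse) depth from a relative prediction by least-squares
    alignment of scale and shift against sparse absolute references (a sketch)."""
    d = pred_inv_depth[ref_mask]          # predicted values at the reference pixels
    d_star = ref_inv_depth[ref_mask]      # absolute reference values
    A = torch.stack([d, torch.ones_like(d)], dim=1)            # [M, 2]
    sol = torch.linalg.lstsq(A, d_star.unsqueeze(1)).solution  # [2, 1] -> (s, t)
    s, t = sol[0], sol[1]
    return s * pred_inv_depth + t
```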

In the following, we describe a method for learning monocular depth estimation models using the disparity and confidence estimated from stereo laparoscope images as self-supervisory signals, in order to achieve effective learning in dynamic laparoscopic scenes and dense monocular depth estimation from a single laparoscopic image. The overall training architecture is shown in Fig. 1. It consists of a monocular depth estimation model that actually undergoes training, a self-supervisory signal module composed of stereo vision that outputs disparity and its confidence, and a confidence-aware loss module that includes a scale and shift-invariant loss taking disparity and its confidence as input.

Figure 1

The proposed network architecture. The network in the training phase (top) consists of a stereo vision-based self-supervisory signal module that provides disparity and its confidence as self-supervisory signals, a monocular depth estimation module that actually performs the learning, and a newly designed scale and shift-invariant loss that takes disparity and its confidence as input. In the application phase (bottom), monocular depth estimation is performed using only the learned monocular depth estimation module with a single image as input.
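For concreteness, the training phase in Fig. 1 can be summarized as in the following sketch. The names `stereo_module`, `depth_model`, and `confidence_weighted_loss` are placeholders for the stereo vision-based self-supervisory signal module, the monocular depth estimation module, and the confidence-aware loss described in the following sections; the data-loading details are assumptions for illustration.

```python
import torch

def train_one_epoch(depth_model, stereo_module, confidence_weighted_loss,
                    loader, optimizer, device="cuda"):
    """One training epoch following Fig. 1 (a sketch, not a reference implementation)."""
    depth_model.train()
    stereo_module.eval()  # the self-supervisory module is frozen; it only provides signals
    for left, right in loader:                     # rectified stereo laparoscope image pairs
        left, right = left.to(device), right.to(device)
        with torch.no_grad():
            # disparity d* and per-pixel confidence q in [0, 1] from stereo vision
            disp_gt, conf = stereo_module(left, right)
        pred = depth_model(left)                   # monocular prediction (inverse-depth space)
        loss = confidence_weighted_loss(pred, disp_gt, conf)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```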

Confidence-aware loss module

Confidence

The self-supervisory signal module outputs, from a pair of stereo laparoscope images, the estimated disparity together with a corresponding confidence expressed as a real number between 0 and 1. In the following, the confidence is denoted by \(q\in \left[ 0,1\right]\). Here, disparity is calculated for every pixel of the training images, ultimately yielding a disparity and confidence pair for each pixel.

Confidence-weighted scale and shift-invariant loss

In this section, we extend the scale and shift-invariant loss proposed in Ranftl et al.33 and propose a loss weighted according to the confidence of the self-supervisory signal. First, let M be the number of pixels in the image that have a valid ground truth, and let \(\theta \) be a parameter of the prediction model. Let \(d=d(\theta )\in {\mathbb {R}}^M\) be the estimated disparity and \(d^*\in {\mathbb {R}}^M\) be the corresponding ground truth. To define a scale and shift-invariant loss, we first need to properly fit the scale and shift of the predictions to the ground truth. The alignment of the scale and shift between the predictions and ground truth is performed based on a least-squares criterion:

$$\begin{aligned} \left( s,t\right) =\mathop {\text {arg min}}\limits _{s,t}{\sum _{i=1}^{M}{\left( s{d_i}+t-d_i^*\right) }^2}, \end{aligned}$$
(1)
$$\begin{aligned} {\hat{d}}_i=s{d_i}+t,\quad {\hat{d}}_i^*={d_i^*}, \end{aligned}$$
(2)

where \(s\in {\mathbb {R}}_+\) is the scale parameter, \(t\in {\mathbb {R}}\) is the shift parameter, and \({\hat{d}}_i, {\hat{d}}_i^*\in {\mathbb {R}}\) are the aligned prediction and ground truth. The value corresponding to each pixel i is denoted by the subscript i. Then, with \(w\left( q\right) \) as the weight corresponding to the confidence q of the self-supervisory signal of the pixel obtained from the self-supervisory signal module, we define the confidence-weighted scale and shift-invariant loss \(L_{cwssimse}\) for a single image as

$$\begin{aligned} L_{cwssimse}\left( \hat{{\textbf{d}}},{\hat{{\textbf{d}}}}^*,{\textbf{q}}\right)&=\frac{1}{2M}\sum _{i=1}^{M}w_i(q_i)\left\Vert {\hat{d}}_i-{\hat{d}}_i^*\right\Vert ^2, \end{aligned}$$
(3)

where \(\hat{{\textbf{d}}}, \hat{{\textbf{d}}}^*, {\textbf{q}}\) are defined as \(\hat{{\textbf{d}}} = ({\hat{d}}_1,{\hat{d}}_2,\ldots ,{\hat{d}}_M)^{T}, \hat{{\textbf{d}}}^*= ({\hat{d}}_1^*,{\hat{d}}_2^*,\ldots ,{\hat{d}}_M^*)^{T}, {\textbf{q}} = ({q_1},{q_2},\ldots ,{q_M})^{T}\) to represent the inputs collectively.
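Since Eq. (1) has a closed-form solution, Eqs. (1)–(3) can be written compactly in code. The sketch below is a minimal PyTorch-style illustration assuming the prediction, supervisory disparity, and weights have already been flattened over the M valid pixels; following Eq. (1) as written, the alignment itself is unweighted, while the confidence weights \(w(q)\) enter only through Eq. (3). Function names are ours, not from a released implementation.

```python
import torch

def align_scale_shift(d: torch.Tensor, d_star: torch.Tensor):
    """Closed-form least-squares solution of Eq. (1) for the scale s and shift t."""
    # Normal equations of min_{s,t} sum_i (s*d_i + t - d*_i)^2
    a00, a01 = torch.sum(d * d), torch.sum(d)
    a11 = d.new_tensor(float(d.numel()))
    b0, b1 = torch.sum(d * d_star), torch.sum(d_star)
    det = a00 * a11 - a01 * a01
    s = (a11 * b0 - a01 * b1) / det
    t = (a00 * b1 - a01 * b0) / det
    return s, t

def cw_ssi_mse(d: torch.Tensor, d_star: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Confidence-weighted scale- and shift-invariant MSE of Eq. (3).

    d, d_star, w: predicted disparity, supervisory disparity, and per-pixel
    weights w(q) over the M valid pixels, each flattened to shape [M].
    """
    s, t = align_scale_shift(d, d_star)
    d_hat = s * d + t                     # Eq. (2): aligned prediction
    return torch.sum(w * (d_hat - d_star) ** 2) / (2 * d.numel())
```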

Furthermore, we weight the multi-scale shift-invariant gradient matching term33 adapted to the disparity space according to the confidence and define a new gradient matching term \(L_{cwreg}\):

$$\begin{aligned} L_{cwreg}\left( \hat{{\textbf{d}}},{\hat{{\textbf{d}}}}^*,{\textbf{q}}\right)&=\frac{1}{M}\sum _{k=1}^{K}\sum _{i=1}^{M}{w_i(q_i)\left( \left| \mathrm {\nabla }_xR_i^k\right| +\left| \mathrm {\nabla }_yR_i^k\right| \right) }, \end{aligned}$$
(4)
$$\begin{aligned} R_i^k&={\hat{d}}_i-{\hat{d}}_i^*\ (k: scale\ level), \end{aligned}$$
(5)

where k denotes the scale level, at which the image resolution is halved for each level, and \(R_i^k\) is the difference in disparity at scale k for pixel i. There are K scale levels, and K is set to \(K=4\) as in Ranftl et al.33. From the above, the final loss \(L_{cwssi}\) for the training set l is

$$\begin{aligned} \begin{aligned} L_{cwssi}&= \frac{1}{N_l}\sum _{n=1}^{N_l}{L_{cwssimse}\left( {\hat{{\textbf{d}}}}^{n},{\hat{{\textbf{d}}}}^{*n},{\textbf{q}}^{n}\right) } \\&\quad + \alpha \frac{1}{N_l}\sum _{n=1}^{N_l}{L_{cwreg}\left( {\hat{{\textbf{d}}}}^{n},{\hat{{\textbf{d}}}}^{*n},{\textbf{q}}^{n}\right) }, \end{aligned} \end{aligned}$$
(6)

where \({\hat{{\textbf{d}}}}^{n}, {\hat{{\textbf{d}}}}^{*n}, {\textbf{q}}^{n}\) denote \({\hat{{\textbf{d}}}}, {\hat{{\textbf{d}}}}^{*}, {\textbf{q}}\) for each image set n, respectively, \(N_l\) is the size of the training set, and \(\alpha \) is set to 0.5.
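A minimal sketch of the gradient matching term of Eqs. (4)–(5) and the combined loss of Eq. (6) is given below, assuming image-shaped (H × W) tensors for the aligned prediction, supervisory disparity, weights, and validity mask; border and invalid-pixel handling is a simplification for illustration, not necessarily identical to the reference implementation.

```python
import torch
import torch.nn.functional as F

def cw_gradient_matching(d_hat: torch.Tensor, d_star: torch.Tensor,
                         w: torch.Tensor, valid: torch.Tensor,
                         num_scales: int = 4) -> torch.Tensor:
    """Confidence-weighted multi-scale gradient matching term of Eqs. (4)-(5)."""
    m = valid.sum().clamp(min=1)          # M valid pixels at full resolution
    v = valid.float()
    r = (d_hat - d_star) * v              # R_i at the finest scale
    wk = w * v
    loss = d_hat.new_zeros(())
    for _ in range(num_scales):           # K = 4 scale levels
        grad_x = (r[:, 1:] - r[:, :-1]).abs()
        grad_y = (r[1:, :] - r[:-1, :]).abs()
        loss = loss + torch.sum(wk[:, 1:] * grad_x) + torch.sum(wk[1:, :] * grad_y)
        # halve the resolution for the next scale level (assumes even H and W)
        r = F.avg_pool2d(r[None, None], kernel_size=2)[0, 0]
        wk = F.avg_pool2d(wk[None, None], kernel_size=2)[0, 0]
    return loss / m

def cw_ssi_total(per_image_terms, alpha: float = 0.5) -> torch.Tensor:
    """Final loss of Eq. (6): `per_image_terms` is a list of
    (L_cwssimse, L_cwreg) pairs, one per image of the training set."""
    mse = torch.stack([t[0] for t in per_image_terms]).mean()
    reg = torch.stack([t[1] for t in per_image_terms]).mean()
    return mse + alpha * reg
```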

Weight functions

In this section, we propose two types of weight functions \(w\left( q\right) \).

Hard mask

Under the design concept that self-supervisory signals with confidence below a certain threshold \(\theta \in \left[ 0,1\right]\) are ignored, while those at or above \(\theta \) are fed into the loss function with uniform importance, we define the weight \(w\left( q\right) \) as

$$\begin{aligned} w(q) = {\left\{ \begin{array}{ll} 1 &{}\ (\theta \le q \le 1)\\ 0 &{}\ (0 \le q < \theta ) . \end{array}\right. } \end{aligned}$$
(7)

In this paper, this weight function is defined as Hard Mask.

Soft mask

Under the design concept that self-supervisory signals with confidence below a certain threshold \(\theta \) are ignored, while those at or above \(\theta \) are fed into the loss function with an importance according to their confidence, we define the weight \(w\left( q\right) \) as

$$\begin{aligned} w(q) = {\left\{ \begin{array}{ll} e^{\lambda \left( q-1\right) } &{}\ (\theta \le q \le 1)\\ 0 &{}\ (0 \le q < \theta ) , \end{array}\right. } \end{aligned}$$
(8)

where \(\lambda \in {\mathbb {R}}_+\) is a hyperparameter. The weights are designed so that the higher the confidence of the self-supervisory signal, the larger the weight. The weights are applied to the squared error between the prediction and the self-supervisory signal and are intended to make learning more efficient by emphasizing accurate features. At the same time, features with low confidence in the self-supervisory signal, such as artifact-ridden areas, are not excluded entirely but are given small weights, which is intended to improve robustness. In this paper, this weight function is defined as Soft Mask.
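Both weight functions of Eqs. (7) and (8) are short to implement; the sketch below shows them in PyTorch form. The threshold \(\theta \) and the hyperparameter \(\lambda \) are left as arguments because their values are not fixed in this section.

```python
import torch

def hard_mask(q: torch.Tensor, theta: float) -> torch.Tensor:
    """Hard Mask, Eq. (7): unit weight at or above the threshold, zero below."""
    return (q >= theta).float()

def soft_mask(q: torch.Tensor, theta: float, lam: float) -> torch.Tensor:
    """Soft Mask, Eq. (8): exponential weighting above the threshold, zero below."""
    w = torch.exp(lam * (q - 1.0))        # largest weight (1) at q = 1
    return torch.where(q >= theta, w, torch.zeros_like(w))
```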

Monocular depth estimation module

This section describes the details of the monocular depth estimation module. In the monocular depth estimation module, DPT is used as the monocular depth estimation model and is trained with the self-supervisory signal provided by the self-supervisory signal module. By using ViT as the encoder, DPT captures global context better than monocular depth estimation models that use a CNN encoder and achieves detailed and consistent depth estimation across the entire image. Because such detailed and consistent estimation across the entire image is particularly important for dense depth estimation in laparoscopic scenes, DPT is used as the monocular depth estimation model. DPT provides depth predictions in inverse-depth space. In this paper, we employ DPT-Hybrid27 as the encoder of DPT. Stereo in vivo images obtained from laparoscopes during surgery and examinations are generally scarce, which limits the amount of training data available for fine-tuning. For this reason, we employ DPT-Hybrid, which has shown strong performance when fine-tuned on small datasets.
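For reference, pretrained DPT-Hybrid weights are publicly available and can be loaded and prepared for fine-tuning as sketched below; the torch.hub entry point from the public MiDaS repository and the optimizer settings are illustrative assumptions rather than the exact configuration used in this work.

```python
import torch

# Load a pretrained DPT-Hybrid monocular depth model (predicts in inverse-depth space)
# from the public MiDaS repository; the pretraining weights and fine-tuning schedule
# used in this work may differ.
model = torch.hub.load("intel-isl/MiDaS", "DPT_Hybrid")
transform = torch.hub.load("intel-isl/MiDaS", "transforms").dpt_transform

model.train()                                                  # fine-tune on laparoscopic frames
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)     # illustrative hyperparameters
```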

Self-supervisory signal module based on stereo vision with confidence

This section describes a self-supervisory signal module based on stereo vision that provides self-supervisory signals with confidence. It is practical to use stereo vision as the provider of supervisory signals for training a monocular depth estimation model in a dynamic laparoscopic scene, because stereo vision is, in principle, independent of object motion. However, since the object is often close to the laparoscope, the disparity between stereo laparoscope images is generally large, and matching over a wide disparity range is required for depth estimation by stereo vision. In addition, the accuracy of some stereo depth estimation results may deteriorate, or invalid disparity may be output, owing to occlusion areas and artifacts such as smoke in intraoperative laparoscopic images. Therefore, in order to use the disparity obtained from stereo images as a self-supervisory signal, it is important to obtain the confidence corresponding to the estimated disparity so that results with low accuracy can be appropriately down-weighted or excluded.

Based on the above, this paper employs STTR as the self-supervisory signal module. Unlike conventional stereo depth estimation methods that use only similarities between pixel intensities as the matching criterion, STTR estimates disparity by matching based on attention computed with ViT, which relaxes the limitation of conventional methods to a fixed disparity range. In fact, STTR has shown high accuracy on SCARED30, a dataset of laparoscopic surgical scenes.
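A sketch of how such a module can provide the supervisory signal is shown below; `sttr_model` and the interface it exposes are a hypothetical wrapper around the public STTR implementation, assumed to return a per-pixel disparity map together with a per-pixel confidence in [0, 1], and the treatment of unmatched or occluded pixels is likewise an assumption for this sketch.

```python
import torch

@torch.no_grad()
def make_supervision(sttr_model, left: torch.Tensor, right: torch.Tensor):
    """Generate the self-supervisory signal (d*, q) from a rectified stereo pair.

    `sttr_model` is a hypothetical wrapper returning a disparity map and a
    per-pixel confidence in [0, 1]; invalid (e.g. occluded) pixels are assumed
    to carry zero confidence or non-finite disparity.
    """
    disp, conf = sttr_model(left, right)
    conf = conf.clamp(0.0, 1.0)
    valid = torch.isfinite(disp) & (conf > 0)
    return disp, conf, valid
```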

Conclusion

In this study, we proposed a method for learning a dense monocular depth estimation model that employs a Vision Transformer as its backbone and utilizes the binocular disparity obtained from stereo images together with its confidence information. The purpose of this method is to perform monocular depth estimation in dynamic laparoscopic scenes involving deformation and movement of the surgical field. By introducing a loss function that enables weighting and exclusion of self-supervisory signals based on confidence information, accurate depth estimation was achieved while utilizing actual surgical images as the training dataset. This approach significantly lowers the hurdle of preparing training datasets for depth estimation. The framework presented here is applicable not only to the depth estimation architecture proposed in this paper but also to other methods in which real surgical images can be employed as training data. The results of this study indicate the effectiveness and high generalization capability of the proposed method for image inputs with varying scene dynamics and laparoscopic procedures. Currently, the method provides relative rather than absolute depth. As a future challenge, the proposed method needs to be integrated with a flow that restores the estimates to real-space scale. This integration is expected to lead to applications in autonomous control of surgical assist robots and surgical navigation. Since the ultimate users of the proposed method are humans, future work will also need to consider not only the accuracy of the depth estimates but also factors such as the ease with which clinicians can interpret these estimates during surgery.