1 Introduction

Autonomous robots are composed of different modules that allow them to perceive, learn, decide and act within their environment. The perception module processes cues that inform the robot about the appearance and geometry of the environment. Particularly when working in outdoor or underground scenarios, these cues must be robust to unseen phenomena. A fully autonomous robot must execute all operations, monitor itself and handle unprecedented events and conditions, such as unexpected objects and debris on the road, unseen environments and adverse weather. Reliable and robust perception of the surrounding environment is therefore one of the key tasks in autonomous robotics.

Among the main inputs for a perception system are the distances of the robot from multiple points in its environment. This input can be obtained directly from a sensor or estimated by a depth estimation module. Depth estimation can be performed by processing monocular camera images, stereo vision, radar or LiDAR (Light Detection and Ranging) sensors, among others. Although monocular cameras can only be used to generate depth information up to a scale, they remain an important component of a depth estimation system due to their low price and the rich appearance data they provide. A monocular camera is small, low-cost and energy efficient, but it is very sensitive to changes in illumination. Additionally, the accuracy and reliability of depth estimation methods based on monocular images alone is still far from being practical. For instance, state-of-the-art methods that fuse RGB images with depth data [63] typically require the fused modalities to have similar coverage densities. Our proposed framework provides a procedure for fusing LiDAR and image data that is independent of the LiDAR data’s density. The main contribution of this paper is a depth regression model that takes both a sparse set of depth samples and RGB images as inputs and predicts a full-resolution depth map. This is achieved by modelling the problem of fusing low-resolution depth images with high-resolution camera images as a conditional random field (CRF).

The intuition behind our CRF formulation is that depth discontinuities in a scene often co-occur with changes in colour or brightness in the associated camera image. Since the camera image is commonly available at a much higher resolution, this insight can be used to enhance the resolution and accuracy of the depth image. Our approach produces a depth map from three input features, as illustrated in Fig. 1. The first is an RGB colour image from the camera sensor, top image in Fig. 1 (a). The second is a 2D sparse depth map captured by a LiDAR sensor, middle image in Fig. 1 (b). The third is a surface-normal map generated from the sparse depth samples, bottom image in Fig. 1 (c).

The rest of this paper is organized as follows. Section 2 reviews related work on depth estimation. Section 3 explains how the LiDAR points and the camera images were registered. In Sect. 4, we first introduce our CRF-Fusion framework, and then, we provide a detailed explanation of the proposed model: the energy potentials that compose our CRF model and its inference machine. The experimental validation, performed on the KITTI dataset, is reported in Sect. 5. Finally, conclusions and directions for future work are listed in Sect. 6.

Fig. 1

Input features of our framework. We developed a CRF regression model to predict a dense depth image from a single RGB image, and a set of sparse depth samples: (a), (b) and (c) are the input RGB image, a set of sparse depth samples projected on the image plane and the projected surface normals, respectively

2 Related work

Depth estimation from monocular images is a long-standing problem in computer vision. Early works on depth estimation using RGB images usually relied on handcrafted features and inference on probabilistic graphical models. Classical methods include shape-from-shading [72] and shape-from-defocus [62]. Other early methods were based on hand-tuned models or assumptions about the orientations of the surfaces [3].

Figure 3 illustrates the closest neighbours of two superpixel nodes, whose centroids are represented by yellow dots.

Fig. 3

Two examples of superpixels and their neighbours. The red dots in A and B are nodes assigned to two different superpixels, while the green lines represent their corresponding nearest neighbours. Note that the superpixel segmentation used allows a superpixel node to have more than four neighbours

Figure 4 shows the 4 closest neighbours selected for each superpixel. These neighbours are represented by dark green lines and dark yellow dots. Note that for superpixels located at the corners or on the edges of the image, only the 2 or 3 closest neighbours, respectively, need to be found.

Fig. 4

Selected 4 nearest neighbours for superpixel nodes (red dots) in Fig. 3. Dark green lines connect nodes with their selected neighbours
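To make the neighbourhood construction concrete, the following sketch segments an image into superpixels and connects each superpixel centroid to its four nearest neighbours. It is a minimal illustration of the procedure described above, assuming the SLIC implementation from scikit-image and a KD-tree from SciPy; the function and variable names are our own and not taken from the paper.

```python
# Minimal sketch: superpixel segmentation and 4-nearest-neighbour graph.
# Assumes scikit-image and SciPy are available; names are illustrative only.
import numpy as np
from skimage.segmentation import slic
from scipy.spatial import cKDTree

def superpixel_neighbour_graph(rgb, n_segments=2400, k=4):
    # Over-segment the image into superpixels.
    labels = slic(rgb, n_segments=n_segments, compactness=10, start_label=0)
    n_sp = labels.max() + 1

    # Compute the centroid (row, col) of each superpixel.
    rows, cols = np.indices(labels.shape)
    centroids = np.zeros((n_sp, 2))
    for s in range(n_sp):
        mask = labels == s
        centroids[s] = [rows[mask].mean(), cols[mask].mean()]

    # Connect every centroid to its k nearest neighbouring centroids
    # (boundary superpixels simply reuse the k closest ones found).
    tree = cKDTree(centroids)
    _, idx = tree.query(centroids, k=k + 1)   # first hit is the node itself
    edges = [(i, j) for i in range(n_sp) for j in idx[i, 1:]]
    return labels, centroids, edges
```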

4 CRF-based camera–LiDAR fusion for depth estimation

In this paper, depth estimation is formulated as a superpixel-level inference task on a modified conditional random field (CRF). Our proposed model is a multi-sensor extension of the classical pairwise CRF. In this section, we first briefly introduce the CRF model and then show how the information from an image and a sparse LiDAR point cloud is fused within our novel CRF framework.

4.1 Overview

The conditional random field (CRF) is a type of undirected probabilistic graphical model widely used for solving labelling problems. Formally, let \({\textbf{X}}=\left\{ X_{1}, X_{2}, \ldots , X_{N}\right\} \) be a set of random variables to be inferred from an observation or input tensor \({\textbf{Y}}\), which in turn is composed of the observation variables \(c_{i}\) and \(y_{i}\), where i is an index over superpixels. For each superpixel i, the variable \(c_{i}\) corresponds to an observed three-dimensional colour value and \(y_{i}\) is an observed range measurement. The goal of our framework is to infer the depth of each pixel in a single image depicting general scenes. Following the work of [11, 45], we make the common assumption that an image is composed of small homogeneous regions (superpixels) and consider a graphical model composed of nodes defined on superpixels. Note that our framework is flexible and can estimate depth values on either pixels or superpixels.

The remaining question is how to parametrize this undirected graph. Because the interactions between adjacent nodes in the graph are not directed, there is no reason to use a standard conditional probability distribution (CPD), in which one represents the distribution over one node given the others. Rather, we need a more symmetric parametrization. Intuitively, we want our model to capture the affinities between the depth estimates of the superpixels in a given neighbourhood. These affinities can be captured as follows: let \({\tilde{P}}(X,Y)\) be an unnormalized Gibbs joint distribution parametrized as a product of factors \(\Phi \), where

$$\begin{aligned} \Phi =\left\{ \phi _{1}\left( D_{1}\right) , \ldots , \phi _{k}\left( D_{k}\right) \right\} , \end{aligned}$$

and

$$\begin{aligned} {\tilde{P}}(X,Y)=\prod _{i=1}^{m} \phi _{i}\left( D_{i}\right) . \end{aligned}$$

We can then write a conditional probability distribution of the depth estimates X given the observations Y using the previously introduced Gibbs distribution, as follows:

$$\begin{aligned} Pr(X | Y)=\frac{{\tilde{P}}(X, Y)}{Z(Y)} \end{aligned}$$

where

$$\begin{aligned} Z(Y)=\sum _{X} {\tilde{P}}(X, Y). \end{aligned}$$

Here, Z(Y), also known as ‘the partition function’, works as a normalizing factor which marginalizes X from \({\tilde{P}}(X,Y)\), allowing the calculation of the probability distribution P(X|Y):

$$\begin{aligned} P(X | Y)=\frac{1}{\sum _{X} {\tilde{P}}(X, Y)} {\tilde{P}}(X,Y). \end{aligned}$$

Therefore, similar to conventional CRFs, we model the conditional probability distribution of the data with the following density function:

$$\begin{aligned} {\text {P}}({\textbf{X}} | {\textbf{Y}})=\frac{1}{\textrm{Z}({\textbf{Y}})} \exp (-E({\textbf{X}}, {\textbf{Y}})) \end{aligned}$$

where E is the energy function and Z is the partition function defined by

$$\begin{aligned} \textrm{Z}({\textbf{Y}})=\int _{{\textbf{X}}} \exp \{-E({\textbf{X}}, {\textbf{Y}})\}\, \textrm{d}{\textbf{X}}. \end{aligned}$$
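All of the potentials introduced in Sect. 4.2 are quadratic in \({\textbf{X}}\), so the exponent above is a negative quadratic form and this integral is a standard Gaussian integral with a closed form. A brief sketch, using generic symbols \({\textbf{A}}\), \({\textbf{b}}\) and c (with \({\textbf{A}}\) symmetric positive definite) that are not part of the model's notation:

$$\begin{aligned} E({\textbf{X}}, {\textbf{Y}})&= {\textbf{X}}^{\textrm{T}} {\textbf{A}} {\textbf{X}} - 2 {\textbf{b}}^{\textrm{T}} {\textbf{X}} + c,\\ \textrm{Z}({\textbf{Y}})&= \int _{{\textbf{X}}} \exp \{-E({\textbf{X}}, {\textbf{Y}})\}\, \textrm{d}{\textbf{X}} = \frac{\pi ^{N/2}}{\sqrt{\det {\textbf{A}}}} \exp \{{\textbf{b}}^{\textrm{T}} {\textbf{A}}^{-1} {\textbf{b}} - c\} \end{aligned}$$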

Since \({\textbf{X}}\) is continuous and the energy function is quadratic in \({\textbf{X}}\), this integral can be computed analytically, as sketched above. This is different from the discrete case, in which approximation methods need to be applied. To predict the depths of a new image, we solve the following maximum a posteriori (MAP) inference problem:

Fig. 5

Illustration of the proposed model. Top left: a fused view of the image and LiDAR point cloud on superpixels. Top right: the surface-normal map and RGB inputs used in the pairwise potentials. Top middle: the graph structure of the CRF, where the yellow nodes represent the centroids of the image superpixels and the green branches represent the connections between them. The outputs of the unary part and the pairwise part are fed to the CRF structured loss layer, which minimizes the corresponding energy function. Bottom left: the probabilistic output, a dense depth map and an uncertainty estimation map (see text for details)

To simplify the solution, one can take the negative logarithm of both sides of the expression for the conditional probability \({\text {Pr}}({\textbf{X}} | {\textbf{Y}})\); the problem of maximizing the conditional probability then becomes an energy minimization problem. That is, maximizing the probability distribution \({\text {Pr}}({\textbf{X}} | {\textbf{Y}})\) is equivalent to minimizing the corresponding energy function:

$$\begin{aligned} {\textbf{x}}^{\star }=\arg \min _{{\textbf{x}}} E({\textbf{X}}, {\textbf{Y}}). \end{aligned}$$

We formulate the energy function as a typical combination of unary potentials U and pairwise potentials V over the nodes (superpixels) \({\mathcal {N}}\) and edges \({\mathcal {S}}\) of the graph:

$$\begin{aligned} E({\textbf{X}}, {\textbf{Y}})=\sum _{p \in {\mathcal {N}}} U\left( x_{p}, {\textbf{y}}\right) +\sum _{(p, q) \in {\mathcal {S}}} V\left( x_{p}, x_{q}, {\textbf{y}}\right) \end{aligned}$$

The unary term U aims to regress the depth value from a single superpixel. The pairwise term V encourages neighbouring superpixels with similar appearances to take similar depths [11, 23].

4.2 Potential functions

The proposed multi-modal depth estimation model is composed of unary and pairwise potentials. For an input image, which has been over-segmented into n superpixels, we define a unary potential for each superpixel. The pairwise potentials are defined over the four-neighbour vicinity of each superpixel. The unary potentials are built by aggregating all LiDAR observations inside each superpixel. The pairwise part is composed of similarity vectors, each with K components, that measure the agreement between different features of neighbouring superpixel pairs. Therefore, we explicitly model the relations between neighbouring superpixels through pairwise potentials. In the following, we describe the details of the potentials involved in our energy function.

4.2.1 Unary potential

The unary potential is constructed from the LiDAR sensor measurements by considering the least square loss between the estimated \(x_{i}\) and observed \(y_{i}\) depth values:

$$\begin{aligned} \Phi ({\textbf{x}}, {\textbf{y}})= & {} \sum _{i \in {\mathscr {L}}} \sigma _{i}\left( x_{i}-y_{i}\right) ^{2}\\ \Phi ({\textbf{x}},{\textbf{y}})= & {} \Vert {\textbf{W}} ({\textbf{x}}-{\textbf{y}})\Vert ^{2} \end{aligned}$$

where \({\mathscr {L}}\) is the set of indices for which a depth measurement is available, and \(\sigma _{i}\) is a constant weight placed on the depth measurements. This potential measures the quadratic distance between the estimated range X and the measured range Y, where available. Finally, in order to write the unary potential in a more efficient matrix form, we define the diagonal matrix W with entries

$$\begin{aligned} {\textbf{W}}_{i, i}= \left\{ \begin{array}{ll} {\sigma _{i}} &{} \quad {\text {if } i \in {\mathscr {L}}} \\ {0} &{} \quad {\text {otherwise }} \end{array}\right. \end{aligned}$$
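As a concrete illustration, the snippet below aggregates the LiDAR returns that fall inside each superpixel into a single observed depth \(y_i\) and builds the diagonal weight matrix \({\textbf{W}}\). It is a minimal sketch under our own naming conventions (e.g. `sigma_unary`), not the authors' implementation.

```python
# Minimal sketch: per-superpixel depth observations and the diagonal weight W.
# `labels` is the superpixel map and `lidar_uvz` holds projected LiDAR points
# as (row, col, depth) triples; names are illustrative assumptions.
import numpy as np
from scipy import sparse

def build_unary(labels, lidar_uvz, sigma_unary=1.0):
    n_sp = labels.max() + 1
    y = np.zeros(n_sp)                  # observed depth per superpixel
    counts = np.zeros(n_sp, dtype=int)  # number of LiDAR returns per superpixel

    for u, v, z in lidar_uvz:
        s = labels[int(u), int(v)]
        y[s] += z
        counts[s] += 1

    observed = counts > 0
    y[observed] /= counts[observed]     # average depth of the returns in the superpixel

    # W_ii = sigma_i if superpixel i has a depth measurement, 0 otherwise.
    w_diag = np.where(observed, sigma_unary, 0.0)
    W = sparse.diags(w_diag)
    return W, y, counts
```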

4.2.2 Colour pairwise potential

We construct a pairwise potential from K types of similarity observations, each of which enforces smoothness by exploiting colour consistency features of the neighbouring superpixels. This pairwise potential can be written as

$$\begin{aligned} \Psi ^{c}({\textbf{x}}, {\textbf{I}})= & {} \sum _{i} \sum _{j \in {\mathscr {N}}(i)} e_{i, j}\left( x_{i}-x_{j}\right) ^{2}\\ \Psi ^{c}({\textbf{x}}, {\textbf{I}})= & {} \Vert {\textbf{S}} {\textbf{x}}\Vert ^{2} \end{aligned}$$

where I is an RGB image, \({\mathscr {N}}(i)\) is the set of horizontal and vertical neighbours of i, and each row of S represents the weighting factors for pairs of adjacent range nodes. As the edge strength between nodes, we use an exponentiated \(L_{2}\) norm of the difference in pixel appearance.

$$\begin{aligned} e_{i, j}=\exp \left( -\frac{\left\| {\textbf{c}}_{i} -{\textbf{c}}_{j}\right\| ^{2}}{\sigma _{d}^{2}}\right) \end{aligned}$$

where \({\textbf{c}}_{i}\) is the RGB colour vector of pixel i and \(\sigma _{d}\) is a tuning parameter. A small value of \(\sigma _{d}\) increases the sensitivity to changes in the image. Thanks to this potential, the lack of content or features in the RGB image is considered by our model as indicative of a homogeneous depth distribution, in other words, a planar surface.
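The following sketch shows one way to realize this potential: compute the Gaussian colour affinity for every edge of the superpixel graph and stack the affinities into a sparse matrix S with one row per edge, so that \(\Vert {\textbf{S}} {\textbf{x}}\Vert ^{2}\) reproduces the weighted sum above. Function and variable names (e.g. `gaussian_edge_weight`) are our own, not the paper's.

```python
# Minimal sketch: Gaussian colour affinities and the sparse edge matrix S.
# `edges` is the neighbour list from the superpixel graph and `features[s]`
# a feature vector per superpixel (here the mean RGB colour); names are ours.
import numpy as np
from scipy import sparse

def gaussian_edge_weight(f_i, f_j, sigma):
    # e_ij = exp(-||f_i - f_j||^2 / sigma^2): close to 1 for similar features.
    return np.exp(-np.sum((np.asarray(f_i) - np.asarray(f_j)) ** 2) / sigma ** 2)

def build_pairwise_matrix(edges, features, weight_fn, n_sp):
    rows, cols, vals = [], [], []
    for e, (i, j) in enumerate(edges):
        w = np.sqrt(weight_fn(features[i], features[j]))  # so ||Sx||^2 = sum e_ij (x_i - x_j)^2
        rows += [e, e]
        cols += [i, j]
        vals += [w, -w]
    return sparse.coo_matrix((vals, (rows, cols)), shape=(len(edges), n_sp)).tocsr()

# Usage with assumed variables:
# S = build_pairwise_matrix(edges, mean_colour,
#                           lambda a, b: gaussian_edge_weight(a, b, sigma_d), n_sp)
```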

4.2.3 Surface-normal pairwise potential

The mathematical formulation of this potential is similar to that of the colour potential; however, the surface-normal potential considers surface-normal similarities instead of colour. The weighting factors \(nr_{i, j}\) are formulated using the cosine similarity, a measure of the similarity between two nonzero vectors of an inner product space given by the cosine of the angle between them. The cosine of 0 is 1, and it is less than 1 for any angle in the interval \((0, \pi ]\) radians. It is thus a measure of orientation rather than magnitude [60]. The cosine of the angle between two nonzero vectors can be obtained from the Euclidean dot product formula:

$$\begin{aligned} {\textbf{A}} \cdot {\textbf{B}}=\Vert {\textbf{A}}\Vert \Vert {\textbf{B}}\Vert \cos \theta \end{aligned}$$

Therefore, the cosine similarity can be expressed by

$$\begin{aligned} \cos (\theta )=\frac{{\textbf{A}} \cdot {\textbf{B}}}{\Vert {\textbf{A}}\Vert \Vert {\textbf{B}}\Vert } = \frac{\sum _{t=1}^{n} A_{t} B_{t}}{\sqrt{\sum _{t=1}^{n} A_{t}^{2}} \sqrt{\sum _{t=1}^{n} B_{t}^{2}}} \end{aligned}$$

where \(A_t\) and \(B_t\) are the components of vectors A and B, respectively. Finally, we define our surface-normal potential by the following equations:

$$\begin{aligned} \Psi ^{n}({\textbf{x}}, \textbf{In})= & {} \sum _{i} \sum _{j \in {\mathscr {N}}(i)} nr_{i, j}\left( x_{i}-x_{j}\right) ^{2}\\ \Psi ^{n}({\textbf{x}}, \textbf{In})= & {} \Vert {\textbf{P}} {\textbf{x}}\Vert ^{2}\\ nr_{i, j}= & {} \frac{\sum _{t=1}^{n} In_{i,t}\, In_{j,t}}{\sqrt{\sum _{t=1}^{n} In_{i,t}^{2}} \sqrt{\sum _{t=1}^{n} In_{j,t}^{2}}} \end{aligned}$$
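A short sketch of the corresponding weight computation, assuming `normal[s]` holds a surface-normal vector per superpixel estimated from the sparse depth samples; the names are our own.

```python
# Minimal sketch: cosine-similarity weight between the surface normals of two
# neighbouring superpixels; `normal[s]` is assumed to be a 3-vector per superpixel.
import numpy as np

def normal_edge_weight(n_i, n_j, eps=1e-8):
    cos_sim = np.dot(n_i, n_j) / (np.linalg.norm(n_i) * np.linalg.norm(n_j) + eps)
    # Clamp to [0, 1] so the value can serve directly as a non-negative edge weight.
    return float(max(cos_sim, 0.0))

# P = build_pairwise_matrix(edges, normal, normal_edge_weight, n_sp)
```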

4.2.4 Depth pairwise potential

This pairwise potential encodes a smoothness prior over depth estimates which encourages neighbouring superpixels in the image to have similar depths. Usually, pairwise potentials are related only to the colour difference between pairs of superpixels. However, depth smoothness is a valid hypothesis which can potentially enhance depth inference. To enforce depth smoothness, we adopt a distance-aware Potts model: neighbouring points with smaller distances are considered more likely to have the same depth. The mathematical formulation of this potential is similar to that of the colour pairwise potential:

$$\begin{aligned} \Psi ^{d}({\textbf{x}}, {\textbf{D}})=\sum _{i} \sum _{j \in {\mathscr {N}}(i)} dp_{i, j}\left( x_{i}-x_{j}\right) ^{2} \end{aligned}$$

and the weighting factor \(dp_{i, j}\) for this case is formulated as

$$\begin{aligned} dp_{i, j}=\exp \left( -\frac{\left\| {\textbf{p}}_{i} -{\textbf{p}}_{j}\right\| ^{2}}{\sigma _{p}^{2}}\right) \end{aligned}$$

where \({\textbf{p}}_{i}\) is the 3D location vector of the LiDAR point i and \(\sigma _{p}\) is a parameter controlling the strength of enforcing close points to have similar depth values.
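Because this weight has the same Gaussian form as the colour affinity, the earlier sketch can be reused directly; here `centroid3d[s]` is assumed to be the mean 3D position of the LiDAR points falling in superpixel s (an illustrative name, not the paper's).

```python
# Reusing the earlier sketch with assumed variable names.
D = build_pairwise_matrix(edges, centroid3d,
                          lambda a, b: gaussian_edge_weight(a, b, sigma_p), n_sp)
```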

4.2.5 Uncertainty potential

Depth uncertainty estimation is important for refining depth estimates [16, 65] and in safety-critical systems [28]. It allows an agent to identify unknowns in an environment in order to reach optimal decisions. Our method provides uncertainties for the pixel-wise depth estimates by taking into account the number of LiDAR points present in each superpixel. The uncertainty potential is similar to the unary potential: it is constructed from the number of LiDAR points projected onto each superpixel and employs the following least-squares loss:

$$\begin{aligned} U^{c}({\textbf{x}}, {\textbf{y}})= & {} \sum _{i \in {\mathscr {L}}} \sigma _{i}\left( x_{i}-unc_{i}\right) ^{2}\\ U^{c}({\textbf{x}}, {\textbf{y}})= & {} \Vert {\textbf{W}} ({\textbf{x}}-\textbf{unc})\Vert ^{2} \end{aligned}$$

where \(\textbf{unc}\) is defined as follows:

$$\begin{aligned} \textbf{unc}_{i}= \left\{ \begin{array}{ll} {\sigma _{i}} &{} \quad {\text {if no point } P \text { is projected onto } SPx_{i}}\\ {\psi _{i}}&{} \quad {\text {if exactly one point } P \text { is projected onto } SPx_{i}}\\ {mean} &{} \quad {\text {otherwise}}\end{array}\right. \end{aligned}$$

where P is a 3D LiDAR point and \(SPx_{i}\) denotes the i-th superpixel. In locations with accurate and sufficiently many LiDAR points, the model produces depth predictions with high confidence. This uncertainty estimate measures how confident the model is about each depth estimate. The result is better overall performance, since estimates with high uncertainty can be discarded by the higher-level tasks that use the estimated depth maps as input.
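A possible realization of this rule, reusing the per-superpixel LiDAR counts computed for the unary potential; `sigma_unc`, `psi` and `unc_mean` are illustrative parameters standing in for \(\sigma _{i}\), \(\psi _{i}\) and the "mean" case above, not the authors' exact values.

```python
# Minimal sketch: per-superpixel uncertainty targets from LiDAR point counts.
# `counts` comes from build_unary above; parameter names are illustrative.
import numpy as np

def build_uncertainty(counts, sigma_unc, psi, unc_mean):
    unc = np.empty(len(counts), dtype=float)
    unc[counts == 0] = sigma_unc   # no LiDAR return in the superpixel
    unc[counts == 1] = psi         # a single LiDAR return
    unc[counts >= 2] = unc_mean    # two or more returns ("mean" case)
    return unc
```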

4.3 Optimization

With the unary and the pairwise potentials defined, we can now write the energy function as

$$\begin{aligned} E({\textbf{X}}, {\textbf{Y}})= \alpha \, \Phi ({\textbf{x}}, {\textbf{y}}) + \beta \, \Psi ^{c}({\textbf{x}}, {\textbf{I}}) + \gamma \, \Psi ^{n}({\textbf{x}}, \textbf{In}) + \delta \, \Psi ^{d}({\textbf{x}}, {\textbf{D}}) \end{aligned}$$
(1)

The scalars \(\alpha \), \(\beta \), \(\gamma \), \(\delta \) \(\in \) [0,1] are weightings for the four terms. We may further expand the unary and pairwise potentials to

$$\begin{aligned}{} & {} \Phi ({\textbf{x}}, {\textbf{y}})= \alpha ({\textbf{x}}^{\textrm{T}} {\textbf{W}}^{\textrm{T}} {\textbf{W}} {\textbf{x}}-2 {\textbf{y}}^{\textrm{T}} {\textbf{W}}^{\textrm{T}} {\textbf{W}} {\textbf{x}}+{\textbf{y}}^{\textrm{T}} {\textbf{W}}^{\textrm{T}} {\textbf{W}} {\textbf{y}}) \end{aligned}$$
(2)
$$\begin{aligned}{} & {} \Psi ^{c}({\textbf{x}}, {\textbf{I}}) = \beta ({\textbf{x}}^{\textrm{T}} {\textbf{S}}^{\textrm{T}} {\textbf{S}} {\textbf{x}}) \end{aligned}$$
(3)
$$\begin{aligned}{} & {} \Psi ^{n}({\textbf{x}}, \textbf{In})=\gamma ({\textbf{x}}^{\textrm{T}} {\textbf{P}}^{\textrm{T}} {\textbf{P}} {\textbf{x}}) \end{aligned}$$
(4)
$$\begin{aligned}{} & {} \Psi ^{d}({\textbf{x}}, {\textbf{D}})=\delta ({\textbf{x}}^{\textrm{T}} {\textbf{D}}^{\textrm{T}} {\textbf{D}} {\textbf{x}}) \end{aligned}$$
(5)

We shall pose the problem as one of finding the optimal range vector \({\textbf{x}}^{*}\) such that:

$$\begin{aligned} {\textbf{x}}^{*}=\underset{{\textbf{x}}}{{\text {argmin}}}\left\{ E({\textbf{X}}, {\textbf{Y}}) \right\} \end{aligned}$$

Substituting equations 2, 3, 4 and 5 into equation 1, taking the derivative of the energy with respect to \({\textbf{x}}\) and setting it to zero reduces the problem to the linear system \( \textbf{A x}={\textbf{b}}\), where

$$\begin{aligned} {\textbf{A}}= & {} \alpha ({\textbf{W}}^{\textrm{T}}{\textbf{W}})+ \beta ({\textbf{S}}^{\textrm{T}}{\textbf{S}}) + \gamma ({\textbf{P}}^{\textrm{T}}{\textbf{P}}) + \delta ({\textbf{D}}^{\textrm{T}}{\textbf{D}})\\ {\textbf{b}}= & {} \alpha ({\textbf{W}}^{\textrm{T}} {\textbf{W}} {\textbf{y}}) \end{aligned}$$

All we need to do to perform the optimization is solve a large sparse linear system. Methods for solving sparse systems fall into two categories: direct and iterative. Direct methods are robust but require large amounts of memory as the size of the problem grows. Iterative methods, on the other hand, offer better performance but may exhibit numerical problems [10, 20]. In the present paper, the conjugate gradient squared algorithm, a fast variant of the conjugate gradient method of Hestenes and Stiefel [25, 64], is employed to solve the energy minimization problem.
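Putting the pieces together, the following sketch assembles A and b from the matrices built in the earlier snippets and solves the sparse system with SciPy's conjugate gradient solver (`scipy.sparse.linalg.cg`); the weights and all variable names are illustrative, and a CGS-type solver could be substituted via `scipy.sparse.linalg.cgs`.

```python
# Minimal sketch: assemble the normal equations A x = b and solve them with a
# conjugate gradient solver. W, S, P, D, y come from the earlier sketches;
# the weights alpha..delta are illustrative hyperparameters.
from scipy.sparse.linalg import cg

def solve_depth(W, S, P, D, y, alpha=1.0, beta=0.5, gamma=0.5, delta=0.5):
    A = (alpha * (W.T @ W) + beta * (S.T @ S)
         + gamma * (P.T @ P) + delta * (D.T @ D)).tocsr()
    b = alpha * (W.T @ (W @ y))
    x, info = cg(A, b)           # info == 0 signals convergence
    if info != 0:
        raise RuntimeError("conjugate gradient did not converge")
    return x                     # per-superpixel depth estimates
```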

4.4 Pseudo-code

Algorithm 1 provides the complete pseudo-code for our proposed framework, illustrated earlier in Fig. 5. In this algorithm, lines 1 to 5 perform the preprocessing, which includes gathering the multi-modal raw data, building a connection graph between pairs of adjacent superpixels and projecting the clustered LiDAR points onto the image space. Lines 6 to 11 constitute the core of the approach: they construct the cost function from the different potentials (unary and pairwise) to obtain the complete CRF for depth estimation. The objective of the pairwise potentials is to smooth the depth regressed from the unary part based on the neighbouring superpixels. The pairwise potential functions are based on standard CRF vertex and edge feature functions, studied extensively in [52] and other works. Our model uses both the content information of the superpixels and the relations between them to infer the depth.

Algorithm 1

UAOFusion Network: A multi-modal CRF-based method for Camera–LiDAR depth estimation
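The algorithm listing itself is only available as a figure, so the following outline is our own high-level reconstruction from the textual description above (not the authors' exact pseudo-code); it simply chains the sketches from the previous subsections, and all names are assumptions.

```python
# High-level reconstruction of Algorithm 1 from the textual description; assumed names.
import numpy as np

def crf_depth_estimation(rgb, lidar_uvz, normals, params):
    """`lidar_uvz` holds LiDAR points already projected onto the image (Sect. 3)
    and `normals[s]` a surface normal per superpixel."""
    # Preprocessing (Algorithm 1, lines 1-5): superpixel graph and per-superpixel features.
    labels, centroids, edges = superpixel_neighbour_graph(rgb, params["n_segments"])
    n_sp = labels.max() + 1
    mean_colour = np.array([rgb[labels == s].mean(axis=0) for s in range(n_sp)])

    # Core (lines 6-11): potentials, energy minimization and uncertainty.
    W, y, counts = build_unary(labels, lidar_uvz, params["sigma_unary"])
    S = build_pairwise_matrix(edges, mean_colour,
                              lambda a, b: gaussian_edge_weight(a, b, params["sigma_d"]), n_sp)
    P = build_pairwise_matrix(edges, normals, normal_edge_weight, n_sp)
    # Note: the paper uses the 3D LiDAR point locations here; image-plane
    # centroids are used in this sketch for brevity.
    D = build_pairwise_matrix(edges, centroids,
                              lambda a, b: gaussian_edge_weight(a, b, params["sigma_p"]), n_sp)
    x = solve_depth(W, S, P, D, y, params["alpha"], params["beta"],
                    params["gamma"], params["delta"])
    unc = build_uncertainty(counts, params["sigma_unc"], params["psi"], params["unc_mean"])
    # Dense maps: each pixel takes its superpixel's depth and uncertainty.
    return x[labels], unc[labels]
```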

5 Results and discussion

We evaluate our approach on the raw sequences of the KITTI benchmark, a popular dataset for single-image depth map prediction. The sequences contain stereo imagery taken from a car driving in an urban scenario. The dataset also provides 3D laser measurements from a Velodyne laser scanner, which we use as ground truth (projected into the stereo images using the intrinsics and extrinsics given in KITTI). This dataset has been used to train and evaluate the state-of-the-art methods and allows quantitative comparisons. First, we evaluate the prediction accuracy of our proposed method with different potentials in Sect. 5.2. Second, in Sect. 5.3 we explore the impact of the number of sparse depth samples and the number of superpixels on the depth estimation. Third, Sect. 5.5 compares our approach to state-of-the-art methods on the KITTI dataset. Lastly, in Sects. 5.6 and 5.7, we demonstrate two use cases of our proposed algorithm: one creating LiDAR super-resolution from sensor data provided by the KITTI dataset, and another using a dataset collected in the context of this work.

5.1 Evaluation metrics

We evaluate the accuracy of our method in depth prediction using the 3D laser ground truth on the test images. We use the following depth evaluation metrics: root-mean-squared error (RMSE), mean absolute error (MAE) and mean absolute relative error (REL). Among these, RMSE is the most important indicator and is chosen to rank submissions on the leader-board, since it measures error directly on depth and penalizes errors at larger distances, where depth measurement is more challenging. These metrics were also used by [13, 17, 30, 53] to estimate the accuracy of monocular depth prediction.

$$\begin{aligned} RMSE= & {} \sqrt{\frac{1}{|T|} \sum _{d \in T}\Vert {\hat{d}}-d\Vert ^{2}}\\ MAE= & {} \frac{1}{|T|} \sum _{d \in T}\left| {\hat{d}}-d\right| \\ REL= & {} \frac{1}{|T|} \sum _{d \in T}\left( \frac{\left| {\hat{d}} -d\right| }{{\hat{d}}}\right) \end{aligned}$$

Here, d is the ground truth depth, \({\hat{d}}\) is the estimated depth, and T denotes the set of all points in the test set images. In order to compare our results with those of Eigen et al. [13] and Godard et al. [21], we crop our image to the evaluation crop applied by Eigen et al. We also use the same resolution of the ground truth depth image and cap the predicted depth at 80 m [21].
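For reference, a straightforward implementation of these metrics under the stated 80 m cap is sketched below; array names are our own, and the relative error follows the definition given above.

```python
# Minimal sketch: RMSE, MAE and REL over valid ground-truth pixels, with the
# predicted depth capped at 80 m as described above. Array names are illustrative.
import numpy as np

def depth_metrics(pred, gt, cap=80.0):
    valid = gt > 0                       # KITTI ground truth is sparse; 0 marks missing depth
    d_hat = np.clip(pred[valid], None, cap)
    d = gt[valid]
    rmse = np.sqrt(np.mean((d_hat - d) ** 2))
    mae = np.mean(np.abs(d_hat - d))
    rel = np.mean(np.abs(d_hat - d) / d_hat)   # relative error as defined in the text
    return rmse, mae, rel
```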

5.2 Architecture evaluation

This section presents an empirical study of how different choices of potential functions and hyperparameters affect the accuracy of the depth prediction. In the first experiment, we compare the impact of sequentially adding our proposed pairwise potentials. We first evaluate a model with only unary and colour pairwise potentials. Then, we add the surface-normal pairwise potential and, finally, the depth pairwise potential. As shown in Table 1, the RMSE improves after adding each pairwise potential.

Table 1 Depth completion errors [mm] after adding pairwise potentials (lower is better)
Fig. 6

Qualitative evaluation of the impact of the pairwise potentials defined as CRF terms. In row order: first, pairwise potential I penalizes dissimilar depth estimates of neighbouring pixels which have similar colours in the RGB image; second, pairwise potential II penalizes depth differences between neighbouring superpixels whose surface-normal vectors have large cosine similarities; and third, pairwise potential III penalizes neighbouring superpixels with large observed depth differences

5.3 The number of superpixels

In this section, we explore how the prediction accuracy relates to the number of available depth samples and the number of superpixels.

Fig. 7

Visual comparison of dense depth maps produced by the CRF framework when varying the number of superpixels. From top to bottom: 1200, 2400 and 5500 superpixels

As displayed in Table 2, a greater number of superpixels yields lower error measurements. Although a larger number of sparse depth observations improves the quality of the depth map, the performance converges when the number of superpixels exceeds 5000, which is about 1.5% of the total number of pixels. We ran an exhaustive evaluation of our method for different numbers of superpixels. Figure 8 clearly shows that our method's error decreases as the number of superpixels increases.

Table 2 Depth completion errors [mm] for different numbers of superpixels (lower is better)
Fig. 8

Convergence for different numbers of superpixels

5.4 Sub-sampling 3D depth points

We performed a quantitative analysis of the impact of 3D point sparsity on the error of our proposed method by decreasing the number of 3D points considered during inference. As shown in Fig. 9, our method's error decreases as the number of 3D depth points increases, enabling a better dense depth map estimation. From Fig. 9, we can argue that the number of 3D depth points is critical for an accurate depth map estimation. Even though our estimation error increases with the sparsity of the depth observations, our method manages to provide state-of-the-art performance even when the depth observations are sampled down to 40%.

Fig. 9

Convergence for different numbers of 3D depth points projected onto the 2D RGB images. The number of 3D depth points used is expressed as a percentage (\(\%\)); 100% means that all the 3D depth points projected onto the 2D image are used

5.5 Algorithm evaluation for depth completion

The KITTI dataset is more challenging for depth estimation than other datasets, e.g. the NYU-Depth-V2 dataset, because the sensed distances are considerably larger. The performance of our method and of other existing methods on the KITTI dataset is shown in Table 3, which shows that the proposed method outperforms other depth map estimation approaches that are well accepted in the robotics community. Our model relies on the number of superpixels and on the resolution of the input data sources; its performance therefore increases with the number of superpixels, the image resolution and the density of the LiDAR data.

Table 3 Depth completion errors [mm] by different methods on the test set of KITTI depth completion benchmark (lower is better)
Fig. 10

Depth completion and uncertainty estimates of our approach on the KITTI raw test set. From top to bottom: RGB and raw depth projected onto the image; high-resolution depth map; raw uncertainty; and estimated uncertainty map

5.6 Algorithm evaluation for LiDAR super-resolution

We present another demonstration of our method: super-resolution of LiDAR measurements. 3D LiDARs have a low vertical angular resolution and thus generate vertically sparse point clouds. We use all measurements in the sparse depth image together with the RGB image as input to our framework. An example is shown in Fig. 11: the cars are much more recognizable in the prediction than in the raw scans.

Fig. 11

LiDAR super-resolution. Creating dense point clouds from sparse raw measurements. From top to bottom: RGB image, raw depth map, predicted depth and ground truth depth map. Distant cars are almost invisible in the raw depth map, but are easily recognizable in the predicted depth map

Starting from a LiDAR super-resolution map, we can also generate a 3D reconstruction of the scene. The reconstruction of three-dimensional (3D) scenes has many important applications, such as autonomous navigation [24], environmental monitoring [46] and other computer vision tasks [26]. A dense and accurate model of the environment is therefore crucial for autonomous vehicles; in fact, imprecise representations of the vehicle's surroundings may lead to unexpected situations that could endanger the passengers. In this paper, the 3D model is generated from a combination of image and range data, a sensor fusion approach that exploits the strengths of each modality to overcome the limitations of the other: images normally have higher resolution and richer visual information than range data, whereas range data are noisy, sparse and contain less visual information but directly provide 3D structure. The qualitative and quantitative results presented here suggest that our system provides 3D reconstructions of reasonable quality. Following [35, 41], similar techniques could be explored as a way of improving our system's robustness to challenging environmental conditions.