1 Introduction

Autonomous robots are composed of different modules that allow them to perceive, learn, decide and act within their environment. The perception module processes cues that inform the robot about the appearance and geometry of the environment. Particularly when working in outdoor or underground scenarios, these cues must be robust to unseen phenomena. A fully autonomous robot must execute all operations, monitor itself and handle unprecedented events and conditions, such as unexpected objects and debris on the road, unseen environments and adverse weather. Reliable and robust perception of the surrounding environment is therefore one of the key tasks in autonomous robotics.

Among the main inputs for a perception system are the distances of the robot from multiple points in its environment. This input can be obtained directly from a sensor or estimated by a depth estimation module. Depth estimation can be performed by processing monocular camera images, stereo vision, radar or LiDAR (Light Detection and Ranging) sensors, among others. Although monocular cameras can only be used to generate depth information up to a scale, they remain an important component of a depth estimation system due to their low price and the rich appearance data they provide. A monocular camera is small, low-cost and energy efficient, but it is very sensitive to changes in illumination. Additionally, the accuracy and reliability of depth estimation methods based on monocular images alone is still far from being practical. For instance, state-of-the-art methods that fuse RGB images with depth data [63] typically require the fused modalities to have similar coverage densities. Our proposed framework provides a procedure for fusing LiDAR and image data that is independent of the LiDAR data’s density. The main contribution of this paper is a depth regression model that takes both a sparse set of depth samples and RGB images as inputs and predicts a full-resolution depth map. This is achieved by modelling the problem of fusing low-resolution depth images with high-resolution camera images as a conditional random field (CRF).

The intuition behind our CRF formulation is that depth discontinuities in a scene often co-occur with changes in colour or brightness in the associated camera image. Since the camera image is commonly available at a much higher resolution, this insight can be used to enhance the resolution and accuracy of the depth image. Our approach produces a depth map from three input features, as illustrated in Fig. 1. The first is an RGB colour image from the camera sensor, top image in Fig. 1 (a). The second is a 2D sparse depth map captured by a LiDAR sensor, middle image in Fig. 1 (b). The third is a surface-normal map generated from the sparse depth samples, bottom image in Fig. 1 (c).

The rest of this paper is organized as follows. Section 2 reviews related work on depth estimation. Section 3 explains how the LiDAR points and the camera images were registered. In Sect. 4, we first introduce our CRF-Fusion framework, and then, we provide a detailed explanation of the proposed model: the energy potentials that compose our CRF model and its inference machine. The experimental validation, performed on the KITTI dataset, is reported in Sect. 5. Finally, conclusions and directions for future work are listed in Sect. 6.

Fig. 1

Input features of our framework. We developed a CRF regression model to predict a dense depth image from a single RGB image, and a set of sparse depth samples: (a), (b) and (c) are the input RGB image, a set of sparse depth samples projected on the image plane and the projected surface normals, respectively

2 Related work

Depth estimation from monocular images is a long-standing problem in computer vision. Early works on depth estimation using RGB images usually relied on handcrafted features and inference on probabilistic graphical models. Classical methods include shape-from-shading [72] and shape-from-defocus [62]. Other early methods were based on hand-tuned models or assumptions about the orientations of the surfaces [3].

Figure 3 illustrates the closest neighbours of two superpixel nodes, whose centroids are represented by yellow dots.

Fig. 3

Two examples of superpixels and their neighbours. The red dots in A and B are nodes assigned to two different superpixels, while the green lines represent their corresponding nearest neighbours. Note that the superpixel segmentation used allows a superpixel node to have more than four neighbours

Figure 4 shows the 4 closest neighbours selected for each superpixel. These neighbours are represented by dark green lines and dark yellow dots. Note that for superpixels located at the corners or on the edges of the image, only the 2 or 3 closest neighbours, respectively, need to be found.

Fig. 4

Selected 4 nearest neighbours for superpixel nodes (red dots) in Fig. 3. Dark green lines connect nodes with their selected neighbours
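To make the neighbourhood construction concrete, the following sketch segments an image into superpixels and connects each superpixel centroid to its four nearest neighbours. It is a minimal illustration of the procedure described above, assuming the SLIC implementation from scikit-image and a KD-tree from SciPy; the function and variable names are our own and not taken from the paper.

```python
# Minimal sketch: superpixel segmentation and 4-nearest-neighbour graph.
# Assumes scikit-image and SciPy are available; names are illustrative only.
import numpy as np
from skimage.segmentation import slic
from scipy.spatial import cKDTree

def superpixel_neighbour_graph(rgb, n_segments=2400, k=4):
    # Over-segment the image into superpixels.
    labels = slic(rgb, n_segments=n_segments, compactness=10, start_label=0)
    n_sp = labels.max() + 1

    # Compute the centroid (row, col) of each superpixel.
    rows, cols = np.indices(labels.shape)
    centroids = np.zeros((n_sp, 2))
    for s in range(n_sp):
        mask = labels == s
        centroids[s] = [rows[mask].mean(), cols[mask].mean()]

    # Connect every centroid to its k nearest neighbouring centroids
    # (boundary superpixels simply reuse the k closest ones found).
    tree = cKDTree(centroids)
    _, idx = tree.query(centroids, k=k + 1)   # first hit is the node itself
    edges = [(i, j) for i in range(n_sp) for j in idx[i, 1:]]
    return labels, centroids, edges
```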

4 CRF-based camera–LiDAR fusion for depth estimation

In this paper, depth estimation is formulated as a superpixel-level inference task on a modified conditional random field (CRF). Our proposed model is a multi-sensor extension of the classical pairwise CRF. In this section, we first briefly introduce the CRF model and then show how the information from an image and a sparse LiDAR point cloud is fused within our novel CRF framework.

4.1 Overview

The conditional random field (CRF) is a type of undirected probabilistic graphical model widely used for solving labelling problems. Formally, let \({\textbf{X}}=\left\{ X_{1}, X_{2}, \ldots , X_{N}\right\} \) be a set of random variables to be inferred from an observation or input tensor \({\textbf{Y}}\), which in turn is composed of the observation variables \(c_{i}\) and \(y_{i}\), where i is an index over superpixels. For each superpixel i, the variable \(c_{i}\) corresponds to an observed three-dimensional colour value and \(y_{i}\) is an observed range measurement. The goal of our framework is to infer the depth of each pixel in a single image depicting general scenes. Following the work of [11, 45], we make the common assumption that an image is composed of small homogeneous regions (superpixels) and consider a graphical model composed of nodes defined on superpixels. Note that our framework is flexible and can estimate depth values on either pixels or superpixels.

The remaining question is how to parametrize this undirected graph. Because the interactions between adjacent nodes in the graph are not directed, there is no reason to use a standard conditional probability distribution (CPD), in which one represents the distribution over one node given the others. Rather, we need a more symmetric parametrization. Intuitively, we want our model to capture the affinities between the depth estimates of the superpixels in a given neighbourhood. These affinities can be captured as follows: let \({\tilde{P}}(X,Y)\) be an unnormalized Gibbs joint distribution parametrized as a product of factors \(\Phi \), where

$$\begin{aligned} \Phi =\left\{ \phi _{1}\left( D_{1}\right) , \ldots , \phi _{k}\left( D_{k}\right) \right\} , \end{aligned}$$

and

$$\begin{aligned} {\tilde{P}}(X,Y)=\prod _{i=1}^{m} \phi _{i}\left( D_{i}\right) . \end{aligned}$$

We can then write a conditional probability distribution of the depth estimates X given the observations Y using the previously introduced Gibbs distribution, as follows:

$$\begin{aligned} Pr(X | Y)=\frac{{\tilde{P}}(X, Y)}{Z(Y)} \end{aligned}$$

where

$$\begin{aligned} Z(Y)=\sum _{X} {\tilde{P}}(X, Y). \end{aligned}$$

Here, Z(Y), also known as ‘the partition function’, works as a normalizing factor which marginalizes X from \({\tilde{P}}(X,Y)\), allowing the calculation of the probability distribution P(X|Y):

$$\begin{aligned} P(X | Y)=\frac{1}{\sum _{X} {\tilde{P}}(X, Y)} {\tilde{P}}(X,Y). \end{aligned}$$

Therefore, similar to conventional CRFs, we model the conditional probability distribution of the data with the following density function:

$$\begin{aligned} {\text {P}}({\textbf{X}} | {\textbf{Y}})=\frac{1}{\textrm{Z}({\textbf{Y}})} \exp (-E({\textbf{X}}, {\textbf{Y}})) \end{aligned}$$

where E is the energy function and Z is the partition function defined by

$$\begin{aligned} \textrm{Z}({\textbf{Y}})=\int _{{\textbf{X}}} \exp \{-E({\textbf{X}}, {\textbf{Y}})\}\, \textrm{d}{\textbf{X}}. \end{aligned}$$
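All of the potentials introduced in Sect. 4.2 are quadratic in \({\textbf{X}}\), so the exponent above is a negative quadratic form and this integral is a standard Gaussian integral with a closed form. A brief sketch, using generic symbols \({\textbf{A}}\), \({\textbf{b}}\) and c (with \({\textbf{A}}\) symmetric positive definite) that are not part of the model's notation:

$$\begin{aligned} E({\textbf{X}}, {\textbf{Y}})&= {\textbf{X}}^{\textrm{T}} {\textbf{A}} {\textbf{X}} - 2 {\textbf{b}}^{\textrm{T}} {\textbf{X}} + c,\\ \textrm{Z}({\textbf{Y}})&= \int _{{\textbf{X}}} \exp \{-E({\textbf{X}}, {\textbf{Y}})\}\, \textrm{d}{\textbf{X}} = \frac{\pi ^{N/2}}{\sqrt{\det {\textbf{A}}}} \exp \{{\textbf{b}}^{\textrm{T}} {\textbf{A}}^{-1} {\textbf{b}} - c\} \end{aligned}$$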

Since \({\textbf{X}}\) is continuous and the energy function is quadratic in \({\textbf{X}}\), this integral can be computed analytically, as sketched above. This is different from the discrete case, in which approximation methods need to be applied. To predict the depths of a new image, we solve the following maximum a posteriori (MAP) inference problem:

Fig. 5

Illustration of the proposed model. Top left: a fused view of the image and LiDAR point cloud on superpixels. Top right: the surface-normal map and RGB inputs used in the pairwise potentials. Top middle: the graph structure of the CRF, where the yellow nodes represent the centroids of the image superpixels and the green branches represent the connections between them. The outputs of the unary part and the pairwise part are fed to the CRF structured loss layer, which minimizes the corresponding energy function. Bottom left: the probabilistic output, a dense depth map and an uncertainty estimation map (see text for details)

To simplify the solution, one can take the negative logarithm of both sides of the expression for the conditional probability \({\text {Pr}}({\textbf{X}} | {\textbf{Y}})\); the problem of maximizing the conditional probability then becomes an energy minimization problem. That is, maximizing the probability distribution \({\text {Pr}}({\textbf{X}} | {\textbf{Y}})\) is equivalent to minimizing the corresponding energy function:

$$\begin{aligned} {\textbf{x}}^{\star }=\arg \min _{{\textbf{x}}} E({\textbf{X}}, {\textbf{Y}}). \end{aligned}$$

We formulate the energy function as a typical combination of unary potentials U and pairwise potentials V over the nodes (superpixels) \({\mathcal {N}}\) and edges \({\mathcal {S}}\) of the graph:

$$\begin{aligned} E({\textbf{X}}, {\textbf{Y}})=\sum _{p \in {\mathcal {N}}} U\left( x_{p}, {\textbf{y}}\right) +\sum _{(p, q) \in {\mathcal {S}}} V\left( x_{p}, x_{q}, {\textbf{y}}\right) \end{aligned}$$

The unary term U aims to regress the depth value from a single superpixel. The pairwise term V encourages neighbouring superpixels with similar appearances to take similar depths [11, 23].

4.2 Potential functions

The proposed multi-modal depth estimation model is composed of unary and pairwise potentials. For an input image, which has been over-segmented into n superpixels, we define a unary potential for each superpixel. The pairwise potentials are defined over the four-neighbour vicinity of each superpixel. The unary potentials are built by aggregating all LiDAR observations inside each superpixel. The pairwise part is composed of similarity vectors, each with K components, that measure the agreement between different features of neighbouring superpixel pairs. Therefore, we explicitly model the relations between neighbouring superpixels through pairwise potentials. In the following, we describe the details of the potentials involved in our energy function.

4.2.1 Unary potential

The unary potential is constructed from the LiDAR sensor measurements by considering the least square loss between the estimated \(x_{i}\) and observed \(y_{i}\) depth values:

$$\begin{aligned} \Phi ({\textbf{x}}, {\textbf{y}})= & {} \sum _{i \in {\mathscr {L}}} \sigma _{i}\left( x_{i}-y_{i}\right) ^{2}\\ \Phi ({\textbf{x}},{\textbf{y}})= & {} \Vert {\textbf{W}} ({\textbf{x}}-{\textbf{y}})\Vert ^{2} \end{aligned}$$

where \({\mathscr {L}}\) is the set of indices for which a depth measurement is available, and \(\sigma _{i}\) is a constant weight placed on the depth measurements. This potential measures the quadratic distance between the estimated range X and the measured range Y, where available. Finally, in order to write the unary potential in a more efficient matrix form, we define the diagonal matrix W with entries

$$\begin{aligned} {\textbf{W}}_{i, i}= \left\{ \begin{array}{ll} {\sigma _{i}} &{} \quad {\text {if } i \in {\mathscr {L}}} \\ {0} &{} \quad {\text {otherwise }} \end{array}\right. \end{aligned}$$
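As a concrete illustration, the snippet below aggregates the LiDAR returns that fall inside each superpixel into a single observed depth \(y_i\) and builds the diagonal weight matrix \({\textbf{W}}\). It is a minimal sketch under our own naming conventions (e.g. `sigma_unary`), not the authors' implementation.

```python
# Minimal sketch: per-superpixel depth observations and the diagonal weight W.
# `labels` is the superpixel map and `lidar_uvz` holds projected LiDAR points
# as (row, col, depth) triples; names are illustrative assumptions.
import numpy as np
from scipy import sparse

def build_unary(labels, lidar_uvz, sigma_unary=1.0):
    n_sp = labels.max() + 1
    y = np.zeros(n_sp)                  # observed depth per superpixel
    counts = np.zeros(n_sp, dtype=int)  # number of LiDAR returns per superpixel

    for u, v, z in lidar_uvz:
        s = labels[int(u), int(v)]
        y[s] += z
        counts[s] += 1

    observed = counts > 0
    y[observed] /= counts[observed]     # average depth of the returns in the superpixel

    # W_ii = sigma_i if superpixel i has a depth measurement, 0 otherwise.
    w_diag = np.where(observed, sigma_unary, 0.0)
    W = sparse.diags(w_diag)
    return W, y, counts
```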

4.2.2 Colour pairwise potential

We construct a pairwise potential from K types of similarity observations, each of which enforces smoothness by exploiting colour consistency features of the neighbouring superpixels. This pairwise potential can be written as

$$\begin{aligned} \Psi ^{c}({\textbf{x}}, {\textbf{I}})= & {} \sum _{i} \sum _{j \in {\mathscr {N}}(i)} e_{i, j}\left( x_{i}-x_{j}\right) ^{2}\\ \Psi ^{c}({\textbf{x}}, {\textbf{I}})= & {} \Vert {\textbf{S}} {\textbf{x}}\Vert ^{2} \end{aligned}$$

where I is an RGB image, \({\mathscr {N}}(i)\) is the set of horizontal and vertical neighbours of i, and each row of S represents the weighting factors for pairs of adjacent range nodes. As the edge strength between nodes, we use an exponentiated \(L_{2}\) norm of the difference in pixel appearance.

$$\begin{aligned} e_{i, j}=\exp \left( -\frac{\left\| {\textbf{c}}_{i} -{\textbf{c}}_{j}\right\| ^{2}}{\sigma _{d}^{2}}\right) \end{aligned}$$

where \({\textbf{c}}_{i}\) is the RGB colour vector of pixel i and \(\sigma _{d}\) is a tuning parameter. A small value of \(\sigma _{d}\) increases the sensitivity to changes in the image. Thanks to this potential, the lack of content or features in the RGB image is considered by our model as indicative of a homogeneous depth distribution, in other words, a planar surface.
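The following sketch shows one way to realize this potential: compute the Gaussian colour affinity for every edge of the superpixel graph and stack the affinities into a sparse matrix S with one row per edge, so that \(\Vert {\textbf{S}} {\textbf{x}}\Vert ^{2}\) reproduces the weighted sum above. Function and variable names (e.g. `gaussian_edge_weight`) are our own, not the paper's.

```python
# Minimal sketch: Gaussian colour affinities and the sparse edge matrix S.
# `edges` is the neighbour list from the superpixel graph and `features[s]`
# a feature vector per superpixel (here the mean RGB colour); names are ours.
import numpy as np
from scipy import sparse

def gaussian_edge_weight(f_i, f_j, sigma):
    # e_ij = exp(-||f_i - f_j||^2 / sigma^2): close to 1 for similar features.
    return np.exp(-np.sum((np.asarray(f_i) - np.asarray(f_j)) ** 2) / sigma ** 2)

def build_pairwise_matrix(edges, features, weight_fn, n_sp):
    rows, cols, vals = [], [], []
    for e, (i, j) in enumerate(edges):
        w = np.sqrt(weight_fn(features[i], features[j]))  # so ||Sx||^2 = sum e_ij (x_i - x_j)^2
        rows += [e, e]
        cols += [i, j]
        vals += [w, -w]
    return sparse.coo_matrix((vals, (rows, cols)), shape=(len(edges), n_sp)).tocsr()

# Usage with assumed variables:
# S = build_pairwise_matrix(edges, mean_colour,
#                           lambda a, b: gaussian_edge_weight(a, b, sigma_d), n_sp)
```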

4.2.3 Surface-normal pairwise potential

The mathematical formulation of this potential is similar to that of the colour potential; however, the surface-normal potential considers surface-normal similarities instead of colour. The weighting factors \(nr_{i, j}\) are formulated using the cosine similarity, a measure of the similarity between two nonzero vectors of an inner product space given by the cosine of the angle between them. The cosine of 0 is 1, and it is less than 1 for any angle in the interval \((0, \pi ]\) radians. It is thus a measure of orientation rather than magnitude [60]. The cosine of the angle between two nonzero vectors can be obtained from the Euclidean dot product formula:

$$\begin{aligned} {\textbf{A}} \cdot {\textbf{B}}=\Vert {\textbf{A}}\Vert \Vert {\textbf{B}}\Vert \cos \theta \end{aligned}$$

Therefore, the cosine similarity can be expressed by

$$\begin{aligned} \cos (\theta )=\frac{{\textbf{A}} \cdot {\textbf{B}}}{\Vert {\textbf{A}}\Vert \Vert {\textbf{B}}\Vert } = \frac{\sum _{t=1}^{n} A_{t} B_{t}}{\sqrt{\sum _{t=1}^{n} A_{t}^{2}} \sqrt{\sum _{t=1}^{n} B_{t}^{2}}} \end{aligned}$$

where \(A_t\) and \(B_t\) are the components of vectors A and B, respectively. Finally, we define our surface-normal potential by the following equations:

$$\begin{aligned} \Psi ^{n}({\textbf{x}}, \textbf{In})= & {} \sum _{i} \sum _{j \in {\mathscr {N}}(i)} nr_{i, j}\left( x_{i}-x_{j}\right) ^{2}\\ \Psi ^{n}({\textbf{x}}, \textbf{In})= & {} \Vert {\textbf{P}} {\textbf{x}}\Vert ^{2}\\ nr_{i, j}= & {} \frac{\sum _{t=1}^{n} In_{i,t}\, In_{j,t}}{\sqrt{\sum _{t=1}^{n} In_{i,t}^{2}} \sqrt{\sum _{t=1}^{n} In_{j,t}^{2}}} \end{aligned}$$
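A short sketch of the corresponding weight computation, assuming `normal[s]` holds a surface-normal vector per superpixel estimated from the sparse depth samples; the names are our own.

```python
# Minimal sketch: cosine-similarity weight between the surface normals of two
# neighbouring superpixels; `normal[s]` is assumed to be a 3-vector per superpixel.
import numpy as np

def normal_edge_weight(n_i, n_j, eps=1e-8):
    cos_sim = np.dot(n_i, n_j) / (np.linalg.norm(n_i) * np.linalg.norm(n_j) + eps)
    # Clamp to [0, 1] so the value can serve directly as a non-negative edge weight.
    return float(max(cos_sim, 0.0))

# P = build_pairwise_matrix(edges, normal, normal_edge_weight, n_sp)
```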

4.2.4 Depth pairwise potential

This pairwise potential encodes a smoothness prior over depth estimates which encourages neighbouring superpixels in the image to have similar depths. Usually, pairwise potentials are related only to the colour difference between pairs of superpixels. However, depth smoothness is a valid hypothesis which can potentially enhance depth inference. To enforce depth smoothness, we adopt a distance-aware Potts model: neighbouring points with smaller distances are considered more likely to have the same depth. The mathematical formulation of this potential is similar to that of the colour pairwise potential:

$$\begin{aligned} \Psi ^{d}({\textbf{x}}, {\textbf{D}})=\sum _{i} \sum _{j \in {\mathscr {N}}(i)} dp_{i, j}\left( x_{i}-x_{j}\right) ^{2} \end{aligned}$$

and the weighting factor \(dp_{i, j}\) for this case is formulated as

$$\begin{aligned} dp_{i, j}=\exp \left( -\frac{\left\| {\textbf{p}}_{i} -{\textbf{p}}_{j}\right\| ^{2}}{\sigma _{p}^{2}}\right) \end{aligned}$$

where \({\textbf{p}}_{i}\) is the 3D location vector of the LiDAR point i and \(\sigma _{p}\) is a parameter controlling the strength of enforcing close points to have similar depth values.
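Because this weight has the same Gaussian form as the colour affinity, the earlier sketch can be reused directly; here `centroid3d[s]` is assumed to be the mean 3D position of the LiDAR points falling in superpixel s (an illustrative name, not the paper's).

```python
# Reusing the earlier sketch with assumed variable names.
D = build_pairwise_matrix(edges, centroid3d,
                          lambda a, b: gaussian_edge_weight(a, b, sigma_p), n_sp)
```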

4.2.5 Uncertainty potential

Depth uncertainty estimation is important for refining depth estimates [16, 65] and in safety-critical systems [28]. It allows an agent to identify unknowns in an environment in order to reach optimal decisions. Our method provides uncertainties for the pixel-wise depth estimates by taking into account the number of LiDAR points present in each superpixel. The uncertainty potential is similar to the unary potential: it is constructed from the number of LiDAR points projected onto each superpixel and employs the following least-squares loss:

$$\begin{aligned} U^{c}({\textbf{x}}, {\textbf{y}})= & {} \sum _{i \in {\mathscr {L}}} \sigma _{i}\left( x_{i}-unc_{i}\right) ^{2}\\ U^{c}({\textbf{x}}, {\textbf{y}})= & {} \Vert {\textbf{W}} ({\textbf{x}}-\textbf{unc})\Vert ^{2} \end{aligned}$$

where \(\textbf{unc}\) is defined as follows:

$$\begin{aligned} \textbf{unc}_{i}= \left\{ \begin{array}{ll} {\sigma _{i}} &{} \quad {\text {if no point } P \text { is projected onto } SPx_{i}}\\ {\psi _{i}}&{} \quad {\text {if exactly one point } P \text { is projected onto } SPx_{i}}\\ {mean} &{} \quad {\text {otherwise}}\end{array}\right. \end{aligned}$$

where P is a 3D LiDAR point and \(SPx_{i}\) denotes the i-th superpixel. In locations with accurate and sufficiently many LiDAR points, the model produces depth predictions with high confidence. This uncertainty estimate measures how confident the model is about each depth estimate. The result is better overall performance, since estimates with high uncertainty can be discarded by the higher-level tasks that use the estimated depth maps as input.
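A possible realization of this rule, reusing the per-superpixel LiDAR counts computed for the unary potential; `sigma_unc`, `psi` and `unc_mean` are illustrative parameters standing in for \(\sigma _{i}\), \(\psi _{i}\) and the "mean" case above, not the authors' exact values.

```python
# Minimal sketch: per-superpixel uncertainty targets from LiDAR point counts.
# `counts` comes from build_unary above; parameter names are illustrative.
import numpy as np

def build_uncertainty(counts, sigma_unc, psi, unc_mean):
    unc = np.empty(len(counts), dtype=float)
    unc[counts == 0] = sigma_unc   # no LiDAR return in the superpixel
    unc[counts == 1] = psi         # a single LiDAR return
    unc[counts >= 2] = unc_mean    # two or more returns ("mean" case)
    return unc
```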

4.3 Optimization

With the unary and the pairwise potentials defined, we can now write the energy function as

$$\begin{aligned} E({\textbf{X}}, {\textbf{Y}})= \alpha \, \Phi ({\textbf{x}}, {\textbf{y}}) + \beta \, \Psi ^{c}({\textbf{x}}, {\textbf{I}}) + \gamma \, \Psi ^{n}({\textbf{x}}, \textbf{In}) + \delta \, \Psi ^{d}({\textbf{x}}, {\textbf{D}}) \end{aligned}$$
(1)

The scalars \(\alpha \), \(\beta \), \(\gamma \), \(\delta \) \(\in \) [0,1] are weightings for the four terms. We may further expand the unary and pairwise potentials to

$$\begin{aligned}{} & {} \Phi ({\textbf{x}}, {\textbf{y}})= \alpha ({\textbf{x}}^{\textrm{T}} {\textbf{W}}^{\textrm{T}} {\textbf{W}} {\textbf{x}}-2 {\textbf{y}}^{\textrm{T}} {\textbf{W}}^{\textrm{T}} {\textbf{W}} {\textbf{x}}+{\textbf{y}}^{\textrm{T}} {\textbf{W}}^{\textrm{T}} {\textbf{W}} {\textbf{y}}) \end{aligned}$$
(2)
$$\begin{aligned}{} & {} \Psi ^{c}({\textbf{x}}, {\textbf{I}}) = \beta ({\textbf{x}}^{\textrm{T}} {\textbf{S}}^{\textrm{T}} {\textbf{S}} {\textbf{x}}) \end{aligned}$$
(3)
$$\begin{aligned}{} & {} \Psi ^{n}({\textbf{x}}, \textbf{In})=\gamma ({\textbf{x}}^{\textrm{T}} {\textbf{P}}^{\textrm{T}} {\textbf{P}} {\textbf{x}}) \end{aligned}$$
(4)
$$\begin{aligned}{} & {} \Psi ^{d}({\textbf{x}}, {\textbf{D}})=\delta ({\textbf{x}}^{\textrm{T}} {\textbf{D}}^{\textrm{T}} {\textbf{D}} {\textbf{x}}) \end{aligned}$$
(5)

We shall pose the problem as one of finding the optimal range vector \({\textbf{x}}^{*}\) such that:

$$\begin{aligned} {\textbf{x}}^{*}=\underset{{\textbf{x}}}{{\text {argmin}}}\left\{ E({\textbf{X}}, {\textbf{Y}}) \right\} \end{aligned}$$

Substituting equations 2, 3, 4 and 5 into equation 1, taking the derivative of the energy with respect to \({\textbf{x}}\) and setting it to zero reduces the problem to the linear system \( \textbf{A x}={\textbf{b}}\), where

$$\begin{aligned} {\textbf{A}}= & {} \alpha ({\textbf{W}}^{\textrm{T}}{\textbf{W}})+ \beta ({\textbf{S}}^{\textrm{T}}{\textbf{S}}) + \gamma ({\textbf{P}}^{\textrm{T}}{\textbf{P}}) + \delta ({\textbf{D}}^{\textrm{T}}{\textbf{D}})\\ {\textbf{b}}= & {} \alpha ({\textbf{W}}^{\textrm{T}} {\textbf{W}} {\textbf{y}}) \end{aligned}$$

All we need to do to perform the optimization is solve a large sparse linear system. Methods for solving sparse systems fall into two categories: direct and iterative. Direct methods are robust but require large amounts of memory as the size of the problem grows. Iterative methods, on the other hand, offer better performance but may exhibit numerical problems [10, 20]. In the present paper, the conjugate gradient squared algorithm, a fast variant of the conjugate gradient method of Hestenes and Stiefel [25, 64], is employed to solve the energy minimization problem.
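Putting the pieces together, the following sketch assembles A and b from the matrices built in the earlier snippets and solves the sparse system with SciPy's conjugate gradient solver (`scipy.sparse.linalg.cg`); the weights and all variable names are illustrative, and a CGS-type solver could be substituted via `scipy.sparse.linalg.cgs`.

```python
# Minimal sketch: assemble the normal equations A x = b and solve them with a
# conjugate gradient solver. W, S, P, D, y come from the earlier sketches;
# the weights alpha..delta are illustrative hyperparameters.
from scipy.sparse.linalg import cg

def solve_depth(W, S, P, D, y, alpha=1.0, beta=0.5, gamma=0.5, delta=0.5):
    A = (alpha * (W.T @ W) + beta * (S.T @ S)
         + gamma * (P.T @ P) + delta * (D.T @ D)).tocsr()
    b = alpha * (W.T @ (W @ y))
    x, info = cg(A, b)           # info == 0 signals convergence
    if info != 0:
        raise RuntimeError("conjugate gradient did not converge")
    return x                     # per-superpixel depth estimates
```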

4.4 Pseudo-code

Algorithm 1 provides the complete pseudo-code for our proposed framework, illustrated earlier in Fig. 5. In this algorithm, lines 1 to 5 perform the preprocessing, which includes gathering the multi-modal raw data, building a connection graph between pairs of adjacent superpixels and projecting the clustered LiDAR points onto the image space. Lines 6 to 11 constitute the core of the approach: they construct the cost function from the different potentials (unary and pairwise) to obtain the complete CRF for depth estimation. The objective of the pairwise potentials is to smooth the depth regressed from the unary part based on the neighbouring superpixels. The pairwise potential functions are based on standard CRF vertex and edge feature functions, studied extensively in [52] and other works. Our model uses both the content information of the superpixels and the relations between them to infer the depth.

Algorithm 1

UAOFusion Network: A multi-modal CRF-based method for Camera–LiDAR depth estimation
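The algorithm listing itself is only available as a figure, so the following outline is our own high-level reconstruction from the textual description above (not the authors' exact pseudo-code); it simply chains the sketches from the previous subsections, and all names are assumptions.

```python
# High-level reconstruction of Algorithm 1 from the textual description; assumed names.
import numpy as np

def crf_depth_estimation(rgb, lidar_uvz, normals, params):
    """`lidar_uvz` holds LiDAR points already projected onto the image (Sect. 3)
    and `normals[s]` a surface normal per superpixel."""
    # Preprocessing (Algorithm 1, lines 1-5): superpixel graph and per-superpixel features.
    labels, centroids, edges = superpixel_neighbour_graph(rgb, params["n_segments"])
    n_sp = labels.max() + 1
    mean_colour = np.array([rgb[labels == s].mean(axis=0) for s in range(n_sp)])

    # Core (lines 6-11): potentials, energy minimization and uncertainty.
    W, y, counts = build_unary(labels, lidar_uvz, params["sigma_unary"])
    S = build_pairwise_matrix(edges, mean_colour,
                              lambda a, b: gaussian_edge_weight(a, b, params["sigma_d"]), n_sp)
    P = build_pairwise_matrix(edges, normals, normal_edge_weight, n_sp)
    # Note: the paper uses the 3D LiDAR point locations here; image-plane
    # centroids are used in this sketch for brevity.
    D = build_pairwise_matrix(edges, centroids,
                              lambda a, b: gaussian_edge_weight(a, b, params["sigma_p"]), n_sp)
    x = solve_depth(W, S, P, D, y, params["alpha"], params["beta"],
                    params["gamma"], params["delta"])
    unc = build_uncertainty(counts, params["sigma_unc"], params["psi"], params["unc_mean"])
    # Dense maps: each pixel takes its superpixel's depth and uncertainty.
    return x[labels], unc[labels]
```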

5 Results and discussion

We evaluate our approach on the raw sequences of the KITTI benchmark, a popular dataset for single-image depth map prediction. The sequences contain stereo imagery taken from a car driving in an urban scenario. The dataset also provides 3D laser measurements from a Velodyne laser scanner, which we use as ground truth (projected into the stereo images using the intrinsics and extrinsics given in KITTI). This dataset has been used to train and evaluate the state-of-the-art methods and allows quantitative comparisons. First, we evaluate the prediction accuracy of our proposed method with different potentials in Sect. 5.2. Second, in Sect. 5.3 we explore the impact of the number of sparse depth samples and the number of superpixels on the depth estimation. Third, Sect. 5.5 compares our approach to state-of-the-art methods on the KITTI dataset. Lastly, in Sects. 5.6 and 5.7, we demonstrate two use cases of our proposed algorithm: one creating LiDAR super-resolution from sensor data provided by the KITTI dataset, and another using a dataset collected in the context of this work.

5.1 Evaluation metrics

We evaluate the accuracy of our method in depth prediction using the 3D laser ground truth on the test images. We use the following depth evaluation metrics: root-mean-squared error (RMSE), mean absolute error (MAE) and mean absolute relative error (REL). Among these, RMSE is the most important indicator and is chosen to rank submissions on the leader-board, since it measures error directly on depth and penalizes errors at larger distances, where depth measurement is more challenging. These metrics were also used by [13, 17, 30, 53] to estimate the accuracy of monocular depth prediction.

$$\begin{aligned} RMSE= & {} \sqrt{\frac{1}{|T|} \sum _{d \in T}\Vert {\hat{d}}-d\Vert ^{2}}\\ MAE= & {} \frac{1}{|T|} \sum _{d \in T}\left| {\hat{d}}-d\right| \\ REL= & {} \frac{1}{|T|} \sum _{d \in T}\left( \frac{\left| {\hat{d}} -d\right| }{{\hat{d}}}\right) \end{aligned}$$

Here, d is the ground truth depth, \({\hat{d}}\) is the estimated depth, and T denotes the set of all points in the test set images. In order to compare our results with those of Eigen et al. [13] and Godard et al. [21], we crop our image to the evaluation crop applied by Eigen et al. We also use the same resolution of the ground truth depth image and cap the predicted depth at 80 m [21].
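For reference, a straightforward implementation of these metrics under the stated 80 m cap is sketched below; array names are our own, and the relative error follows the definition given above.

```python
# Minimal sketch: RMSE, MAE and REL over valid ground-truth pixels, with the
# predicted depth capped at 80 m as described above. Array names are illustrative.
import numpy as np

def depth_metrics(pred, gt, cap=80.0):
    valid = gt > 0                       # KITTI ground truth is sparse; 0 marks missing depth
    d_hat = np.clip(pred[valid], None, cap)
    d = gt[valid]
    rmse = np.sqrt(np.mean((d_hat - d) ** 2))
    mae = np.mean(np.abs(d_hat - d))
    rel = np.mean(np.abs(d_hat - d) / d_hat)   # relative error as defined in the text
    return rmse, mae, rel
```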

5.2 Architecture evaluation

This section presents an empirical study of how different choices of potential functions and hyperparameters affect the accuracy of the depth prediction. In the first experiment, we compare the impact of sequentially adding our proposed pairwise potentials. We first evaluate a model with only unary and colour pairwise potentials. Then, we add the surface-normal pairwise potential and, finally, the depth pairwise potential. As shown in Table 1, the RMSE improves after adding each pairwise potential.

Table 1 Depth completion errors [mm] after adding pairwise potentials (lower is better)
Fig. 6

Qualitative evaluation of the impact of the pairwise potentials defined as CRF terms. In row order: first, pairwise potential I penalizes dissimilar depth estimates of neighbouring pixels which have similar colours in the RGB image; second, pairwise potential II penalizes depth differences between neighbouring superpixels whose surface-normal vectors have large cosine similarities; and third, pairwise potential III penalizes neighbouring superpixels with large observed depth differences

5.3 The number of superpixels

In this section, we explore how the prediction accuracy relates to the number of available depth samples and the number of superpixels.

Fig. 7

Visual comparison of dense depth maps produced by the CRF framework when varying the number of superpixels. From top to bottom: 1200, 2400 and 5500 superpixels

As displayed in Table 2, a greater number of superpixels yields lower error measurements. Although a larger number of sparse depth observations improves the quality of the depth map, the performance converges when the number of superpixels exceeds 5000, which is about 1.5% of the total number of pixels. We ran an exhaustive evaluation of our method for different numbers of superpixels. Figure 8 clearly shows that our method's error decreases as the number of superpixels increases.

Table 2 Depth completion errors [mm] for different numbers of superpixels (lower is better)
Fig. 8

Convergence for different numbers of superpixels

5.4 Sub-sampling 3D depth points

We performed a quantitative analysis of the impact of 3D point sparsity on the error of our proposed method by decreasing the number of 3D points considered during inference. As shown in Fig. 9, our method's error decreases as the number of 3D depth points increases, enabling a better dense depth map estimation. From Fig. 9, we can argue that the number of 3D depth points is critical for an accurate depth map estimation. Even though our estimation error increases with the sparsity of the depth observations, our method manages to provide state-of-the-art performance even when the depth observations are sampled down to 40%.

Fig. 9

Convergence for different numbers of 3D depth points projected onto the 2D RGB images. The number of 3D depth points used is expressed as a percentage (\(\%\)); 100% means that all the 3D depth points projected onto the 2D image are used

5.5 Algorithm evaluation for depth completion

The KITTI dataset is more challenging for depth estimation than other datasets, e.g. the NYU-Depth-V2 dataset, because the sensed distances are considerably larger. The performance of our method and of other existing methods on the KITTI dataset is shown in Table 3, which shows that the proposed method outperforms other depth map estimation approaches that are well accepted in the robotics community. Our model relies on the number of superpixels and on the resolution of the input data sources; its performance therefore increases with the number of superpixels, the image resolution and the density of the LiDAR data.

Table 3 Depth completion errors [mm] by different methods on the test set of KITTI depth completion benchmark (lower is better)
Fig. 10

Depth completion and uncertainty estimates of our approach on the KITTI raw test set. From top to bottom: RGB and raw depth projected onto the image; high-resolution depth map; raw uncertainty; and estimated uncertainty map

5.6 Algorithm evaluation for LiDAR super-resolution

We present another demonstration of our method: super-resolution of LiDAR measurements. 3D LiDARs have a low vertical angular resolution and thus generate vertically sparse point clouds. We use all measurements in the sparse depth image together with the RGB image as input to our framework. An example is shown in Fig. 11: the cars are much more recognizable in the prediction than in the raw scans.

Fig. 11

LiDAR super-resolution. Creating dense point clouds from sparse raw measurements. From top to bottom: RGB image, raw depth map, predicted depth and ground truth depth map. Distant cars are almost invisible in the raw depth map, but are easily recognizable in the predicted depth map

Starting from a LiDAR super-resolution map, we can also generate a 3D reconstruction of the scene. The reconstruction of three-dimensional (3D) scenes has many important applications, such as autonomous navigation [24], environmental monitoring [46] and other computer vision tasks [26]. A dense and accurate model of the environment is therefore crucial for autonomous vehicles; in fact, imprecise representations of the vehicle's surroundings may lead to unexpected situations that could endanger the passengers. In this paper, the 3D model is generated from a combination of image and range data, a sensor fusion approach that exploits the strengths of each modality to overcome the limitations of the other: images normally have higher resolution and richer visual information than range data, whereas range data are noisy, sparse and contain less visual information but directly provide 3D structure. The qualitative and quantitative results presented here suggest that our system provides 3D reconstructions of reasonable quality. Following [35, 41], similar techniques could be explored as a way of improving our system's robustness to challenging environmental conditions.