
1 Introduction

Estimating surface geometry is a fundamental problem in understanding the properties of an object and reconstructing its 3D shape. There are two main approaches: geometric methods, such as structure-from-motion and multi-view stereo, and photometric methods, such as photometric stereo and shape-from-shading. Geometric methods are usually suited to metric reconstruction, while photometric methods are effective at estimating accurate per-pixel surface geometry.

Recently, with the widespread use of commercial depth sensors, e.g., Kinect and RealSense, many methods have been proposed to enhance the depth quality of these sensors by fusing the photometric cues of the color image [1, 2] or the near infrared (NIR) image [3, 4]. Although these methods have proven their effectiveness in photometric shape estimation and have provided promising results, they depend heavily on the sensors and usually incur heavy computational cost.

On the other hand, deep convolutional neural networks (CNNs) have been broadly used for various computer vision tasks such as image classification […].

The major contributions of our work are as follows:

• The first work to analyze the relationship between an NIR image and its surface normal using a deep learning framework.

• Fine-scale surface normal estimation from a single NIR image, without requiring calibration of the light direction.

• A suitably designed objective function for reconstructing the fine details of a target object's surface.

Fig. 1. Comparison of reconstruction results. Left: input NIR image; middle: our reconstruction from a single NIR image; right: ground-truth reconstruction using NIR images captured under 12 different lighting directions.

2 Related Work

Photometric Stereo and Shape from Shading. Photometric stereo [15] is one of the well-studied methods for estimating surface normals. Given at least three images captured under different lighting directions, photometric stereo can determine a unique set of surface normals for an object. Using more images further improves accuracy, since the problem becomes over-determined.

Shape from shading (SfS) is a special case of photometric stereo that predicts shape from a single image. However, it is an ill-posed problem and requires many restrictions and constraints [16, 17]. Beginning with numerical SfS methods [18], many works have shown results based on the Lambertian BRDF assumption. Tsai et al. [19] use a discrete approximation of surface normals. Lee and Kuo [20] estimate shape by using a triangular element surface model. We refer readers to [21] for a comparison and evaluation of the classical SfS methods.

Shape from an NIR image has recently been studied in several works [3, 14]. These works analyze the discriminative characteristics of NIR images and experimentally show the albedo (surface reflectance) simplicity of various materials in the NIR wavelength. In [3, 4], shape refinement methods using the photometric cues in NIR images are proposed. They achieve high-quality shape recovery; however, they require an additional depth camera to obtain their results.

Although many conventional photometric approaches can work on NIR images, and the albedo simplicity of NIR images actually helps robust estimation, estimating surface normals from a single NIR image still has many limitations for practical use, such as heavy computation time, heuristic assumptions, special system configurations, and the need to calibrate the light direction. To overcome these limitations, we study the mapping from NIR intensity distributions to surface normal vectors via a deep CNN framework. We combine a GAN [22] with a specially designed objective function. Through the adversarial training process, our network naturally encodes the photometric cues of a scene and produces fine surface normals.

Data-Driven Shape Estimation. There have been various studies on estimating the shape information from images via data-driven approaches. Saxena et al. [23] estimate depths using a discriminatively trained MRF model with multiple scales of monocular cues. Hoiem et al. [24] reconstruct rough surface orientations of a scene by statistically modeling categories of coarse structures (e.g., ground, sky and vertical). Ladicky et al. [25] incorporate semantic labels of a scene to predict better depth outputs.

One of the emerging directions for shape estimation is using deep CNNs. In [26], Fouhey et al. try to discover the right primitives in a scene. In [13], Wang et al. explore the effectiveness of CNNs for the task of surface normal estimation. Although this work infers surface normals from a single color image, it outputs scene-level rough geometries and is not suitable for object-level detailed surface reconstruction. To estimate the object shape and the material property, Rematas et al. […].

3.1 Generative Adversarial Networks

A generative adversarial network (GAN) [22] consists of a generative network G with parameters \(\theta _G\) and a discriminative network D with parameters \(\theta _D\), which are trained in alternation. The discriminative network is trained to distinguish real images from generated ones; with \(\theta _D\) fixed, the generative parameters \(\theta _G\) are then trained to produce better-quality images, which could be misclassified by the discriminative network as real images. These procedures are repeated until convergence. This minimax objective is denoted as:

$$\begin{aligned} \min _{\theta _G}\max _{\theta _D} \; \mathbb {E}_{I\sim D_{desire}}[\log D( I )] + \mathbb {E}_{Z\sim D_{input}}[\log (1 - D(F) )] \end{aligned}$$
(1)

where \(D_{desire}\) is the distribution of images that we desire to estimate, \(D_{input}\) is that of the input domain, and \(F = G(Z)\) is the generated output. This objective function encourages D to assign the correct label to both real and generated images, and encourages G to generate a realistic output F from an input Z. In our method, both the generative and the discriminative models are based on convolutional networks. The former takes a single NIR image as input and produces a three-channel normal map as output. The latter classifies an input using the binary cross-entropy, assigning a high probability when the input comes from the training data.
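To make the two terms of Eq. (1) concrete, the following minimal sketch evaluates the minimax value for one mini-batch. It is an illustrative PyTorch fragment rather than the paper's TensorFlow implementation; the function name and the assumption that D and G are callables returning probabilities and normal maps are ours.

```python
import torch

def minimax_value(D, G, real_imgs, nir_patches):
    """Value of the minimax game in Eq. (1) for one batch (illustrative sketch).

    D outputs the probability that its input comes from the desired
    distribution D_desire; F = G(Z) is the generated normal map.
    D is trained to maximize this value while G minimizes it.
    """
    fake = G(nir_patches)          # F = G(Z), Z ~ D_input
    d_real = D(real_imgs)          # D(I), I ~ D_desire
    d_fake = D(fake)               # D(F)
    return torch.log(d_real).mean() + torch.log(1.0 - d_fake).mean()
```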

Fig. 2. Our network architecture. The proposed network produces a surface normal map from a single NIR image. The generative model reconstructs the surface normal map, and the discriminative network predicts the probability that the surface normal map comes from the training data rather than from the generative model.

3.2 Deep Shape from Shading

Based on the generative adversarial network explained in Sect. 3.1, we modify the GAN model to suit the shape-from-shading problem. Since shape from shading is an ill-posed problem, it is important to incorporate proper constraints to uniquely determine the right solution. Therefore, we combine an angular error and an integrability loss, which have been shown to be effective in many conventional SfS methods, into the objective function of the generative network. Existing GAN approaches typically take a random noise vector [22], a pre-encoded vector [30], or an image [31, 32] as the input of their generative networks, and each generative model produces an output that lies in the same domain as its input. In this work, the generative model produces a three-dimensional normal map from an NIR image, so the input and output lie in different domains. Compared to conventional SfS methods, we do not need to calibrate the lighting directions. To the best of our knowledge, our work is the first application of adversarial training to estimating fine-scale geometry from a single NIR image.

Generative Networks. We use a fully convolutional network to construct the generative network. This type of convolutional model was recently adopted for image restoration [33, 34] and was shown to perform well in that task. To keep the image sizes of the input and output identical, we pad zeros before the convolution operations. Through our experiments, we found that this strategy works well in reconstructing the normal map.

Our network architecture is depicted in Fig. 2. We feed a 64\(\,\times \,\)64 NIR patch to the generative network as an input. The network consists of 5 convolution layers (128-256-256-128-3 filters, respectively), each followed by a ReLU except the last layer. Since the generative network is fully convolutional, the output of the network has the same size as the input NIR image. We empirically determined the number and sizes of filters for all networks.
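Below is a minimal PyTorch sketch of such a generator (the paper's implementation is in TensorFlow). The filter counts (128-256-256-128-3), zero padding, ReLU activations, and the final hyperbolic tangent (Sect. 4.2) follow the text; the 3x3 kernel size and the single-channel NIR input are assumptions, since these details are not reported.

```python
import torch.nn as nn

def build_generator(kernel_size=3):
    """Fully convolutional generator: 1-channel NIR patch -> 3-channel normal map."""
    pad = kernel_size // 2                 # zero padding keeps the spatial size constant
    chans = [1, 128, 256, 256, 128, 3]     # 5 convolution layers
    layers = []
    for i in range(len(chans) - 1):
        layers.append(nn.Conv2d(chans[i], chans[i + 1], kernel_size, padding=pad))
        if i < len(chans) - 2:             # ReLU after every layer except the last
            layers.append(nn.ReLU(inplace=True))
    layers.append(nn.Tanh())               # outputs in [-1, 1], matching the data normalization
    return nn.Sequential(*layers)
```

Because every layer preserves the spatial size, the same network can be applied to full-resolution NIR images at evaluation time (Sect. 4.3).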

Discriminative Networks. Given the output of the generative network, a typical choice of objective function is the averaged \(L_1\) or \(L_2\) distance between the ground truth and the generated output. However, such a choice has limitations for our problem. The \(L_2\) distance produces blurry predictions because it assumes that errors follow a Gaussian distribution. With the \(L_1\) distance, this effect is diminished, but the estimated images tend toward the median of the set of equally likely intensities. We therefore add the discriminative network as a loss term alongside the distance metric. Recently, [31] showed that combining distance, gradient, and discriminative networks in the loss function yields realistic and accurate output. Our discriminative model uses a binary cross-entropy loss that assigns a high probability when the input is a real image, and vice versa.

3.3 Training

We now explain how we iteratively train the generative model G and the discriminative model D. Let us consider a single NIR image \(Z \in \{Z_1,Z_2,\ldots ,Z_N\}\) from the training dataset and the corresponding ground-truth normal map \(Y \in \{Y_1,Y_2,\ldots ,Y_N\}\). The training dataset covers various objects captured under diverse lighting directions, and we uniformly sample images from the dataset to keep the lighting directions balanced.

We basically follow the procedure of [30]. Given the N paired images, we first train D to classify a real pair \((Z, Y)\) into class 1 and a generated pair \((Z, G(Z))\) into class 0. In this step, we fix the parameters \(\theta _G\) of the generative network G and solely update the parameters \(\theta _D\) of D. The objective function of the discriminative model is denoted as:

$$\begin{aligned} \mathcal {L}_{D} (Z,Y) = \sum _{i=1}^{N}~\mathcal {D}_{bce}(Y_i,1) + \mathcal {D}_{bce}(G(Z_i),0), \end{aligned}$$
(2)

where \(\mathcal {D}_{bce}\) is the binary cross-entropy, defined as

$$\begin{aligned} \mathcal {D}_{bce}(X,C) = - \left[ C \log D(X) + (1-C)\log (1-D(X)) \right] , \end{aligned}$$
(3)

where X is the input image and \(C \in \{0,1\}\) is its binary class label. We minimize this objective function so that the network outputs high probability scores for real images \(Y_i\) and low probability scores for generated images \(G(Z_i)\).
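A hedged PyTorch sketch of this discriminator update follows (the authors' code is in TensorFlow). Per Eq. (2), D is applied to the normal maps directly; D is assumed to output a probability, and optimizer_D is an optimizer over \(\theta _D\) only.

```python
import torch
import torch.nn.functional as F

def discriminator_step(D, G, optimizer_D, Z, Y):
    """One update of D with the generator parameters theta_G fixed (Eqs. 2-3)."""
    with torch.no_grad():                  # theta_G is not updated in this step
        fake = G(Z)                        # generated normal maps G(Z_i)
    p_real, p_fake = D(Y), D(fake)
    loss_D = (F.binary_cross_entropy(p_real, torch.ones_like(p_real)) +
              F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake)))
    optimizer_D.zero_grad()
    loss_D.backward()
    optimizer_D.step()
    return loss_D.item()
```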

After that, we keep the parameters of D fixed and train the generative model G. Many previous deep-learning-based image restoration and generation methods [33, 35] use a mean squared error (MSE) loss between the ground-truth and output images. However, as studied in conventional SfS works, estimating accurate surface normal maps requires minimizing angular errors, and the output normals should satisfy the integrability constraint. Therefore, we modify the objective function of the GAN model to incorporate these photometric objectives. With these objectives, we can effectively reduce the angular error and estimate physically meaningful surface normals.

Specifically, to evaluate surface normals properly, we define the objective function of our generative network as:

$$\begin{aligned} \mathcal {L}_{G} (Z,Y) = \sum _{i=1}^{N}~\mathcal {D}_{bce}(G(Z_i),1) + \lambda _{p}\mathcal {L}_{p}(Y_i,G(Z_i)) + \lambda _{ang}\mathcal {L}_{ang}(Y_i,G(Z_i)) + \lambda _{curl}\mathcal {L}_{curl}(G(Z_i)). \end{aligned}$$
(4)

Following the conventional \(L_1\) or \(L_2\) loss, the estimated normal map difference \(\mathcal {L}_{p}\) is denoted as:

$$\begin{aligned} \mathcal {L}_{p}(Y,G(Z)) =||Y-G(Z)||_p^p \end{aligned}$$
(5)

where \(p=1\) or \(p=2\).

The angular error is often used in conventional photometric approaches to measure the accuracy of estimated normals, because it is a more physically meaningful error than the direct normal map difference. To minimize the angular error, we normalize both the estimated normals G(Z) and the ground-truth normals Y, and then take the dot product between them:

$$\begin{aligned} \mathcal {L}_{ang}(Y,G(Z)) =1 - \langle Y, G(Z) \rangle = 1 - \frac{Y^TG(Z)}{||Y|| ||G(Z)||} \end{aligned}$$
(6)
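Assuming normal maps stored as (B, 3, H, W) tensors, Eq. (6) can be computed per pixel and averaged as in the following PyTorch sketch (illustrative only; the function name is ours).

```python
import torch
import torch.nn.functional as F

def angular_loss(Y, G_Z, eps=1e-8):
    """Mean angular loss of Eq. (6): one minus the cosine of the angle between normals."""
    Y_n = F.normalize(Y, dim=1, eps=eps)      # unit ground-truth normals
    G_n = F.normalize(G_Z, dim=1, eps=eps)    # unit estimated normals
    cos = (Y_n * G_n).sum(dim=1)              # per-pixel dot product
    return (1.0 - cos).mean()
```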

The angular error provides a physically meaningful measure; however, it is averaged over all surface normals. To encourage the generative network to estimate photometrically correct surface normals, we also add an integrability constraint over local neighborhoods to the objective function, denoted as:

$$\begin{aligned} \mathcal {L}_{curl} = ||\nabla \times G(Z)||. \end{aligned}$$
(7)

The integrability constraint enforces that the integral of the surface gradients (derived from the normals) around a local closed loop must be zero, meaning that following the loop returns to the same height. It prevents drastic changes and ensures that the estimated normals lie on the same surface within a local region.
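One common way to discretize this condition is to penalize the curl of the surface gradient field derived from the normals. The sketch below uses simple forward differences, which is a simplification of the paper's 5x5 sliding-window scheme (Sect. 4.2); it assumes camera-facing normals with a positive z component.

```python
import torch

def integrability_loss(normals, eps=1e-6):
    """Simplified integrability (curl) penalty in the spirit of Eq. (7).

    normals: (B, 3, H, W) estimated normal maps. The surface gradients
    p = -n_x/n_z and q = -n_y/n_z must satisfy dp/dy = dq/dx on an
    integrable surface.
    """
    nx, ny, nz = normals[:, 0], normals[:, 1], normals[:, 2]
    p = -nx / (nz + eps)
    q = -ny / (nz + eps)
    dp_dy = p[:, 1:, :-1] - p[:, :-1, :-1]    # finite difference along rows
    dq_dx = q[:, :-1, 1:] - q[:, :-1, :-1]    # finite difference along columns
    return (dp_dy - dq_dx).abs().mean()
```

In Eq. (4), this term is combined with the adversarial, \(\mathcal{L}_p\), and \(\mathcal{L}_{ang}\) terms using equal weights (Sect. 4.2).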

4 Experiment

4.1 Dataset

To apply a deep learning framework to our purpose, a good-quality dataset with numerous training examples is required. However, most existing datasets are not large enough to train the network and are often inadequate for our task. Recently, Choe et al. [14] released a new NIR benchmark dataset containing 101 real-world objects, such as fabrics, leaves, and paper, captured from 9 viewpoints and under 12 lighting directions.

We use pairs of NIR images as input and surface normal maps as ground-truth targets. For fine-scale refinement, we augment each NIR image into 12 patches (\(64\times 64\)) within a single ground truth. For training, we use images from 91 objects; the remaining objects form the validation and test datasets. Note that we uniformly sample validation and test samples according to the object category. When training the network, we normalize NIR images and normal maps to the range \([-1, 1]\).
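As a sketch of this preparation step (the random patch locations and the 8-bit image encoding are assumptions; the paper only specifies 12 patches of \(64\times 64\) per image and normalization to \([-1, 1]\)):

```python
import numpy as np

def make_training_patches(nir, normal_map, patch=64, n_patches=12, seed=0):
    """Extract paired 64x64 patches and normalize both to [-1, 1] (Sect. 4.1).

    nir: (H, W) NIR image and normal_map: (H, W, 3) normal map, both assumed
    to be stored as 8-bit images in [0, 255].
    """
    rng = np.random.default_rng(seed)
    H, W = nir.shape
    patches = []
    for _ in range(n_patches):
        y = int(rng.integers(0, H - patch + 1))
        x = int(rng.integers(0, W - patch + 1))
        z = nir[y:y + patch, x:x + patch].astype(np.float32) / 127.5 - 1.0
        n = normal_map[y:y + patch, x:x + patch].astype(np.float32) / 127.5 - 1.0
        patches.append((z, n))
    return patches
```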

Fig. 3. The dataset [14] contains various real-world objects captured under 12 different lighting directions and from 9 viewpoints. The leftmost image is the ground-truth normal map, and the others are NIR images from different lighting directions. The variety of lighting directions makes the same object appear vastly different.

Table 1. Network configuration.

4.2 Training Parameters

We provide the parameters used to train our proposed network. The configuration of the network is given in Table 1. Training uses batches of size 32. For initializing the weights, we use a Gaussian distribution with zero mean and a standard deviation of 0.02. We train all experiments using the Adam optimizer [36] with momentum \(\beta _1 = 0.5\). The learning rate starts from 0.0002 and is decreased by a factor of 0.95 every 5000 iterations. To match the normalization range of the targets, we place a hyperbolic tangent at the end of the generative network. Lastly, we use a \(5\times 5\) sliding window with a 3-pixel overlap to compute the integrability. In the optimization procedure, we use a combined loss function including the intensity (\(\mathcal{L}_p\)), angular (\(\mathcal{L}_{ang}\)), and integrability (\(\mathcal{L}_{curl}\)) terms. Note that we did not tune the weights of the individual loss terms and set them equal: \(\lambda _p = \lambda _{ang} = \lambda _{curl} = 1\).
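The sketch below mirrors these settings in PyTorch (the paper uses TensorFlow); \(\beta _2 = 0.999\) is the Adam default and an assumption here, since only \(\beta _1\) is reported.

```python
import torch
import torch.nn as nn

def init_weights(m):
    """Gaussian initialization with zero mean and std 0.02, as in Sect. 4.2."""
    if isinstance(m, nn.Conv2d):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

def make_optimizers(G, D):
    """Adam optimizers and step decay matching the reported hyper-parameters."""
    opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
    opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
    # learning rate decreased by a factor of 0.95 every 5000 iterations
    sched_G = torch.optim.lr_scheduler.StepLR(opt_G, step_size=5000, gamma=0.95)
    sched_D = torch.optim.lr_scheduler.StepLR(opt_D, step_size=5000, gamma=0.95)
    return opt_G, opt_D, sched_G, sched_D
```

The initializer would be applied as G.apply(init_weights) and D.apply(init_weights) before training.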

4.3 Experimental Result

We use TensorFlow to implement and train the proposed network. Since the proposed network is fully convolutional, we feed the entire NIR image at evaluation time. The computation time to estimate a surface normal map is about 2 s on a Titan X, whereas the conventional shape-from-shading method takes 10 min with a Matlab implementation.

Table 2. Quantitative evaluation. We validate each term of our cost function with various error measures.
Table 3. Quantitative evaluation on a detail map. In this evaluation, we subtract low-frequency geometry variations from the results to focus on fine-scale surface geometry.

Quantitative Analysis. For the quantitative evaluation, we first validate each term of our cost function. In this experiment, we tested our method using the 3rd lighting direction among the 12 lighting directions. To evaluate the performance of our method, we use three metrics: angular error, good pixel ratio, and intensity error. Table 2 reports all the quantitative errors. Compared to using only the intensity loss, adding the angular cost function improves the performance. This validates that our angular loss measures a physically meaningful error. The integrability term ensures the continuity of local normals. Although integrability is satisfied on most smooth surfaces, it does not guarantee a performance improvement on some non-smooth surfaces. In our experiments, the \(L_2 + L_{ang}\) loss function shows the best performance for the all-views case, and \(L_1 + L_{ang}\) achieves the lowest error for the center-view case. We compare our results with the conventional SfS method and verify that our framework performs competitively. We also compare our method with the deep CNN-based surface normal estimation method [12]. Although this method estimates surface normals, it is designed to reconstruct scene-level low-frequency geometry and is not suitable for our purpose. We also measure errors for the single view that provides the best performance. Since extreme viewing directions are saturated or under-exposed in some cases, measuring the error on this single view yields lower errors. We found that the estimated normal maps are distorted at extreme viewpoints (errors in low-frequency geometry). To evaluate the fine-scale (high-frequency) geometry, we define a detail map (M) based on the measure in [37]. This measure is computed as \(M = f(Y) + G(Z) - f(G(Z))\), where f is a smoothing function. Table 3 shows the results.
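The detail map can be computed as in the sketch below; the smoothing function f and its parameter are assumptions (a per-channel Gaussian filter is used here), since the paper does not specify them.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def detail_map(Y, G_Z, sigma=3.0):
    """Detail map M = f(Y) + G(Z) - f(G(Z)) used for the evaluation in Table 3.

    Y, G_Z: (H, W, 3) ground-truth and estimated normal maps.
    """
    def f(x):
        return np.stack([gaussian_filter(x[..., c], sigma)
                         for c in range(x.shape[-1])], axis=-1)
    # M keeps the low-frequency component of Y and the fine-scale detail of G(Z);
    # comparing M against Y then isolates errors in the high-frequency geometry.
    return f(Y) + G_Z - f(G_Z)
```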

Fig. 4. Qualitative results of surface normal estimation using the proposed network. From left to right: (a) input NIR images, (b) ground-truth normal maps, (c) normal maps from \(L_2\), (d) normal maps from \(L_2+L_{ang}\), (e) error maps of (d).

Fig. 5. Surface reconstruction results. From left to right: input, \(L_2\), \(L_2 + L_{ang}\), \(L_2 + L_{ang} + L_{curl}\), and ground truth. We compute a depth map from each surface normal map and then reconstruct a mesh. All three loss cases are visualized.

Fig. 6. Surface normal reconstruction results for an arbitrary lighting direction. From left to right, the columns show the RGB images, NIR images, estimated surface normals, and reconstructed 3D models.

Qualitative Analysis. Figures 4 and 5 show the qualitative results of our network. Our network is able to estimate the fine-scale textures of objects. Comparing \(L_2\) with \(L_2 + L_{ang}\), we find that the angular loss recovers finer-scale textures than the intensity loss alone. Adding the integrability constraint produces smoother surfaces. This demonstrates that our network is trained to follow physical properties relevant to SfS.

4.4 Shape Estimation at Arbitrary Lighting Direction

We evaluate our network on surface estimation under an arbitrary lighting direction. Without prior knowledge of the lighting direction, SfS becomes an even more challenging problem. As shown in Fig. 6, we captured several real-world objects. The glove has a complex surface geometry; note that the bumpy surface and the stitches at the bottom are reconstructed. The cap has a 'C' letter on it, and its geometry is reconstructed in the mesh result.

5 Conclusion

In this paper, we have presented a generative adversarial network for estimating surface normal maps from a single NIR image. As far as we are aware, this is the first work to estimate fine-scale surface geometry from a single NIR image using a deep CNN framework. The proposed network shows competitive performance without any lighting information. We demonstrated that our photometrically inspired objective function improves the quality of surface normal estimation. We also applied our network to NIR images captured under a different configuration from the training dataset and showed promising results.

Limitation and Future Work. In our work, we did not take inter-reflections into account, which might produce inaccurate normals in concave regions. We also observed the convexity/concavity ambiguity in some examples, analogous to conventional SfS methods. Further study is needed to resolve this problem. Our reconstructions might suffer from distortions of low-frequency geometry, as stated in Sect. 4. This is because we have a relatively small amount of training data, and we restrict our goal to estimating fine-scale geometry so that the network can be trained without overfitting to the limited data. Although we aimed at reconstructing fine-scale surface geometry, we believe our method can be further combined with various scene-level depth estimation techniques. Moreover, our network can be extended to estimate the lighting direction as well as surface normals, which could serve as a strong prior for conventional SfS methods.