Abstract
We present surface normal estimation using a single near infrared (NIR) image. We focus on reconstructing fine-scale surface geometry from an image captured with an uncalibrated light source. To tackle this ill-posed problem, we adopt a generative adversarial network, which is effective in recovering the sharp outputs essential for fine-scale surface normal estimation. We incorporate the angular error and an integrability constraint into the objective function of the network so that the estimated normals reflect physical characteristics. We train and validate our network on a recent NIR dataset, and also evaluate the generality of our trained model on new external datasets captured with a different camera under different environments.
Y. Yoon and G. Choe contributed equally to this work.
1 Introduction
Estimating surface geometry is a fundamental problem in understanding the properties of an object and reconstructing its 3D information. There are two different approaches: geometric methods such as structure-from-motion and multi-view stereo, and photometric methods such as photometric stereo and shape-from-shading. The geometric methods are usually useful for metric reconstructions while the photometric methods are effective in estimating accurate per-pixel surface geometry.
Recently, with the massive use of commercial depth sensors, e.g., Kinect and RealSense, many works have been proposed to enhance the depth quality of the sensors by fusing the photometric cues of the color image [1, 2] or the near infrared (NIR) image [3, 4]. Although these methods have proven their effectiveness in photometric shape estimation and have provided promising results, they depend heavily on the sensors and usually require long computation times.
On the other hand, deep convolutional neural networks (CNNs) have been broadly used for various computer vision tasks such as image classification [5, 6], object detection [7, 8], semantic segmentation [9, 10], and depth and surface normal estimation [11–13].
The major contributions of our work are as follows:
• First work analyzing the relationship between an NIR image and its surface normal using a deep learning framework.
• Fine-scale surface normal estimation using a single NIR image where the light direction need not be calibrated.
• Suitable design of an objective function to reconstruct the fine details of a target object surface.
2 Related Work
Photometric Stereo and Shape from Shading. Photometric stereo [15] is one of the most well-studied methods for estimating surface normals. By taking at least three images captured under different lighting directions, photometric stereo can determine a unique set of surface normals of an object. Using more images further increases the accuracy, since the estimation becomes an over-determined problem.
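As a concrete illustration of the over-determined case, the Lambertian model relates intensity to the normal by \(I = \rho \, L \cdot n\). The following NumPy sketch (an illustrative toy, not part of the paper) recovers the albedo-scaled normal by least squares:

```python
import numpy as np

def photometric_stereo(intensities, light_dirs):
    """Recover a unit normal and albedo for one pixel from k >= 3
    intensity measurements under known lighting, assuming the
    Lambertian model I = rho * (L . n) with no shadowing.

    intensities: (k,) pixel intensities
    light_dirs:  (k, 3) unit lighting directions
    """
    L = np.asarray(light_dirs, dtype=float)
    I = np.asarray(intensities, dtype=float)
    # Least-squares solve of L @ (rho * n) = I; over-determined when k > 3.
    g, *_ = np.linalg.lstsq(L, I, rcond=None)
    rho = np.linalg.norm(g)
    return g / rho, rho

# Example: a surface tilted toward +x, albedo 0.8, three lights
n_true = np.array([0.6, 0.0, 0.8])
L = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
L = L / np.linalg.norm(L, axis=1, keepdims=True)
I = 0.8 * L @ n_true
n_est, rho = photometric_stereo(I, L)
```

With more than three lights, `lstsq` averages out noise, which is why accuracy improves in the over-determined setting.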
Shape from shading is a special case of photometric stereo, which predicts a shape from a single image. However, it is an ill-posed problem and needs to exploit many restrictions and constraints [16, 17]. Beginning with numerical SfS methods [18], many works have shown results based on the Lambertian BRDF assumption. Tsai and Shah [19] use a discrete approximation of surface normals. Lee and Kuo [20] estimate shape by using a triangular element surface model. We refer readers to [21] for comparisons and evaluations of the classical SfS methods.
Shape from an NIR image has recently been studied in several works [3, 14]. These works analyze the discriminative characteristics of NIR images and experimentally show the albedo (surface reflectance) simplicity of various materials in the NIR wavelength. In [3, 4], shape refinement methods using the photometric cues in NIR images are proposed. They show high-quality shape recovery results; however, they require an additional depth camera.
Although many conventional photometric approaches can work on NIR images, and the albedo simplicity in the NIR image indeed helps robust estimation, estimating the surface normal from a single NIR image still has many limitations for practical use, such as heavy computation time, heuristic assumptions, special system configurations, and the calibration of a light direction. To overcome those limitations, we study the mapping from NIR intensity distributions to surface normal vectors via a deep CNN framework. We combine a GAN [22] with a specially designed objective function. Through the adversarial training process, our network naturally encodes the photometric cues of a scene and produces fine surface normals.
Data-Driven Shape Estimation. There have been various studies on estimating the shape information from images via data-driven approaches. Saxena et al. [23] estimate depths using a discriminatively trained MRF model with multiple scales of monocular cues. Hoiem et al. [24] reconstruct rough surface orientations of a scene by statistically modeling categories of coarse structures (e.g., ground, sky and vertical). Ladicky et al. [25] incorporate semantic labels of a scene to predict better depth outputs.
One of the emerging directions for shape estimation is using deep CNNs. In [26], Fouhey et al. try to discover the right primitives in a scene. In [13], Wang et al. explore the effectiveness of CNNs for the task of surface normal estimation. Although this work infers the surface normals from a single color image, it outputs scene-level rough geometries and is not suitable for object-level detailed surface reconstruction. To estimate the object shape and the material property, Rematas et al. [27] estimate reflectance maps from a single image.

3 Method

3.1 Generative Adversarial Networks

A generative adversarial network (GAN) [22] trains a generative model G and a discriminative model D in an alternating manner. The discriminative parameters \(\theta _D\) are first trained to classify real images as real and generated images as fake. Then, with \(\theta _D\) fixed, the generative parameters \(\theta _G\) are trained to produce images of better quality, which could be misclassified by the discriminative network as real images. These procedures are repeated until they converge. This minimax objective is denoted as:

$$\min _{\theta _G} \max _{\theta _D} \; \mathbb {E}_{Y \sim D_{desire}}[\log D(Y)] + \mathbb {E}_{Z \sim D_{input}}[\log (1 - D(G(Z)))]$$
where \(D_{desire}\) is the distribution of images that we desire to estimate and \(D_{input}\) is that of the input domain. This objective function encourages D to assign the correct label to both real and generated images, and makes G generate a realistic output G(Z) from an input Z. In our method, both the generative and the discriminative model are based on convolutional networks. The former takes a single NIR image as an input and produces a three-dimensional normal map as an output. The latter classifies an input by using the binary cross-entropy, assigning a high probability when the input comes from the training data.
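The minimax value above can be evaluated directly from the discriminator's probability outputs. A small NumPy sketch (illustrative only; the function name and setup are ours, not from the paper):

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-7):
    """Value of the GAN minimax objective
        V(D, G) = E[log D(Y)] + E[log(1 - D(G(Z)))],
    given D's probability outputs on real samples (d_real) and on
    generated samples (d_fake). D ascends V while G descends it."""
    d_real = np.clip(np.asarray(d_real, float), eps, 1 - eps)
    d_fake = np.clip(np.asarray(d_fake, float), eps, 1 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# At the theoretical equilibrium D outputs 0.5 everywhere: V = -2 log 2
v = gan_value([0.5, 0.5], [0.5, 0.5])
```

The equilibrium value \(-2\log 2\) is the well-known optimum of the original GAN game when the generated distribution matches the real one.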
3.2 Deep Shape from Shading
Based on the generative adversarial network explained in Sect. 3.1, we modify the GAN model to be suitable for the shape-from-shading problem. Since shape-from-shading is an ill-posed problem, it is important to incorporate proper constraints to uniquely determine the right solution. Therefore, we combine the angular error and an integrability loss, which have been shown to be effective in many conventional SfS methods, into the objective function of the generative network. Also, the existing GAN approaches typically take a random noise vector [22], a pre-encoded vector [30], or an image [31, 32] as the input of their generative networks, and each generative model produces an output which lies in the same domain as its input. In this work, we apply the generative model to produce a three-dimensional normal map from an NIR image, where the two data lie in different domains. Compared to the conventional SfS methods, we do not need to calibrate the lighting directions. To the best of our knowledge, our work is the first application of adversarial training to estimate fine-scale geometry from a single NIR image.
Generative Networks. We use a fully convolutional network to construct the generative network. This type of a convolutional model was recently adopted in image restoration [33, 34] and was verified to have superior performance in the task. To keep the image size of the input and output constant, we pad zeros before the convolution operations. Through our experiments, we found that this strategy works well in reconstructing the normal map.
Our network architecture is depicted in Fig. 2. We feed a 64\(\,\times \,\)64 NIR patch to the generative network as an input. The network consists of 5 convolution layers (128-256-256-128-3 convolution filters at each layer), each followed by a ReLU except the last layer. Since the generative network is fully convolutional, the output of the network has the same size as the input NIR image. We have empirically determined the number and sizes of filters for all networks.
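The claim that zero padding keeps the input and output sizes equal follows from the convolution size formula \(\lfloor (W + 2p - k)/s \rfloor + 1\). A small sketch verifies this for a 64\(\times\)64 patch; the 3\(\times\)3 kernel size below is an assumption, since the text fixes only the filter counts:

```python
def conv_out_size(w, k, p, s=1):
    """Spatial output size of a convolution: floor((w + 2p - k)/s) + 1."""
    return (w + 2 * p - k) // s + 1

def same_pad(k):
    """Zero padding that preserves spatial size for stride-1, odd kernel k."""
    return (k - 1) // 2

# Five stride-1 conv layers on a 64x64 patch; 3x3 kernels are assumed here.
size = 64
for k in [3, 3, 3, 3, 3]:
    size = conv_out_size(size, k, same_pad(k))
```

Without padding, each 3\(\times\)3 layer would shrink the patch by 2 pixels per side, so the 64\(\times\)64 output would not align with the ground-truth normal map.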
Discriminative Networks. Given the output of the generative network, a typical choice of the objective function is the averaged \(L_1\) or \(L_2\) distance between the ground truth and the generated output. However, such a choice has limitations when applied to our problem. The \(L_2\) distance produces blurry predictions because it assumes that the errors follow a Gaussian distribution. With the \(L_1\) distance, this effect is diminished, but the estimated images would be the median of the set of equally likely intensities. We therefore add the discriminative network as a loss function alongside the distance metric. Recently, [31] showed that the combination of distance, gradient and discriminative networks as a loss function provides realistic and accurate outputs. Our discriminative model has a binary cross-entropy loss that assigns a high probability when the input is a real image, and vice versa.
3.3 Training
We now explain how we iteratively train the generative model G and the discriminative model D. Let us consider a single NIR image \(Z \in \{{Z_1,Z_2,\ldots ,Z_j}\}\) from a training dataset and the corresponding ground truth normal map \(Y \in \{{Y_1,Y_2,\ldots ,Y_j}\}\). The training dataset covers various objects captured from diverse lighting directions, and we uniformly sampled images from the dataset to balance the lighting directions.
Basically, we follow the procedure of [30]. Given N paired images, we first train D to classify the real image pair (Z, Y) into class 1 and the generated pair (Z, G(Z)) into class 0. In this step, we fix the parameters \((\theta _G)\) of the generative network G to solely update the parameters \((\theta _D)\) of D. The objective function of the discriminative model is denoted as:

$$\mathcal {L}_{D} = \sum _{i=1}^{N} \mathcal {D}_{bce}(D(Z_i, Y_i), 1) + \mathcal {D}_{bce}(D(Z_i, G(Z_i)), 0)$$

where \(\mathcal {D}_{bce}\) is the binary cross-entropy, defined as

$$\mathcal {D}_{bce}(O, C) = -\sum _{i} \left[ C_i \log (O_i) + (1 - C_i)\log (1 - O_i) \right] $$

where \(C_i\) is the binary class label. We minimize the objective function so that the network outputs high probability scores for real images \(Y_i\) and low probability scores for generated images \(G(Z_i)\).
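A minimal NumPy sketch of the binary cross-entropy and the resulting discriminator loss described above (function names are ours; the clipping epsilon is a standard numerical safeguard, not from the paper):

```python
import numpy as np

def bce(outputs, labels, eps=1e-7):
    """Binary cross-entropy D_bce summed over samples:
       -sum_i [ C_i log(O_i) + (1 - C_i) log(1 - O_i) ]"""
    o = np.clip(np.asarray(outputs, float), eps, 1 - eps)
    c = np.asarray(labels, float)
    return -np.sum(c * np.log(o) + (1 - c) * np.log(1 - o))

def discriminator_loss(d_real, d_fake):
    """L_D: push D(real pair) toward class 1 and D(generated pair)
    toward class 0, as in the alternating training step for D."""
    r = np.asarray(d_real, float)
    f = np.asarray(d_fake, float)
    return bce(r, np.ones_like(r)) + bce(f, np.zeros_like(f))
```

When D separates the two classes well (high scores on real, low on fake), this loss approaches zero; a confused D outputting 0.5 everywhere incurs \(2\log 2\) per pair.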
After that, we keep the parameters of D fixed and train the generative model G. Many previous deep learning based image restoration and generation methods [33, 35] used the mean square error (MSE) loss to minimize the difference between the ground-truth images and the output images. However, as studied in conventional SfS works, estimating accurate surface normal maps requires minimizing angular errors, and the output normals should satisfy the integrability constraint. Therefore, we modify the objective function of the GAN model to incorporate these photometric objectives. In this way, we can effectively reduce the angular error and estimate physically meaningful surface normals.
Specifically, to evaluate the surface normal properly, we define the objective function of our generative network as:

$$\mathcal {L}_{G} = \mathcal {L}_{adv} + \lambda _{p}\mathcal {L}_{p} + \lambda _{ang}\mathcal {L}_{ang} + \lambda _{curl}\mathcal {L}_{curl}$$

where \(\mathcal {L}_{adv} = \mathcal {D}_{bce}(D(Z, G(Z)), 1)\) is the adversarial loss. Following the conventional \(L_1\) or \(L_2\) loss, the estimated normal map difference \(\mathcal {L}_{p}\) is denoted as:

$$\mathcal {L}_{p} = \Vert Y - G(Z)\Vert _{p}^{p}$$

where \(p=1\) or \(p=2\).
To assess accuracy in photometric stereo, the angular error is often used in conventional photometric approaches because it describes a more physically meaningful error than the direct normal map difference. To minimize the angular error, we normalize both the estimated normals G(Z) and the ground-truth normals Y, then simply apply the dot product between them:

$$\mathcal {L}_{ang} = -\frac{1}{N}\sum _{i} \left\langle \frac{Y_i}{\Vert Y_i\Vert }, \frac{G(Z)_i}{\Vert G(Z)_i\Vert } \right\rangle $$
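A NumPy sketch of an angular loss in this spirit, written as one minus the cosine between normalized normals (the affine offset is our choice so that a perfect estimate gives zero loss; it does not change the gradients):

```python
import numpy as np

def angular_loss(n_est, n_gt, eps=1e-8):
    """Mean angular-error proxy between an estimated and a ground-truth
    normal map, both of shape (H, W, 3): normalize each normal, take
    the per-pixel dot product, and average 1 - cosine over the map."""
    a = n_est / (np.linalg.norm(n_est, axis=-1, keepdims=True) + eps)
    b = n_gt / (np.linalg.norm(n_gt, axis=-1, keepdims=True) + eps)
    cos = np.sum(a * b, axis=-1)
    return float(np.mean(1.0 - cos))

# A perfect estimate has (near-)zero angular loss
identical = np.tile(np.array([0.0, 0.6, 0.8]), (4, 4, 1))
loss_identical = angular_loss(identical, identical)
```

Unlike a raw intensity difference, this measure is invariant to the magnitude of the predicted vectors and directly penalizes the angle between normals.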
The angular error provides a physically meaningful measure; however, it is averaged over the entire set of surface normals. In order to encourage the generative network to estimate photometrically correct surface normals, we also add the integrability constraint over local neighborhoods to the objective function. With the surface gradients \(p = -n_x/n_z\) and \(q = -n_y/n_z\), it is denoted as:

$$\mathcal {L}_{curl} = \sum \left( \frac{\partial p}{\partial y} - \frac{\partial q}{\partial x} \right) ^2$$
The integrability constraint enforces that the integral of normal vectors along a closed local loop must sum to zero, meaning that the surface returns to the same height. This prevents drastic changes and guarantees that the estimated normals lie on the same surface in a local region.
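The local integrability check can be sketched with finite differences on the surface gradients \(p = -n_x/n_z\), \(q = -n_y/n_z\) (the paper evaluates it over a \(5\times 5\) sliding window; the dense finite-difference form below is an illustrative simplification):

```python
import numpy as np

def curl_loss(normals, eps=1e-8):
    """Integrability residual of a normal map of shape (H, W, 3).
    An integrable surface satisfies dp/dy = dq/dx, so we penalize
    the squared curl residual with forward differences."""
    nx, ny, nz = normals[..., 0], normals[..., 1], normals[..., 2]
    p = -nx / (nz + eps)
    q = -ny / (nz + eps)
    dp_dy = np.diff(p, axis=0)[:, :-1]   # both residual grids are
    dq_dx = np.diff(q, axis=1)[:-1, :]   # evaluated on (H-1, W-1)
    return float(np.mean((dp_dy - dq_dx) ** 2))

# A planar patch (constant normal) is perfectly integrable
flat = np.tile(np.array([0.3, 0.2, 0.9]), (5, 5, 1))
loss_flat = curl_loss(flat)
```

A constant (planar) normal field gives zero residual, while a field whose p varies with y without a matching variation of q with x is penalized.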
4 Experiment
4.1 Dataset
To apply a deep learning framework to our purpose, a good-quality dataset with numerous examples is required for training. However, most existing datasets are not large enough to train the network and are often inadequate for our task. Recently, Choe et al. [14] released a new NIR benchmark dataset containing 101 real-world objects, such as fabrics, leaves and paper, taken at 9 views and 12 lighting directions.
We used pairs of NIR images as input and surface normal maps as ground-truth targets. For fine-scale refinement, we augmented each NIR image into 12 patches (\(64\times 64\)) within a single ground truth. For training, we used images from 91 objects; the remaining objects form the validation and test datasets. Note that we uniformly sampled validation and test samples according to the object category. When training the network, we normalized the NIR images and normal maps to the range \([-1, 1]\).
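A sketch of the preprocessing described above; the non-overlapping tiling that yields 12 patches is our assumption about the augmentation layout, since the paper does not spell it out:

```python
import numpy as np

def normalize(img):
    """Map values in [0, 1] to [-1, 1], as done for both the NIR
    inputs and the target normal maps before training."""
    return img * 2.0 - 1.0

def extract_patches(img, patch=64, stride=64):
    """Crop patch x patch tiles from a 2-D image on a regular grid.
    With stride == patch the tiles are non-overlapping."""
    h, w = img.shape[:2]
    return [img[y:y + patch, x:x + patch]
            for y in range(0, h - patch + 1, stride)
            for x in range(0, w - patch + 1, stride)]

# A hypothetical 256x192 image tiles into 4 x 3 = 12 patches of 64x64
patches = extract_patches(np.zeros((256, 192)))
```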
4.2 Training Parameters
We provide the parameters used to train our proposed network. The configuration of the network is depicted in Table 1. Training used batches of size 32. For initializing weights, we used a Gaussian distribution with zero mean and a standard deviation of 0.02. We trained all experiments using the Adam optimizer [36] with momentum \(\beta _1 = 0.5\). The learning rate started from 0.0002 and decreased by a factor of 0.95 every 5000 iterations. For balancing the scale of normalization, we set a hyperbolic tangent at the end of the generative network. Lastly, we used a \(5\times 5\) sliding window with a 3-pixel overlap to compute the integrability. In the optimization procedure, we used a combined loss function including the intensity (\(L_p\)), angular (\(L_{ang}\)), and integrability (\(L_{curl}\)) terms. Note that we did not tune the weights of each loss term and set them all equal, \(\lambda _p = \lambda _{ang} = \lambda _{curl} = 1\).
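The step-decay learning-rate schedule described above can be written directly:

```python
def learning_rate(iteration, base=2e-4, decay=0.95, step=5000):
    """Step decay: start at 2e-4 and multiply by 0.95 every 5000
    iterations (the schedule used to train the network)."""
    return base * decay ** (iteration // step)
```

For example, the rate stays at 2e-4 for the first 5000 iterations, then drops to 1.9e-4, and so on.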
4.3 Experimental Result
We use TensorFlow to implement and train the proposed network. Since the proposed network is fully convolutional, we apply it to the entire NIR image at evaluation time. The computation time to estimate a surface normal map is about 2 s on a Titan X, while the conventional shape-from-shading method takes 10 min with a Matlab implementation.
Quantitative Analysis. For the quantitative evaluation, we first validate each term of our cost function. In this experiment, we tested our method using the 3rd lighting direction among the 12 lighting directions. To evaluate the performance of our method, we use three metrics: angular error, good pixel ratio and intensity error. All the quantitative errors are shown in Table 2. Compared to using only the intensity loss, the performance is improved when the angular cost function is added. This validates that our angular loss measures a physically meaningful error. The integrability term ensures the continuity of the local normals. Although integrability is satisfied for most smooth surfaces, it does not guarantee a performance improvement on some non-smooth surfaces. In our experiments, the \(L_2 + L_{ang}\) loss function shows the best performance for the all-views case, and \(L_1 + L_{ang}\) achieves the lowest error for the center-view case.

We compare our results with the conventional SfS method and verify that our framework performs competitively. We also compare our method with the deep CNN-based surface normal estimation method [12]. Although this method estimates surface normals, it is designed for reconstructing scene-level low-frequency geometries and is not suitable for our purpose. We also measure errors for the single view which provides the best performance. Since extreme viewing directions are saturated or under-exposed in some cases, measuring the error of the single view results in lower errors. We found that estimated normal maps are distorted at extreme view points (errors in low-frequency geometry). To evaluate the fine-scale (high-frequency) geometry, we define a detail map M based on the measure in [37]. This measure is computed as \(M = f(Y) + G(Z) - f(G(Z))\), where f is a smoothing function. Table 3 shows the result.
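The detail map can be sketched as follows; the paper does not specify the smoothing function f, so a box filter is used here as an illustrative choice:

```python
import numpy as np

def box_smooth(img, k=5):
    """Box-filter smoothing of a 2-D array with edge padding; stands in
    for the unspecified smoothing function f."""
    pad = k // 2
    padded = np.pad(img, pad, mode='edge')
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def detail_map(gt, est, f=box_smooth):
    """M = f(Y) + G(Z) - f(G(Z)): the low-frequency component comes from
    the ground truth Y, the high-frequency detail from the estimate."""
    return f(gt) + est - f(est)

# With a perfect estimate, M reduces to the ground truth itself
gt = np.linspace(0.0, 1.0, 64).reshape(8, 8)
m = detail_map(gt, gt)
```

This construction isolates high-frequency errors: any low-frequency distortion in the estimate is replaced by the ground truth's low-frequency content before comparison.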
Qualitative Analysis. Figures 4 and 5 show the qualitative results of our network. Our network is able to estimate fine-scale textures of objects. Comparing \(L_2\) and \(L_2 + L_{ang}\), we observe that the angular loss recovers finer-scale textures than the intensity loss alone. Adding the integrability constraint produces a smoother surface. This demonstrates that our network is trained to follow physical properties relevant to SfS.
4.4 Shape Estimation at Arbitrary Lighting Direction
We evaluate our network for surface estimation under an arbitrary lighting direction. Without prior knowledge of the lighting direction, SfS becomes a more challenging problem. As shown in Fig. 6, we captured several real-world objects. The glove has a complex surface geometry; note that the bumpy surface and the stitches at the bottom are reconstructed. The cap has a 'C' letter on it, and its geometry is reconstructed in the mesh result.
5 Conclusion
In this paper, we have presented a generative adversarial network for estimating surface normal maps from a single NIR image. As far as we are aware, this is the first work to estimate fine-scale surface geometry from a single NIR image using a deep CNN framework. The proposed network shows competitive performance without any lighting information. We demonstrated that our photometrically-inspired objective function improves the quality of surface normal estimation. We also applied our network to arbitrary NIR images captured under a different configuration from the training dataset and obtained promising results.
Limitation and Future Work. In our work, we did not take inter-reflections into account, which might produce inaccurate normals in concave regions. We also observed the convexity/concavity ambiguity in some examples, analogous to conventional SfS methods. Further study should be conducted to resolve this problem. Our reconstruction might suffer from distortions of low-frequency geometry, as stated in Sect. 4. This is because we have a relatively small amount of training data, and we restricted our goal to estimating fine-scale geometry so as to train our network without overfitting to the limited training data. Although we aimed at reconstructing fine-scale surface geometry, we believe our method can be further combined with various scene-level depth estimation techniques. Moreover, our network can be extended to estimate a lighting direction as well as surface normals, which can serve as a strong prior for conventional SfS methods.
References
Han, Y., Lee, J.Y., Kweon, I.: High quality shape from a single RGB-D image under uncalibrated natural illumination. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1617–1624 (2013)
Yu, L.F., Yeung, S.K., Tai, Y.W., Lin, S.: Shading-based shape refinement of RGB-D images. In: Proceedings of the IEEE International Conference on Computer Vision (2013)
Choe, G., Park, J., Tai, Y.W., Kweon, I.S.: Exploiting shading cues in Kinect IR images for geometry refinement. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3922–3929 (2014)
Haque, S., Chatterjee, A., Govindu, V.: High quality photometric reconstruction using a depth camera. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2275–2282 (2014)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition (2015). arXiv:1512.03385
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
Yoo, D., Park, S., Lee, J.Y., Paek, A.S., Kweon, I.S.: AttentionNet: aggregating weak directions for accurate object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2659–2667 (2015)
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
Hong, S., Noh, H., Han, B.: Decoupled deep neural network for semi-supervised semantic segmentation. In: Advances in Neural Information Processing Systems, pp. 1495–1503 (2015)
Li, B., Shen, C., Dai, Y., van den Hengel, A., He, M.: Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFS. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1119–1127 (2015)
Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2650–2658 (2015)
Wang, X., Fouhey, D., Gupta, A.: Designing deep networks for surface normal estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 539–547 (2015)
Choe, G., Narasimhan, S.G., Kweon, I.S.: Simultaneous estimation of near IR BRDF and fine-scale surface geometry. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
Woodham, R.J.: Photometric method for determining surface orientation from multiple images. Opt. Eng. 19(1), 139–144 (1980)
Zheng, Q., Chellappa, R.: Estimation of illuminant direction, albedo, and shape from shading. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 540–545 (1991)
Barron, J.T., Malik, J.: Shape, albedo, and illumination from a single image of an unknown object. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 334–341. IEEE (2012)
Ikeuchi, K., Horn, B.K.: Numerical shape from shading and occluding boundaries. Artif. Intell. 17(1–3), 141–184 (1981)
Tsai, P.S., Shah, M.: Shape from shading using linear approximation. Image Vis. Comput. 12(8), 487–498 (1994)
Lee, K.M., Kuo, C.: Shape from shading with a linear triangular element surface model. IEEE Trans. Pattern Anal. Mach. Intell. 15(8), 815–822 (1993)
Zhang, R., Tsai, P.S., Cryer, J.E., Shah, M.: Shape-from-shading: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 21(8), 690–706 (1999)
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)
Saxena, A., Chung, S.H., Ng, A.Y.: Learning depth from single monocular images. In: Advances in Neural Information Processing Systems, pp. 1161–1168 (2005)
Hoiem, D., Efros, A.A., Hebert, M.: Automatic photo pop-up. ACM Trans. Graph. (TOG) 24(3), 577–584 (2005)
Ladicky, L., Shi, J., Pollefeys, M.: Pulling things out of perspective. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 89–96 (2014)
Fouhey, D., Gupta, A., Hebert, M.: Data-driven 3D primitives for single image understanding. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3392–3399 (2013)
Rematas, K., Ritschel, T., Fritz, M., Gavves, E., Tuytelaars, T.: Deep reflectance maps (2015). arXiv:1511.04384
Liu, F., Shen, C., Lin, G.: Deep convolutional neural fields for depth estimation from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5162–5170 (2015)
Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Advances in Neural Information Processing Systems, pp. 2366–2374 (2014)
Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks (2015). arXiv:1511.06434
Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error (2015). arXiv:1511.05440
Denton, E.L., Chintala, S., Fergus, R., et al.: Deep generative image models using a Laplacian pyramid of adversarial networks. In: Advances in Neural Information Processing Systems, pp. 1486–1494 (2015)
Kim, J., Lee, J.K., Lee, K.M.: Accurate image super-resolution using very deep convolutional networks (2015). arXiv:1511.04587
Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks (2015). arXiv:1501.00092
Flynn, J., Neulander, I., Philbin, J., Snavely, N.: DeepStereo: learning to predict new views from the world's imagery (2015). arXiv:1506.06825
Kingma, D., Ba, J.: Adam: a method for stochastic optimization (2014). arXiv:1412.6980
Nehab, D., Rusinkiewicz, S., Davis, J., Ramamoorthi, R.: Efficiently combining positions and normals for precise 3D geometry. ACM Trans. Graph. (TOG) 24(3), 536–543 (2005)
Acknowledgements
This research was supported by the Ministry of Trade, Industry & Energy and the Korea Evaluation Institute of Industrial Technology (KEIT) with the program number of 10060110.
© 2016 Springer International Publishing AG
Cite this paper
Yoon, Y., Choe, G., Kim, N., Lee, J.Y., Kweon, I.S. (2016). Fine-Scale Surface Normal Estimation Using a Single NIR Image. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision – ECCV 2016. Lecture Notes in Computer Science, vol. 9907. Springer, Cham. https://doi.org/10.1007/978-3-319-46487-9_30
Print ISBN: 978-3-319-46486-2. Online ISBN: 978-3-319-46487-9