1 Introduction

Methods for reconstructing 3D objects from 2D images and videos have undergone remarkable improvements in recent years. In general, these proposals use specific databases for each object type, although there is a trend toward developing general methods that compute 3D reconstructions for any object type.

Segmentation networks such as DeepLabv3+ [11] can be used to segment guitars from images. Encoder–decoder networks usually consist of two phases: first, the feature maps are reduced to capture the semantic information; then, the spatial information is recovered by upsampling techniques. This approach has proven successful in segmentation [6, 7, 12, 13]. The Xception module, which modifies Inception V3 to improve performance on large data sets, is now used as the main backbone in server environments [5]. In encoder architectures, low-resolution features are separated from higher-resolution ones and recovered using the decoder. Another approach argues that high-resolution representations should be maintained throughout the process by using a parallel network that connects the stages of the process and helps to reconstruct these features at the end. Recent work on high-resolution networks (HRNet) [14, 15] has shown very good performance.

2.2 3D object reconstruction

RGB image-based 3D reconstruction methods using convolutional neural networks (CNNs) have attracted increasing interest and shown impressive performance. Han et al. provided an overview of these methods in [16]. Hepperle et al. [17] examined the quality of 3D reconstructions for enhancing the experience in VR applications.

The development of deep learning techniques, and in particular, the increasing availability of large training data sets, has led to a new generation of methods capable of recovering the 3D geometry and structure of objects from one or more RGB images without the complex process of camera calibration.

2.2.1 Volumetric representation techniques

Volumetric representations partition the space around a 3D object into a 3D grid and allow existing deep learning architectures developed for 2D image analysis, such as encoder–decoders, to be used for 3D processes.

Some methods deal with 3D volume reconstruction in the form of a voxelized occupancy grid, for example, in [18, 19].

3.1 Guitar segmentation and classification

We defined a set of segmentation and classification methods to extract the information of the guitar region appearing in the image and to verify that the guitar to be reconstructed meets the minimum processing requirements. We follow a framework of weak classifiers that can be combined sequentially to simplify the creation of the databases and their generalization to various other objects.

The proposed method starts with the segmentation of the guitar from the image, and then, a chain of classifiers and segmentation methods is applied to this first segmentation: We classify the segmented guitar into frontal/non-frontal classes to check whether the guitar is frontal to the camera. If the classification reveals that the guitar is frontal, the process continues with a second classifier that detects whether the guitar is electric or classical. This determines the type of template needed to correctly fit and reconstruct the guitar model. Finally, another segmentation is performed to extract the detected regions of the classical/electric guitar. This segmentation is used to align the 3D template with the orientation of the guitar and improve edge fitting during 3D reconstruction.
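
As a minimal illustration (not the authors' implementation), this chain of weak classifiers can be sketched as a pipeline whose stages are injected as callables; all names below are hypothetical placeholders for the networks described in Sects. 3.1.1–3.1.4:

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class GuitarPipeline:
    """Chain of weak classifiers/segmenters from Sect. 3.1.

    Each stage is injected as a callable, so the chain itself stays
    independent of the concrete networks (PGN, ResNet50, ...)."""
    segment_guitar: Callable[[Any], Optional[Any]]   # Sect. 3.1.1: guitar mask or None
    is_frontal: Callable[[Any, Any], bool]           # Sect. 3.1.2: frontal / non-frontal
    classify_type: Callable[[Any, Any], str]         # Sect. 3.1.3: 'classical' | 'electric'
    segment_regions: Callable[[Any, Any, str], Any]  # Sect. 3.1.4: region labels

    def run(self, image):
        mask = self.segment_guitar(image)
        if mask is None:
            return None                              # no guitar found in the image
        if not self.is_frontal(image, mask):
            return None                              # only frontal guitars are reconstructed
        gtype = self.classify_type(image, mask)
        regions = self.segment_regions(image, mask, gtype)
        return {"mask": mask, "type": gtype, "regions": regions}
```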

3.1.1 Guitar/non-guitar segmentation

To obtain a correct 3D reconstruction, an accurate segmentation of the guitar is required. We use the database and segmentation presented in [9], with 2,200 RGB images of guitars (11,000 images after data augmentation), to train and test the selected network. We randomly select 80% of the original data for training and the remaining 20% for testing. This database is also used for the classification methods explained in the following sections.

To obtain the best segmentation, we performed a full evaluation of three of the best-performing CNNs for segmentation: DeepLabv3+ [11], HRNet [14, 15] and PGN. Each CNN was trained from scratch with 40,000 iterations.

Table 1 Comparison on evaluation sets applying 40K training iterations

The performance of all networks with 40K iterations is shown in Table 1. As we can see, DeepLabv3+, HRNet and PGN achieved a mean Intersection over Union (mIoU) of 88.47%, 95.31% and 96.61%, respectively. Figure 2a shows an example of the guitar segmentation achieved.
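
For reference, the mIoU reported in Table 1 is the standard per-class intersection over union averaged over the classes present; a minimal NumPy sketch of the metric (framework-independent, not the authors' evaluation code) is:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Intersection over Union between two integer label maps.

    pred, gt: HxW arrays of class indices. Classes absent from both
    maps are ignored when averaging."""
    ious = []
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue                                  # class not present in either map
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))
```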

Fig. 2

Examples of classification and segmentation. From left to right: a segmentation of guitar with [9]. b Frontal (top) and non-frontal (bottom) guitars. c Classical (top) and electric (bottom) guitars. d Labeling of the segmentation of the regions

Therefore, in our implementation, the PGN network is chosen to obtain a high-quality object segmentation. In Fig. 11, second column, we can see examples of guitar segmentation results obtained with this network.

3.1.2 Frontal/non-frontal guitar classification

To detect whether the guitar segmented in the previous step is frontal enough to be processed by our method, we developed a frontal/non-frontal classifier based on CNNs.

We use the guitar segmentation obtained in the previous step, cropped to a square block with a black background, as input to our classifier to determine whether the guitar image is frontal or non-frontal. As the classifier, we use ResNet50, a 50-layer residual network that has shown solid performance on appearance-based classification tasks [34]. The final number of samples per class after data augmentation was 123,625 frontal images and 151,000 non-frontal images.

Figure 2b shows an example of the images used in this classification process.

We adapted the database to the ResNet50 model and trained it on an NVIDIA Titan X GPU with 24 GB of memory for 6 epochs, with a batch size of 16, a stochastic gradient descent optimizer, and a learning rate of \(10^{-4}\). With this configuration, the CNN achieved a classification accuracy of 99.4%.
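
A minimal PyTorch sketch of this training configuration follows; the paper does not state the framework used or whether pre-trained weights were loaded, and the dataset object yielding (image, label) pairs is an assumption:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import models

def train_frontal_classifier(train_dataset):
    """Train a 2-class (frontal / non-frontal) ResNet50.

    Hyperparameters follow the text: 6 epochs, batch size 16, SGD,
    learning rate 1e-4. `train_dataset` is assumed to yield
    (image_tensor, label) pairs already resized for ResNet50."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = models.resnet50(weights=None)            # pre-training not specified in the paper
    model.fc = nn.Linear(model.fc.in_features, 2)    # two output classes
    model = model.to(device)

    loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()

    model.train()
    for _ in range(6):                               # 6 epochs
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```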

3.1.3 Classical/electric guitar classification

In our proposal, we use different 3D templates for classical and electric guitars to better fit the actual shape of the instrument. Thus, we need to determine the type of guitar so that the correct template can be applied. From the 989 frontal guitars extracted in Sect. 3.1.2, we obtained 470 classical and 519 electric guitars, of which 80% were randomly selected for training and the remaining 20% for testing. We then augmented them, obtaining 58,750 classical and 64,875 electric guitar images. Figure 2c shows sample images from this dataset. ResNet50 was trained with the same configuration as in Sect. 3.1.2, obtaining 98.3% accuracy.

3.1.4 Regions segmentation

This step is used to match the 3D template with the parts of the object, so that each region can be correctly located and placed when reconstructing the final 3D model. For each guitar type, we define different regions:

  • Classical guitar, five regions: Head, Neck, Body, Bridge and Hole.

  • Electric guitar, six regions: Head, Neck, Body, Bridge, Pickups and Controls.

Figure 2d shows a graphical representation of these regions.

To identify these regions in the segmented guitar, we use the PGN model [37, 38]. Since our templates are symmetric about the YZ plane, we do not need to check for a mirror transformation.

3.2.2 Boundary matching

The template is not a complete reconstruction of the model we are dealing with, but a rough approximation of its shape. Therefore, after aligning the template and the input silhouettes, we still need to find a boundary matching and perform silhouette warping in order to obtain any shape from the entire spectrum of possible shapes.

We need to find a boundary matching \(\omega \) between the silhouettes of our guitar template and the input guitar (see Fig. 4a). Given the contour of the segmented guitar \(\beta _g\), the pixels \(p_g \in \beta _g\) belonging to this contour, the contour of our template \(\beta _t\) and the pixels \(p_t \in \beta _t\) belonging to this contour, we want to warp \(\beta _t\) to its counterpart \(\beta _g\) to match the template to the real shape of the object. We are looking for a mapping \(\omega \) that defines the correspondence between the pixels belonging to \(\beta _g\) and \(\beta _{t,\omega }\) by minimizing the distance between all the associated pixels of the contour of the template and the real contour of the segmented guitar:

$$\begin{aligned} \mathop {\mathrm{arg\,min}}\limits _{\omega [0],\ldots ,\omega [m-1]} \sum _{i=0}^{m-1} \Vert p_{g,i}-p_{t,\omega [i]}\Vert _2 +\sigma (\omega [i],\omega [i+1]), \end{aligned}$$
(1)

where m is the number of pixels of the contour \(\beta _t\) and

$$\begin{aligned} \sigma (\omega [i],\omega [i+1]) = \left\{ \begin{array}{ll} 1, &{} \text {if } 0\le \omega [i+1]-\omega [i] \le k \\ \infty , &{} \text {otherwise} \end{array}\right. \end{aligned}$$
(2)

Therefore, \(\sigma (\omega [i],\omega [i+1])\) penalizes jumps between associations larger than k pixels. In our implementation, \(k = 128\) yields good results, but this value is closely tied to the working resolution we use (at most 350 \(\times \) 350 pixels).
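
One way to minimize Eq. (1) is dynamic programming over the contour indices. The sketch below is an illustration rather than the authors' solver: it treats both contours as open polylines (the wrap-around of closed contours is omitted) and keeps the constant \(\sigma\) penalty of Eq. (2) for feasible steps:

```python
import numpy as np

def boundary_matching(beta_g, beta_t, k=128):
    """Monotone boundary matching in the spirit of Eqs. (1)-(2).

    beta_g: (m, 2) array of input-contour pixels p_g.
    beta_t: (n, 2) array of template-contour pixels p_t.
    Returns omega, length-m array of matched template indices, under the
    constraint 0 <= omega[i+1] - omega[i] <= k."""
    m, n = len(beta_g), len(beta_t)
    # Pairwise Euclidean distances between the two contours.
    dist = np.linalg.norm(beta_g[:, None, :] - beta_t[None, :, :], axis=2)

    cost = np.full((m, n), np.inf)
    back = np.zeros((m, n), dtype=int)
    cost[0] = dist[0]
    for i in range(1, m):
        for j in range(n):
            lo = max(0, j - k)                        # allowed predecessors: j-k ... j
            prev = cost[i - 1, lo:j + 1]
            best = int(np.argmin(prev))
            cost[i, j] = dist[i, j] + prev[best] + 1.0  # +1.0 is the sigma penalty
            back[i, j] = lo + best

    # Backtrack the optimal association.
    omega = np.zeros(m, dtype=int)
    omega[-1] = int(np.argmin(cost[-1]))
    for i in range(m - 1, 0, -1):
        omega[i - 1] = back[i, omega[i]]
    return omega
```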

Fig. 4

Boundary matching between the silhouettes of our guitar template (green) and an input guitar (red). Each point represents a pixel of the boundary. a Boundary matching associations; b example of a bad association, where a pixel of the input guitar’s neck is associated with a pixel of the template guitar’s body

Depending on the value of k and the shape of the guitar, bad associations may occur, for example, when a pixel of the input guitar’s neck is matched to a pixel of the template guitar’s body (see Fig. 4b).

To solve this problem, we use the computed segmented regions. A boundary matching \(\omega _R\) is computed for the individual mask of each region R (with a smaller constraint \(k = 32\)), and the resulting mappings are combined. Each pixel \(p_g\) of the original silhouette \(\beta _g\) also belongs to the boundary \(\beta _{g,R}\) of at least one segmented region R, but a mapped pixel \(p_{t,R} \in \beta _{t,\omega }\) may or may not belong to the original silhouette of the template \(\beta _t\). We therefore keep those that belong to \(\beta _t\), obtaining an initial mapping \(\omega _\mathrm{init}\) whose gaps can easily be filled. Let u and v be two indices of \(\beta _g\) that have mappings \(\omega _\mathrm{init}[u]\) and \(\omega _\mathrm{init}[v]\) belonging to \(\beta _t\), with \(u < v\) and \(s = v - u\). We assign a mapping point in \(\beta _t\) to every index between u and v by decomposing the segment of \(\beta _t\) between \(\omega [u]\) and \(\omega [v]\) into s parts.
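
The gap-filling step can be sketched as a proportional (linear) interpolation of template indices between consecutive valid anchors. This is an illustration under stated assumptions: the index arrays are placeholders, and the wrap-around of a closed contour is ignored:

```python
import numpy as np

def fill_mapping_gaps(omega_init, valid):
    """Fill gaps in the partial mapping omega_init (Sect. 3.2.2).

    omega_init: int array of length m, mapping each beta_g index to a
    beta_t index where known. valid: boolean array marking the indices
    kept from the per-region matchings. Gaps between two valid indices
    u < v are filled by splitting the template span between
    omega_init[u] and omega_init[v] into s = v - u parts."""
    omega = omega_init.copy()
    anchors = np.flatnonzero(valid)
    for u, v in zip(anchors[:-1], anchors[1:]):
        s = v - u
        if s <= 1:
            continue                                  # no gap between these anchors
        # Proportional interpolation of the template indices.
        span = np.linspace(omega[u], omega[v], s + 1)
        omega[u + 1:v] = np.round(span[1:-1]).astype(int)
    return omega
```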

3.2.3 Occlusions

To handle the occlusions that musicians can create on guitars, we need to find an occlusion mask that indicates which parts of the guitar are occluded, but we also need to reconstruct the occluded parts of the boundary to obtain a reconstructed guitar mask. Finally, the map of the segmented regions must also be extended to cover the reconstructed mask.

Occlusion mask: We use PGN [40], taking the boundary pixels of each region and their matching pixels from the template (which are in turn computed with the boundary matching algorithm from Sect. 3.2.2) as pivots.

3.2.5 Meshing

Fig. 8

Stitching front and back meshes

We unproject each pixel of the warped depth maps to obtain the corresponding 3D vertex. Since warping the 2D silhouette can change the X and Y dimensions (making the silhouette larger or smaller in 2D), we scale the Z dimension accordingly to maintain the proportions of the guitar in all dimensions.

We create two triangles for each square of 4 pixels and obtain two meshes, one frontal and one posterior, which we stitch together along the silhouette (see Fig. 8).
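
A minimal sketch of this meshing step for a single (front or back) warped depth map follows. Stitching the two meshes along the silhouette and the Laplacian smoothing are omitted, and the pixel-grid unprojection shown here is an assumption about the exact camera model:

```python
import numpy as np

def depth_map_to_mesh(depth, mask, scale_z=1.0):
    """Build a triangle mesh from one warped depth map (front or back).

    depth: HxW float array; mask: HxW bool silhouette. Each valid pixel
    becomes a vertex (x, y, scale_z * depth); every 2x2 block of valid
    pixels yields two triangles. scale_z compensates the X/Y rescaling
    introduced by the 2D silhouette warping, as described in the text."""
    h, w = depth.shape
    index = -np.ones((h, w), dtype=int)
    verts = []
    for y in range(h):
        for x in range(w):
            if mask[y, x]:
                index[y, x] = len(verts)
                verts.append((x, y, scale_z * depth[y, x]))

    faces = []
    for y in range(h - 1):
        for x in range(w - 1):
            a, b = index[y, x], index[y, x + 1]
            c, d = index[y + 1, x], index[y + 1, x + 1]
            if min(a, b, c, d) >= 0:                  # all four corners inside the silhouette
                faces.append((a, b, c))               # two triangles per pixel square
                faces.append((b, d, c))
    return np.array(verts, dtype=float), np.array(faces, dtype=int)
```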

Finally, the entire mesh is smoothed using Laplacian smoothing.

3.2.6 Texture

The texture of the model can be obtained by directly projecting the texture of the masked guitar onto the front mesh. In this way, the quality and detail of the original image are preserved in the texture of the 3D model. There are several aspects to consider in this process. First, if an occlusion is present, we need to inpaint the input color image using the occlusion mask and the segmented regions. To do this, we fill each region of the occlusion mask by taking the largest possible patch from the unoccluded parts of the same region in the original color image. With this patch, we synthesize a texture that covers the corresponding region of the occlusion mask (using [41]), dilate it and paste it smoothly into the original occluded image. In this way, for example, the guitar body is inpainted using only patches of the body. Figure 9 shows an example of this process.

Fig. 9

Front texture inpainting

We limit our texture synthesis approach to regions where we can find a sufficiently large patch (between \(60\times 60\) and \(100\times 100\) pixels), and otherwise use Exemplar Inpainting [42, 43].
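
The paper does not describe how the largest usable patch is searched for; one plausible sketch (an assumption, not the authors' procedure) uses a chessboard distance transform to measure the largest unoccluded square inside a region and then applies the lower threshold above:

```python
import numpy as np
from scipy.ndimage import distance_transform_cdt

def largest_square_patch_side(region_free_mask):
    """Side length of the largest axis-aligned square of unoccluded
    pixels inside a region. The chessboard distance transform gives, at
    each pixel, the half-side of the largest centered square that fits."""
    dt = distance_transform_cdt(region_free_mask.astype(np.uint8), metric="chessboard")
    r = int(dt.max())
    return 2 * r - 1 if r > 0 else 0

def choose_inpainting_strategy(region_free_mask, min_side=60):
    """Mirror of the rule in the text: synthesize a texture from the
    largest available patch when it is at least min_side x min_side,
    otherwise fall back to exemplar-based inpainting [42, 43]."""
    side = largest_square_patch_side(region_free_mask)
    return "patch_synthesis" if side >= min_side else "exemplar_inpainting"
```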

The resulting inpainted texture is then projected onto the front mesh as a color texture. For the back texture, we use a similar strategy for inpainting the occluded parts: We use [41] to synthesize a texture that covers the entire silhouette using the largest possible patch in the guitar body region of the front texture. Thus, we assume that the back of each guitar has the same color and texture as the body. Figure 10 shows several examples of back textures.

When stitching the front and back meshes, we also ensure that the corresponding front and back boundary vertices have the same UV coordinates in the final texture. This results in faces that map to the boundary of the texture as if we were stretching those pixels.

To add additional detail and relief, we also compute a bump map from the resulting front and back textures (we compute the horizontal and vertical derivatives of the grayscale textures and multiply them by a strength factor). All these texture operations are performed at the same resolution as the original input image to preserve the maximum texture quality.
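
A minimal sketch of the bump-map computation as described (horizontal and vertical derivatives of the grayscale texture scaled by a strength factor); how the two derivative channels are packed for the renderer is our assumption:

```python
import numpy as np

def bump_map(gray, strength=1.0):
    """Bump map from a grayscale texture, following the description in
    the text: image derivatives multiplied by a strength factor.

    gray: HxW float array. Returns an HxWx2 array holding the scaled
    horizontal and vertical derivatives per pixel."""
    gy, gx = np.gradient(gray.astype(float))   # vertical and horizontal derivatives
    return np.stack([strength * gx, strength * gy], axis=-1)
```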

Fig. 10

Examples of synthesized back textures along with their extracted patches

4 Results and evaluation

The proposed system has been tested to evaluate its quality and numerical performance compared to other reference methods.

Figure 11 shows some results obtained with images from the Internet, while Fig. 12 compares some resulting models with the corresponding ground truth when renders of these models (taken from the ShapeNet database) are used as input. Reconstruction quality degrades under heavy occlusion: Fig. 11 shows examples in row 4, where the lower part of the guitar is not reconstructed correctly because it is occluded by the grass, and in row 7, where both hands occlude the guitar and the final reconstruction resolves these areas incompletely. If the segmentation of the guitar fails by inserting an element of the scene into the object mask, this element can be carried over into the final 3D model. This is the case in Fig. 11, row 3, where the neck of the guitar is not segmented correctly and the support is inserted into the reconstruction mask.

In terms of computational cost, our implementation performs each of the steps described in this work sequentially. Using a Windows 10 PC with an AMD Ryzen 7 3700X 8-core processor, 32 GB of RAM and an Nvidia RTX 2700 GPU, our setup can generate the 3D model of a guitar from an image in about 2 min. This runtime could be reduced through parallelization and code optimization.

6 Conclusions

In this paper, we presented a complete system for the 3D reconstruction of objects in frontal RGB images based on template deformation, using guitars as a case study to explain the method. It produces realistic 3D reconstructions in both shape and texture and handles occlusions that can hide parts of the object.

Unlike other reference methods, we work with both shape and texture and take into account occlusions present in the images. Therefore, the 3D models of our reconstructed guitars are accurate and realistic and can be used in 3D virtual reconstructions. Moreover, we have shown that our pipeline can be adapted to other objects, provided that a suitable 3D template and specific segmentation and classification techniques are used. Compared to other reference methods based mainly on CNNs, our proposal simplifies the 3D reconstruction process by requiring less data and training to obtain a realistic reconstruction of 3D objects.

For future improvements, we plan to address 3D reconstruction from other viewpoints and multiview configurations and to conduct a perceptual study to validate our reconstructions in a virtual environment. In summary, we believe that the work presented in this paper is a step toward automatic and realistic 3D object reconstruction and will be useful in creating 3D content for virtual reality.