1 Introduction

As a fundamental topic in computer vision and graphics, 3D face reconstruction can be used for face recognition [1,2,3,4], face alignment [5,6,7,8], emotion analysis [9], and face animation [10]. Over the years, many novel approaches to 3D face reconstruction from a single image have been proposed. Most studies focus on achieving robust, high-fidelity reconstruction, while speed and cost are ignored or left unreported. In practice, both matter for production use. Spending a long time reconstructing a 3D face model leads to a poor user experience, and reconstruction speed depends largely on the efficiency of the method. The cost of a complete 3D face reconstruction pipeline includes, but is not limited to, data collection, resource consumption, and deployment. A high-fidelity method that takes excessive time to run or relies on special equipment will struggle to meet production needs. Thus, a balance needs to be struck among these aspects.

Since a single 2D image contains severely limited effective features, recovering a detailed and highly realistic face model from it is difficult. Apart from expensive capture devices, studies have developed ways to perform accurate 3D face reconstruction from a single 2D image by adding prior knowledge. A traditional approach is to reconstruct a 3D face model by fitting a statistical model (e.g. the 3D morphable model (3DMM) [11]): a mean face model with linear representations of face shape and texture is fitted to a given image through optimization. A limitation of model fitting methods is the restricted representational power of the statistical model; they often fail to restore nonlinear facial details, making the reconstructed face look artificial.

In recent years, deep learning has become the preferred approach for adding prior knowledge. Modelling a 3D face mesh is accomplished by learning the mapping between the 2D image and the 3D face model. With the development of neural networks, learning-based methods now achieve accurate 3D face reconstruction. Nevertheless, 2D–3D paired face datasets are scarce, and collecting a large-scale detailed 3D face dataset is difficult for ordinary users. In addition, learning 3D face reconstruction with a deep neural network involves a great deal of iterative computation as well as processor and memory consumption. When dealing with a large amount of data, typical central processing unit (CPU) cores struggle to cope with such demands, which further restricts the application of 3D face reconstruction on mobile phones. Moreover, the stability and robustness of the methods should be guaranteed.

In this paper, we propose an efficient learning-based method for 3D face reconstruction from a single image that strikes a balance between quality, speed, and cost. Considering the cost of 3D data collection and preparation, we cast 3D face reconstruction as the regression of a small set of 3DMM parameters, without requiring accurate 3D data. To improve speed, we apply a lightweight network to extract features from images. Because a lightweight network is prone to loss of precision, we introduce an attention mechanism [12] and a graph convolution network (GCN) [13] into the regression. During training, three loss functions are used to measure the loss of the 3DMM parameters, the reconstructed 3D vertices, and the landmarks, and various combinations of these losses are explored to obtain the most efficient strategy. Furthermore, the performance of the proposed method is evaluated not only on a 3D face reconstruction benchmark but also on sparse face alignment.

To summarize, in this study, we concentrate on fast and robust 3D face reconstruction from a single image without accurate 3D training data. The main contributions are as follows.

  • We propose a lightweight network-based framework for 3D face reconstruction to address the problems of computation speed and graphics processing unit (GPU) dependency.

  • We propose a combination of an attention mechanism and a GCN for the regression of 3DMM parameters, which can improve the accuracy and robustness of the reconstructed model.

  • In our experiments, we validate the effectiveness of the proposed method and obtain an optimal result by comparing different loss function strategies. Compared with state-of-the-art methods, our method offers considerable benefits in speed and cost while maintaining competitive accuracy.

2 Related work

3D-from-2D face reconstruction is a long-standing topic in computer vision. Recovering a 3D face model from an image requires a large number of complex optimization calculations. To simplify the problem, a common way to constrain the solution space is to add prior knowledge. Relevant studies on single-image 3D face reconstruction fall mainly into two categories: statistical model fitting methods and learning-based methods.

2.1 Statistical model fitting methods

Before the advent of deep learning, prior knowledge was embedded in a statistical face model. Specifically, an initialized mean face model is computed from a large dataset of 3D facial scans, containing low-dimensional representations of the face shape and texture. The 3D face model is then fitted to an image through a series of optimization calculations so that the image generated by projecting the resulting 3D face model onto a 2D plane is as similar as possible to the input image. The most widely used models are based on the 3DMM [1, 11]. Blanz and Vetter [11] proposed the first morphable 3D face model in 1999, using principal component analysis (PCA) for dimensionality reduction. Subsequently, many derivatives of the 3DMM have appeared, such as the Basel Face Model (BFM) [2], a widespread model that can fit any 3D face and store its 3DMM parameters. The original BFM cannot represent facial expressions. A typical remedy is to add the expression basis from FaceWarehouse [14], as in the methods of [5] and [7]. The integrated BFM has 199-dimensional shape vectors, 199-dimensional texture vectors, and 29-dimensional expression vectors.

In general, the core of statistical model fitting methods is to find the 3DMM parameters that minimize the difference between the input image and the rendered image. Piotraschke and Blanz [15] reconstructed a 3D face from a set of face images by reconstructing each face individually and then combining them into a final shape based on the accuracy of each reconstructed part. The authors of [16] took two images of a person’s front and side views as input to build a deformable nonnegative matrix factorization (NMF) part-based 3D face model and used an automated iterative reconstruction method to obtain a high-fidelity 3D face model. In contrast, both Jiang et al. [17] and Liu et al. [18] proposed 3D face reconstruction methods based on a single image; the former used a bilinear face model and locally corrected deformation fields to reconstruct high-quality 3D face models, while the latter improved accuracy by updating contour landmarks and self-occluded landmarks. However, these methods have obvious shortcomings: fitting a 3DMM by minimizing a nonlinear error function requires expensive iterative optimization and tends to get stuck in local minima, so accuracy and authenticity cannot be guaranteed. Aldrian and Smith [19, 20] suggested a solution that used linear methods to recover shape and texture separately. In addition, Schönborn et al. [21] proposed a different model fitting method, treating 3DMM fitting as a probabilistic inference problem: they interpreted the 3DMM as a generative Bayesian model, used random forests as noisy detectors, and combined the two with a data-driven Markov chain Monte Carlo method (DDMCMC) based on the Metropolis–Hastings algorithm.

2.2 Learning-based methods

Thanks to the rapid development of deep learning, a 3D face model can be recovered from a single image by using a convolutional neural network (CNN) to encode prior knowledge in the weights of the trained network. Nonetheless, there are numerous challenges in reconstructing 3D face models from images.

The first challenge concerns training data, mainly in terms of volume and diversity. Many 3D face reconstruction methods perform poorly on images in which facial features are partially occluded, or self-occluded due to large poses, because fewer valid features are available. One solution is to expand the training data [5, 6, 22, 23]. Richardson et al. [22] randomly modified 3DMM parameters and rendered the resulting faces onto a 2D plane to generate synthetic 2D images. Zhu et al. [5] proposed synthesizing face images in large poses and directly regressing the 3DMM parameters from the input images. The other common solution is strong regularization of the shape [24]. To address occlusion and pose problems, Ruan et al. [25] proposed a self-aligned dual-face regression network combined with an attention mechanism.

The second challenge is improving accuracy and robustness. Deng et al. [26] achieved accurate 3D face reconstruction by introducing a differentiable renderer and designing a hybrid loss function for weakly supervised training. Sanyal et al. [27] used multiple images of the same subject and an image of a different subject to learn shape consistency and inconsistency during training, enhancing robustness. Recent studies [28,29,30] have demonstrated that GCNs [13, 31, 32] contribute to the recovery of facial details. Lin et al. [28] obtained high-fidelity 3D face models by utilizing a GCN to decode the features extracted from single face images and then produce detailed colours for the face mesh vertices. Lee and Lee [29] introduced an uncertainty-aware mesh encoder and a decoder that combined a GCN with a generative adversarial network (GAN) to address occlusion and blur. Based on GCNs, Gao et al. [30] proposed decoding the identity and expression features extracted from a CNN to recover the 3D face shape and albedo, respectively.

A further concern is the speed and cost of 3D face reconstruction. Most of the above methods regress 3DMM parameters with a deep convolutional neural network (DCNN). The trained network is usually large, with numerous parameters and computations, resulting in slow inference and high memory consumption. Inference is much slower on a CPU alone, and the computation may even be too heavy for a single CPU core to handle. To overcome these shortcomings, solutions include reducing the number of 3DMM parameters to regress [8] or using an image-to-image CNN instead of a regression network [6, 33]. Feng et al. [6] designed a novel method to record the 3D shape of a face in a UV position map, which enabled fast reconstruction. Koizumi and Smith [33] estimated the correspondence from an image to a face model with an image-to-image CNN, without ground truth or landmarks. Guo et al. [8] reduced the dimensions of the 3DMM parameters and performed fast and stable 3D face reconstruction based on a lightweight CNN.

Fig. 1 Framework of the proposed method

3 Proposed method

In this section, we introduce our work in detail. First, we describe the composition of 3DMM. We then detail each component of the proposed network architecture. Specifically, there are two modules: one for fast feature extraction based on a lightweight network and the other for parameter regression combined with an attention mechanism and a GCN. After that, we introduce three loss functions used in training. Our framework is shown in Fig. 1.

3.1 3DMM parameter regression

The face shape and texture of 3DMM can be defined as:

$$\begin{aligned} {\varvec{\mathrm{S}}}_{\mathrm{model}} = \bar{{\varvec{\mathrm{S}}}} + {\varvec{\mathrm{B}}}_{\mathrm{shp}}{\varvec{\alpha }}_{\mathrm{shp}}, \nonumber \\ {\varvec{\mathrm{T}}}_{\mathrm{model}} = \bar{{\varvec{\mathrm{T}}}} + {\varvec{\mathrm{B}}}_{\mathrm{tex}}{\varvec{\alpha }}_{\mathrm{tex}}, \end{aligned}$$
(1)

where \({\varvec{\mathrm{S}}}_{\mathrm{model}}\) and \({\varvec{\mathrm{T}}}_{\mathrm{model}}\) are the face shape vector and texture vector, respectively; \(\bar{{\varvec{\mathrm{S}}}}\) and \( \bar{{\varvec{\mathrm{T}}}} \) are the mean face shape and texture, respectively; \( {\varvec{\mathrm{B}}}_{\mathrm{shp}} \) and \( {\varvec{\mathrm{B}}}_{\mathrm{tex}} \) are the PCA bases of face shape and texture, respectively; and \( {\varvec{\alpha }}_{\mathrm{shp}} \) and \( {\varvec{\alpha }}_{\mathrm{tex}} \) represent the corresponding parameters.

Typically, a full 3DMM parameter regression still needs to estimate pose parameters, illumination parameters, and camera parameters, so that the output model can be projected onto a plane and compared for similarity with the input image. For the purpose of fast 3D face reconstruction, we remove some of the parameters and reduce the dimensions of the remaining parameters, referring to previous studies [5, 7, 8]. Therefore, we only learn the 3DMM parameters of shape, expression, and pose in the regression task. Here, the 3D shape in Eq. 1 is described as:

$$\begin{aligned} {\varvec{\mathrm{S}}}_{\mathrm{model}} = {\varvec{\bar{\mathrm{S}}}} + {\varvec{\mathrm{B}}}_{\mathrm{shp}}{\varvec{\alpha }}_{\mathrm{shp}} + {\varvec{\mathrm{B}}}_{\mathrm{exp}}{\varvec{\alpha }}_{\mathrm{exp}}, \end{aligned}$$
(2)

where the 3D expression base \( {\varvec{\mathrm{B}}}_{\mathrm{exp}} \) and corresponding parameters \( {\varvec{\alpha }}_{\mathrm{exp}} \) are added. Given a face image, the network estimates a vector with 62 dimensions \(({\varvec{\mathrm{T}}}, {\varvec{\alpha }}_{\mathrm{shp}}, {\varvec{\alpha }}_{\mathrm{exp}})\in {\mathbb {R}}^{62}\), where \({\varvec{\mathrm{T}}}\in {\mathbb {R}}^{3\times 4}\) is a transformation matrix representing the face pose. \({\varvec{\alpha }}_{\mathrm{shp}}\in {\mathbb {R}}^{40}\) and \({\varvec{\alpha }}_{\mathrm{exp}}\in {\mathbb {R}}^{10}\) are the 3DMM parameters of shape and expression, respectively. After regression, the 3D face can be computed with Eq. 2.
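
For concreteness, the following sketch shows how a 3D face can be assembled from the 62-dimensional regression output under the formulation above (Eq. 2 followed by the pose transform). The array names and shapes are illustrative assumptions; the actual bases come from the 3DMM in use (e.g. the BFM with the FaceWarehouse expression basis).

```python
import numpy as np

def reconstruct_vertices(T, alpha_shp, alpha_exp, S_mean, B_shp, B_exp):
    """Recover posed 3D vertices from the 62-dim regression output (Eq. 2).

    T         : (3, 4) pose transformation matrix
    alpha_shp : (40,) shape parameters
    alpha_exp : (10,) expression parameters
    S_mean    : (3N,) mean face shape of the 3DMM
    B_shp     : (3N, 40) PCA shape basis
    B_exp     : (3N, 10) expression basis
    """
    # Eq. 2: shape in the model coordinate frame
    S = S_mean + B_shp @ alpha_shp + B_exp @ alpha_exp
    S = S.reshape(-1, 3).T                             # (3, N) vertex coordinates
    # Apply the pose transform in homogeneous coordinates
    S_h = np.vstack([S, np.ones((1, S.shape[1]))])     # (4, N)
    return T @ S_h                                     # (3, N) posed vertices
```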

3.2 Network architecture

As shown in Fig. 1, we propose a fast parameter regression strategy based on a lightweight network, combined with an attention mechanism and a GCN. First, we employ a lightweight network, i.e. MobileNet [34], to extract features from images quickly. According to Eq. 2, constructing a 3D face model mainly depends on the estimation of shape parameters and expression parameters, which influence the final performance of the face model. As a result, we then separate the 3DMM parameters into two parts for regression. One is the parameter regression of shape and expression, where the attention mechanism and GCN are introduced to improve robustness and stability. The other is the regression of pose parameters, performed by a fully connected layer.

3.2.1 Fast feature extraction

Adopting a lightweight network enables fast and stable feature extraction. In this regard, we choose MobileNet [34] to extract features from images. Instead of standard convolutions, MobileNet introduces depthwise separable convolutions. When performing convolutions, a standard convolution kernel considers all channels in the corresponding image region simultaneously, which increases the computation multiplicatively. The depthwise separable convolution factorizes a convolution into a depthwise convolution and a pointwise convolution, decoupling channel correlation and spatial correlation. In this way, both the number of parameters and computational cost are greatly reduced, but accuracy is still guaranteed. The feature extraction network is a modification of MobileNet-V1 [34], where the last fully connected layer is replaced with two branches for the next regression step.
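
As a minimal illustration of the depthwise separable convolution described above, the following PyTorch block factorizes a standard 3×3 convolution into a depthwise step and a pointwise step. The layer arrangement follows the common MobileNet-V1 pattern and is a sketch, not the exact configuration of our network.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """MobileNet-V1 style block: depthwise 3x3 conv followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            # Depthwise: one 3x3 filter per input channel (groups=in_ch)
            nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            # Pointwise: 1x1 conv mixes information across channels
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```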

3.2.2 Enhanced attention to facial features

An attention mechanism is introduced with the aim of concentrating on the context-aware representation of facial features from the extracted feature map and suppressing other useless information, such as image backgrounds. With reference to [35], we generate attention masks \({\varvec{\mathrm{M}}}\) and the transformed feature map \({\varvec{\mathrm{X}}}\) from the extracted features. The attention masks \({\varvec{\mathrm{M}}}\) are treated as weights of different channels, which are then multiplied by the transformed feature map \({\varvec{\mathrm{X}}}\) to construct the final content-aware matrix \({\varvec{\mathrm{H_c}}}\). Specifically, the formula can be defined as:

$$\begin{aligned} {\varvec{\mathrm{H}}}^k_c = \sum \limits _{i=1}^{h}\sum \limits _{j=1}^{w}m^k_{ij}{\varvec{\mathrm{x}}}_{ij}, \end{aligned}$$
(3)

where \(m_{ij}^k\) denotes the kth weight of the attention mask, \({\varvec{\mathrm{x}}}_{ij}\) denotes the feature vector of the transformed feature map \({\varvec{\mathrm{X}}}\) at position \((i,j)\), and h and w represent the height and width of the input image, respectively.
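
A possible tensor-level implementation of Eq. 3 is sketched below. It assumes the attention masks and the transformed feature map are produced by two convolutional branches (as in [35]); the batched matrix product computes the weighted sum over all spatial positions for each mask channel.

```python
import torch

def attention_pooling(M, X):
    """Content-aware aggregation of Eq. 3.

    M : (B, K, H, W) attention masks, one per output channel k
    X : (B, C, H, W) transformed feature map
    Returns H_c of shape (B, K, C), where H_c[:, k] = sum_ij m^k_ij * x_ij.
    """
    B, K, H, W = M.shape
    C = X.shape[1]
    M_flat = M.reshape(B, K, H * W)                     # (B, K, HW)
    X_flat = X.reshape(B, C, H * W).transpose(1, 2)     # (B, HW, C)
    return torch.bmm(M_flat, X_flat)                    # (B, K, C)
```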

3.2.3 Graph convolutional network for robust parameter regression

Unlike CNNs, GCNs can perform convolution operations on non-Euclidean structured data. Accordingly, we introduce a static GCN and a dynamic GCN [35] to help restore unstructured details of the face model.

For the static GCN, any static graph convolutional layer can be defined as:

$$\begin{aligned} {\varvec{\mathrm{H}}}^{l+1} = \sigma ({\varvec{\mathrm{A}}}{\varvec{\mathrm{H}}}^l{\varvec{\mathrm{W}}}), \end{aligned}$$
(4)

where \({\varvec{\mathrm{H}}}^l\) and \({\varvec{\mathrm{W}}}\) are the input nodes and weight matrix of the lth layer, respectively, \({\varvec{\mathrm{A}}}\) is the adjacency matrix, and \(\sigma (\cdot )\) denotes the nonlinear activation function. Here, \({\varvec{\mathrm{H}}}^0={\varvec{\mathrm{H}}}_c\) is the input node matrix of the single-layer static GCN, which comes from Eq. 3. Thus, the formula of the static GCN is defined as:

$$\begin{aligned} {\varvec{\mathrm{H}}}_s = \sigma ({\varvec{\mathrm{A}}}_s{\varvec{\mathrm{H}}}_c{\varvec{\mathrm{W}}}_s), \end{aligned}$$
(5)

where \({\varvec{\mathrm{H}}}_s\) denotes the updated nodes, \({\varvec{\mathrm{A}}}_s\) and \({\varvec{\mathrm{W}}}_s\) are the adjacency matrix and the weight matrix of the static GCN, respectively, and the activation function \(\sigma (\cdot )\) is LeakyReLU. Multiplying the adjacency matrix \( {\varvec{\mathrm{A}}}_s \) by the features \( {\varvec{\mathrm{H}}}_c \) sums the features of each node's neighbouring nodes, so each node can use information from its neighbours to update its state.
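
A single static graph convolutional layer (Eq. 5) can be sketched in PyTorch as follows. Initializing the learnable adjacency matrix to the identity is an assumption made for illustration; in practice it may be initialized differently.

```python
import torch
import torch.nn as nn

class StaticGCNLayer(nn.Module):
    """Single static graph convolution (Eq. 5): H_s = LeakyReLU(A_s H_c W_s)."""
    def __init__(self, num_nodes, in_dim, out_dim):
        super().__init__()
        # Fixed (but learnable) adjacency shared by all inputs
        self.A = nn.Parameter(torch.eye(num_nodes))
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, H):                # H: (B, num_nodes, in_dim)
        # A @ (H W) aggregates each node's neighbours before the activation
        return self.act(self.A @ self.W(H))
```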

Compared with the static GCN, a significant characteristic of the dynamic GCN is the dynamic adjacency matrix update. Since the adjacency matrix \({\varvec{\mathrm{A}}}_s\) of the static GCN is fixed, it is unreasonable to use the same adjacency matrix for all inputs. The dynamic GCN can overcome this weakness by adaptively constructing the adjacency matrix \({\varvec{\mathrm{A}}}_d\) according to the input features. Intuitively, recalculating the adjacency matrix for each input can better spread information between similar structures and speed up the learning of local semantic information. Specifically, the dynamic GCN can be defined as:

$$\begin{aligned} {\varvec{\mathrm{H}}}_d = \sigma ({\varvec{\mathrm{A}}}_d{\varvec{\mathrm{H}}}_s{\varvec{\mathrm{W}}}_d), \end{aligned}$$
(6)

where \({\varvec{\mathrm{H}}}_d\) are the output 3DMM parameters, \({\varvec{\mathrm{W}}}_d\) is the state-update weight matrix of the dynamic GCN, and the adjacency matrix \({\varvec{\mathrm{A}}}_d\) of the dynamic GCN is defined as:

$$\begin{aligned} {\varvec{\mathrm{A}}}_d = \delta ({\varvec{\mathrm{W}}}_a{\varvec{\mathrm{H}}}^{\prime }_s), \end{aligned}$$
(7)

where \(\delta (\cdot )\) is the Sigmoid activation function, \({\varvec{\mathrm{W}}}_a\) is the weight matrix obtained by convolution, and \({\varvec{\mathrm{H}}}^{\prime }_s\) is constructed by concatenating \({\varvec{\mathrm{H}}}_s\) with its global expression.
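
The dynamic layer (Eqs. 6 and 7) can be sketched as below. Building the "global expression" by mean pooling over the nodes and predicting the adjacency with a 1×1 convolution are assumptions consistent with [35], not an exact reproduction of it.

```python
import torch
import torch.nn as nn

class DynamicGCNLayer(nn.Module):
    """Dynamic graph convolution (Eqs. 6-7): the adjacency A_d is predicted per input."""
    def __init__(self, num_nodes, dim):
        super().__init__()
        # 1x1 conv mapping the concatenated node features to an (N x N) adjacency
        self.conv_a = nn.Conv1d(2 * dim, num_nodes, kernel_size=1)
        self.W = nn.Linear(dim, dim, bias=False)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, H_s):                              # H_s: (B, N, dim)
        # Global expression: mean over nodes, broadcast and concatenated (assumption)
        g = H_s.mean(dim=1, keepdim=True).expand_as(H_s)
        H_cat = torch.cat([H_s, g], dim=-1)              # (B, N, 2*dim)
        # Eq. 7: A_d = sigmoid(W_a H'_s), one row of weights per node
        A_d = torch.sigmoid(self.conv_a(H_cat.transpose(1, 2)))  # (B, N, N)
        # Eq. 6: H_d = sigma(A_d H_s W_d)
        return self.act(A_d @ self.W(H_s))
```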

3.3 Loss functions

During training, we adopt three loss functions to handle the optimizer: fast weighted parameter distance cost (fWPDC) [8], vertex distance cost (VDC) [5], and landmark loss.

3.3.1 Fast weighted parameter distance cost

Generally, the WPDC [5] is used to penalize the distance between the predicted parameters and the ground truth, assigning a different weight to each parameter. The formula is defined as:

$$\begin{aligned} L_{\mathrm{wpdc}} = \Vert {\varvec{\mathrm{w}}} \cdot ({\varvec{\mathrm{p}}} - {\varvec{\mathrm{p}}}^{gt})\Vert ^2, \end{aligned}$$
(8)

where \({\varvec{\mathrm{w}}}=(w_1,w_2,\ldots ,w_{62})\) denotes the weight of each parameter, \({\varvec{\mathrm{p}}}=(p_1,p_2,\ldots ,p_{62})\) denotes the predicted 3DMM parameters, and \({\varvec{\mathrm{p}}}^{gt}=(p_1^{gt},p_2^{gt},\ldots ,p_{62}^{gt})\) is the ground truth. To simplify the calculation, fWPDC [8] splits the parameters into two parts: \( {\varvec{\mathrm{p}}}=[{\varvec{\mathrm{T}}},{\varvec{\alpha }}] \), where \( {\varvec{\mathrm{T}}} \) is the transformation matrix formed by the predicted pose parameters and \({\varvec{\alpha }} = [{\varvec{\alpha }}_{\mathrm{shp}}, {\varvec{\alpha }}_{\mathrm{exp}}]\) collects the predicted shape and expression parameters. Equation 8 then becomes:

$$\begin{aligned} L_{\mathrm{fwpdc}} = \Vert {\varvec{\mathrm{w}}}_{\mathrm{T}} \cdot ({\varvec{\mathrm{T}}} - {\varvec{\mathrm{T}}}^{gt})\Vert ^2 + \Vert {\varvec{\mathrm{w}}}_\alpha \cdot ({\varvec{\alpha }} - {\varvec{\alpha }}^{gt})\Vert ^2, \end{aligned}$$
(9)

where \( {\varvec{\mathrm{w}}}_{\mathrm{T}} \) and \( {\varvec{\mathrm{w}}}_\alpha \) are the weights of the corresponding parameters, and \({\varvec{\mathrm{T}}}^{gt}\) and \({\varvec{\alpha }}^{gt}\) refer to the ground truth of the corresponding parameters.
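
A direct reading of Eq. 9 as a training loss is sketched below; the per-parameter weights are assumed to be precomputed as in [8], and the tensor shapes are illustrative.

```python
import torch

def fwpdc_loss(T, T_gt, alpha, alpha_gt, w_T, w_alpha):
    """fWPDC (Eq. 9): weighted squared distances of the pose and shape/expression parts.

    T, T_gt         : (B, 12) flattened 3x4 pose matrices
    alpha, alpha_gt : (B, 50) shape + expression parameters
    w_T, w_alpha    : per-parameter weights (precomputed as in [8])
    """
    loss_pose = ((w_T * (T - T_gt)) ** 2).sum(dim=1)
    loss_shape = ((w_alpha * (alpha - alpha_gt)) ** 2).sum(dim=1)
    return (loss_pose + loss_shape).mean()
```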

3.3.2 Vertex distance cost

The VDC [5] minimizes the vertex distances between the reconstructed 3D face shapes and the ground truth. The vertices of a 3D face model are generated by the predicted 3DMM parameters \( {\varvec{\mathrm{p}}} \). The formula is defined as:

$$\begin{aligned} L_{\mathrm{vdc}} = \Vert {V_{3d}({\varvec{\mathrm{p}}}) - V_{3d}({\varvec{\mathrm{p}}}^{gt})}\Vert ^2, \end{aligned}$$
(10)

where \(V_{3d}(\cdot )\) denotes the reconstructed vertices, calculated as:

$$\begin{aligned} V_{3d}({\varvec{\mathrm{p}}}) = {\varvec{\mathrm{T}}} * \begin{bmatrix} {\varvec{\mathrm{S}}}_{\mathrm{model}} \\ 1 \end{bmatrix}, \end{aligned}$$
(11)

where \({\varvec{\mathrm{T}}}\) is the same as that in Eq. 9 and \( {\varvec{\mathrm{S}}}_{\mathrm{model}} \) is from Eq. 2.
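
Given a function implementing Eq. 11, the VDC of Eq. 10 reduces to a squared distance between two vertex sets, as in the sketch below; `reconstruct` is a hypothetical helper that maps the 62-dimensional parameters to posed vertices.

```python
def vdc_loss(p, p_gt, reconstruct):
    """VDC (Eq. 10): squared distance between vertices built from predicted and GT parameters.

    reconstruct(p) should implement Eq. 11 and return vertices of shape (B, 3, N).
    """
    V = reconstruct(p)
    V_gt = reconstruct(p_gt)
    # Sum over the coordinate dimension, average over batch and vertices
    return ((V - V_gt) ** 2).sum(dim=1).mean()
```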

3.3.3 Landmark loss

To improve the robustness of face reconstruction, we adopt sparse landmark loss to constrain the 3DMM parameters to better fit the input. We additionally detect 68 facial landmarks \(\{q_n^{gt}\}\) of each input image as ground truth. During training, we obtain 2D landmarks \(\{q_n\}\) by projecting the 3D landmark vertices of the reconstructed model onto images. Then, the loss is formulated as:

$$\begin{aligned} L_{\mathrm{lmk}} = \frac{1}{N}\sum \limits _{n=1}^N\Vert q_n - q^{gt}_n\Vert . \end{aligned}$$
(12)

Here, N is 68, and the loss is the mean Euclidean distance between corresponding landmarks.
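
The landmark loss can be sketched as below. We assume the 68 landmark vertices are indexed by a fixed list of mesh indices and that, after the pose transform of Eq. 11, the x and y coordinates of the posed vertices correspond to image coordinates; both are illustrative assumptions.

```python
import torch

def landmark_loss(verts_3d, lmk_idx, q_gt):
    """Sparse landmark loss (Eq. 12) over the 68 projected landmarks.

    verts_3d : (B, 3, N) posed vertices from Eq. 11 (x/y assumed to be image coordinates)
    lmk_idx  : (68,) indices of the landmark vertices in the mesh topology
    q_gt     : (B, 68, 2) detected ground-truth 2D landmarks
    """
    q = verts_3d[:, :2, :][:, :, lmk_idx].transpose(1, 2)   # (B, 68, 2)
    return (q - q_gt).norm(dim=-1).mean()
```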

4 Experiment

4.1 Datasets and evaluation metrics

In this study, we aim to produce fast and robust 3D face reconstruction from a single image without the use of accurate 3D training data. As a result, 300W-LP [5] is chosen as the training dataset. 300W-LP is composed of synthesized 3D faces with large poses derived from 300W [36], which includes the annotated faces in the wild (AFW) [37], labeled face parts in the wild (LFPW) [38], HELEN [39], IBUG [36], and extended Multi-Modal Verification for Teleservices and Security Applications (XM2VTS) [40] datasets. In practice, the training dataset we use consists of over 600,000 still images, extended by Guo et al. [8]. Since the extended dataset does not contain ground-truth facial landmarks, we adopt a face alignment network (FAN) [41] to extract 68 2D facial landmarks of each image as ground truth, removing the samples for which detection fails. In total, 626,088 images are used for training and 50,807 for validation.

To evaluate the performance of our method on 3D face reconstruction, we employ the not quite in the wild (NoW) [27] dataset. The NoW dataset contains 2,054 2D images of 100 subjects, with a 3D face scan for each subject. The images are categorized into four cases: neutral, expression, occlusion, and selfie. We follow the NoW benchmark [27] to evaluate the performance of 3D face reconstruction. Specifically, the benchmark calculates the scan-to-mesh distance between the ground truth scan and the reconstructed mesh. The median distance, mean distance, and standard deviation are then recorded, as well as a cumulative error plot for all distances.

Most face reconstruction methods only support yaw angles of less than 45\(^\circ \) or require all facial landmarks to be visible, and thus cannot align faces in extreme cases with poses up to 90\(^\circ \). To test the performance of our method on images with large poses, we evaluate sparse face alignment accuracy for small, medium, and large yaw angles (i.e. \(0^\circ \le \psi \le 30^\circ \), \(30^\circ < \psi \le 60^\circ \), and \(60^\circ < \psi \le 90^\circ \), respectively) using the normalized mean error (NME) normalized by bounding box size on AFLW2000-3D, following [5]. AFLW2000-3D consists of fitted 3D face models of the first 2,000 samples from the annotated facial landmarks in the wild (AFLW) [42] dataset and the corresponding 68 3D facial landmarks.
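
For reference, the NME used here can be computed as in the following sketch, where the normalization factor is taken as the square root of the ground-truth bounding-box area (a common convention consistent with [5]); the function name is illustrative.

```python
import numpy as np

def nme_bbox(pred_lmks, gt_lmks):
    """NME of 68 2D landmarks normalized by the bounding-box size.

    pred_lmks, gt_lmks : (68, 2) arrays of landmark coordinates.
    The normalization factor is sqrt(w * h) of the ground-truth bounding box.
    """
    w = gt_lmks[:, 0].max() - gt_lmks[:, 0].min()
    h = gt_lmks[:, 1].max() - gt_lmks[:, 1].min()
    d = np.sqrt(w * h)
    return np.linalg.norm(pred_lmks - gt_lmks, axis=1).mean() / d
```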

Additionally, we introduce the CelebFaces Attributes (CelebA) [43] dataset in the qualitative analysis. The CelebA dataset contains 202,599 face images of 10,177 celebrities with 40 attribute annotations, covering large poses, occlusion, blur, and background clutter. We group the images by certain attributes according to the experimental needs and then evaluate the quality of the 3D face models reconstructed from images with different attributes.

4.2 Implementation details

We perform experiments mainly based on PyTorch. For feature extraction, we utilize MobileNet-V1 [34]. As reported in [8], using VDC from scratch may yield a higher vertex error, whereas better results can be achieved by applying VDC after fWPDC or by combining VDC and fWPDC. Thus, we conduct separate experiments with three different combinations of loss functions. The first combination strategy is to train our network using fWPDC and landmark loss:

$$\begin{aligned} L_{1} = L_{\mathrm{fwpdc}} + w_{\mathrm{lmk}}L_{\mathrm{lmk}}, \end{aligned}$$
(13)

where \(w_{\mathrm{lmk}} \approx 10^{m_{\mathrm{fwpdc}}-m_{\mathrm{lmk}}}\) is the training weight used to balance the two losses, and \(m_{\mathrm{fwpdc}}\) and \(m_{\mathrm{lmk}}\) indicate the magnitude of each loss. The second is to calculate three loss functions simultaneously:

$$\begin{aligned} L_{2} = L_{\mathrm{fwpdc}} + w_{\mathrm{vdc}}L_{\mathrm{vdc}} + w_{\mathrm{lmk}}L_{\mathrm{lmk}}, \end{aligned}$$
(14)

where \(w_{\mathrm{vdc}} \approx 10^{m_{\mathrm{fwpdc}}-m_{\mathrm{vdc}}}\) plays the same role as \(w_{\mathrm{lmk}}\), and \(m_{\mathrm{vdc}}\) indicates the magnitude of the VDC. The last strategy is to divide the training into two stages, with different loss functions calculated in each stage. The first stage trains the network using Eq. 13. Once training converges, we switch the loss combination in the second stage to:

$$\begin{aligned} L_{3} = w_{\mathrm{vdc}\prime }L_{\mathrm{vdc}} + L_{\mathrm{lmk}}, \end{aligned}$$
(15)

where \(w_{\mathrm{vdc}\prime } \approx 10^{m_{\mathrm{lmk}}-m_{\mathrm{vdc}}}\).
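
The magnitude-balancing weights above can be set, for example, by comparing the orders of magnitude of the raw loss values, as in the sketch below; the example values in the comment are hypothetical.

```python
import math

def magnitude_weight(loss_ref, loss_other):
    """Balance two losses by order of magnitude, e.g. w_lmk ~ 10^(m_fwpdc - m_lmk).

    loss_ref and loss_other are representative scalar loss values observed early in
    training (an assumption; the paper only states that the weights match the magnitudes).
    """
    m_ref = math.floor(math.log10(abs(loss_ref) + 1e-12))
    m_other = math.floor(math.log10(abs(loss_other) + 1e-12))
    return 10.0 ** (m_ref - m_other)

# Example (hypothetical values): fWPDC ~ 0.5, landmark loss ~ 50
# w_lmk = magnitude_weight(0.5, 50.0)  ->  0.01, so L1 = L_fwpdc + 0.01 * L_lmk
```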

4.3 Ablation study

We conduct comparative experiments with different schemes to verify the effectiveness of the proposed method. In our experiments, we combine the modified MobileNet, attention mechanism, and GCN for training and search for the best result with different training strategies. The original MobileNet-V1 is employed as a baseline.

4.3.1 Evaluation on the NoW validation set

We perform ablation experiments on the NoW [27] validation set, and the results are shown in Table 1. The combined network we propose is clearly effective. As shown in the first and second rows of Table 1, performance improves when the attention mechanism and GCN are applied: the median error, mean error, and standard deviation are reduced by 0.07 mm, 0.08 mm, and 0.07 mm, respectively, on the NoW validation set. Across the three training strategies, performance generally improves as additional loss functions are added. Adding the landmark loss alone (i.e. using \( L_1 \)) brings no significant improvement. However, training the network with all three loss functions together (i.e. using \( L_2 \)) reduces the errors, with the median error dropping by 0.04 mm and the mean error by 0.03 mm compared with using \( L_1 \). Finally, we achieve the best results with the third strategy, namely training with \( L_1 \) and fine-tuning with \( L_3 \). Compared with using fWPDC only, the median error is reduced by 0.06 mm, the mean error by 0.06 mm, and the standard deviation by 0.02 mm.

Table 1 Reconstruction error on the NoW validation set
Table 2 NME (\(\%\)) of 68 landmarks with different yaw angles \(\psi \) on AFLW2000-3D
Fig. 2 Comparison for sparse alignment on AFLW2000-3D. Partial face regions are magnified for better visual comparison

Fig. 3 Visualization of the learned feature maps. The images in the first row are the input images, the images in the second row are heatmaps generated by MobileNet-V1, and the images in the third row are heatmaps generated by the attention module of our method

4.3.2 Evaluation on AFLW2000-3D

Similar results are obtained in the evaluation of sparse face alignment on AFLW2000-3D [5]; the results can be seen in Table 2. Compared with the baseline, the combined network we propose performs better, with a mean NME reduction of 0.20\(\%\) when using the same loss function. However, unlike the 3D face reconstruction evaluation above, there is a significant reduction in error with the addition of landmark loss (i.e. using \(L_1\)), which decreases the mean NME by 0.41\(\%\). The network trained with \(L_2\) performs best at yaw angles from 0\(^\circ \) to 30\(^\circ \) but does not perform as well as the network trained with \(L_1\) and fine-tuned with \(L_3\) when yaw angles exceed 30\(^\circ \); the two differ in mean NME by 0.02\(\%\).

To show the effect of the attention mechanism and GCN more clearly, we visualize the sparse alignment results of the model trained with and without them in Fig. 2. In the demonstrated cases, the model with the attention mechanism and GCN obtains more accurate alignment results. In the first example, the facial contour landmarks fit the input face much more precisely. In the second example, the landmarks in the nose region are located more correctly. In the third example, the openness of the mouth is labelled more appropriately. In the last example, where the face has a relatively large yaw angle, the model with the added modules performs better.

4.3.3 Visualization of feature maps

To demonstrate the utility of the attention mechanism in capturing facial features, we visualize images from AFLW2000-3D [5] together with their learned feature maps based on class activation mapping (CAM) [44]. Some samples are shown in Fig. 3. The heatmaps are generated from MobileNet-V1 and from the attention module of the proposed method, trained with the same loss functions. We note that the attention mechanism we employ is able to focus on the face region. Comparing the third, fourth, fifth, and sixth columns of Fig. 3, our method is robust to images with different poses; even for images with extremely large poses, the proposed attention module still focuses on the crucial parts. Moreover, as shown in the last two columns of Fig. 3, our method improves attention to non-occluded regions.

4.4 Comparison with prior art

4.4.1 Qualitative analysis

For qualitative evaluation, we compare the 3D shapes reconstructed by different methods from images with different properties on AFLW2000-3D [5] and CelebA [43]. Specifically, we compare our method with PRNet [6] and 3DDFA-V2 [8], which, like our method, proposed fast 3D face reconstruction without accurate 3D training data. Some of the results are shown in Fig. 4.

First, we reconstruct images with different yaw angles on AFLW2000-3D [5], classifying the yaw angles into small, medium, and large. As shown in the first four rows of Fig. 4, both the proposed method and the other methods obtain reasonable results. In comparison, our method shows better robustness on images with large yaw angles. When the yaw angle increases to 60\(^\circ \) or more (the second column of the third and fourth rows of Fig. 4), the reconstructed 3D face shape of PRNet [6] is slightly distorted, with asymmetry between the left and right sides of the face. In the fourth row, the mouth in the input image is tightly closed, but it remains open in the reconstruction of 3DDFA-V2 [8].

Fig. 4 Visual comparison with different methods on AFLW2000-3D and CelebA. a Input images. b PRNet [6]. c 3DDFA-V2 [8]. d Our method

We then conduct experiments on images with different attributes from CelebA [43], and the results are shown in the last five rows of Fig. 4. Based on the annotated attributes on CelebA, we group the images by the following cases: blur, occlusion (coverage by sunglasses or bangs), insufficient light, and uneven light. For the case of blurry images, as seen in the fifth row of Fig. 4, the proposed method is not affected by the quality of images. Compared to 3DDFA-V2 [8] (the third result in the fifth row), our reconstructed 3D face shapes fit the input images better, especially the chin part. For images with facial occlusions, such as those with sunglasses or bangs, our method allows for reasonable reconstruction of the occluded parts, as shown in the last column of the sixth and seventh rows of Fig. 4. The performance of 3D-from-2D face reconstruction can also be easily affected when the input image is insufficiently or unevenly illuminated. However, it is clear that our method does not suffer from light problems. As shown in the last two rows of Fig. 4, the reconstructed 3D face shapes of the proposed method (the last column) achieve better results than both methods (the second and third columns).

In particular, our method can maintain a correct face shape even if the input image is in poor condition. In contrast to PRNet [6], our method reconstructs a more detailed and clearer contour of the 3D face shape, especially the area in and around the eyes and mouth. In comparison with the two state-of-the-art methods, our method shows better robustness in extreme cases.

Fig. 5 More comparison of our method with 3DDFA-V2 [8] on CelebA. The facial regions around the mouth are magnified to show the results more distinctly. The results are shown at transparency \(\alpha = 0.5\) and \(\alpha = 0.8\)

For a further comparison with 3DDFA-V2 [8], we conduct a more detailed qualitative analysis, as shown in Fig. 5. We visualize the reconstructed 3D face models under the same conditions as 3DDFA-V2 [8] on CelebA [43]. Since subtle changes are difficult to visualize, we focus on the mouth region, where the differences are most apparent. In the first example, our method portrays the expression of the mouth much more faithfully, and the curvature of the cheek is also better captured. In the second and third examples, the mouths reconstructed by our method are depicted more appropriately, in contrast to the open mouths produced by 3DDFA-V2 [8]. Overall, we can conclude that our method provides more suitable facial details than 3DDFA-V2 [8].

4.4.2 Quantitative analysis

For quantitative evaluation, we perform comparative experiments of 3D face reconstruction on the NoW [27] dataset and sparse face alignment on AFLW2000-3D [5].

Evaluation of 3D face reconstruction We compare our method with the state of the art on the NoW [27] benchmark. The results of the comparison are shown in Table 3, and the cumulative error plot in Fig. 6. As shown in Table 3, our method achieves smaller reconstruction errors than 3DMM-CNN [45], UMDFA [33], and PRNet [6], reducing the median error by 0.55 mm, 0.23 mm, and 0.21 mm, respectively. Figure 6 also shows that the cumulative error of our method is lower than those of these three methods, suggesting that reconstructing 3D face models with image-to-image methods is less effective. The performance of our reconstructed 3D shapes is slightly worse than that of Dib et al. [46], 3DDFA-V2 [8], Deng et al. [26], and RingNet [27], where the median error differs by only approximately 0.03 mm to 0.08 mm, and the cumulative error is slightly higher but close to these comparative approaches. The main reason is that most of them regress more 3DMM parameters to obtain more accurate results, which entails more time and memory consumption. Dib et al. [46], Deng et al. [26], and RingNet [27] all employed a deep residual network [47] with 50 or more layers; such networks run slowly or even fail to run on typical CPUs. In contrast, our method is based on a lightweight CNN, which supports fast reconstruction on a single CPU core. Like 3DDFA-V2 [8], we adopt MobileNet to increase reconstruction speed; however, the two methods target different training data. We consider the case where only single images are available for training, whereas 3DDFA-V2 [8] proposed a 3D-aided short-video-synthesis strategy to make the method suitable for video data. Our training setting is therefore more challenging.

Table 3 Reconstruction error of different methods on the NoW dataset and the dimensions of the 3DMM parameters regressed by each method
Fig. 6 Cumulative error curves of different methods on the NoW dataset

Evaluation of sparse face alignment We evaluate the performance of different methods in sparse face alignment by comparing the NME of 68 landmarks on AFLW2000-3D [5]; the results are shown in Table 4. Our method achieves relatively good face alignment performance on images with large poses. Overall, the mean NME of our method is smaller than those of most methods, except for 3DDFA-V2 [8]. As shown in the last three rows of Table 4, compared with 3DDFA-V2 [8] trained without the short-video-synthesis strategy, our method obtains lower errors when yaw angles exceed 30\(^\circ \) (3.46\(\%\) vs. 3.49\(\%\) and 4.49\(\%\) vs. 4.53\(\%\)). When compared with the full 3DDFA-V2 [8], our method differs by 0.13\(\%\) at yaw angles from 0\(^\circ \) to 30\(^\circ \) but by only 0.04\(\%\) and 0.01\(\%\) at yaw angles from 30\(^\circ \) to 60\(^\circ \) and 60\(^\circ \) to 90\(^\circ \), respectively. This indicates that the proposed method is more robust when the training data are more challenging, as we train the network on single images only.

Table 4 NME (\(\%\)) comparison of images with different yaw angles \(\psi \) on AFLW2000-3D

User study To measure user preference for the reconstruction results generated by our method and 3DDFA-V2 [8], we conduct a user study. We first divide AFLW2000-3D [5] into three categories based on yaw angle: small (\(0^\circ \le \psi \le 30^\circ \)), medium (\(30^\circ < \psi \le 60^\circ \)), and large (\(60^\circ < \psi \le 90^\circ \)). An equal number of images are randomly selected from each category and reconstructed with our method and 3DDFA-V2 [8] separately. To facilitate presentation and comparison, we then re-project the reconstructed models onto the corresponding input images. In total, 120 pairs of generated images are created and evenly distributed into 10 groups. We recruit 135 participants, each choosing a group at random and answering 12 questions. For each question, participants decide which generated image matches the input image more closely, or indicate that it is difficult to choose between them. The results show that, on average, our method is preferred over 3DDFA-V2 [8] (39.56\(\%\) vs. 35.58\(\%\)). Of the remaining responses, 24.37\(\%\) judge both methods equally effective, while 0.49\(\%\) judge neither similar. Participants generally agree that our method reconstructs the mouth region better. However, in some examples, although our method fits the facial features of the input image better, the reconstruction of the facial contours is not as good as that of 3DDFA-V2 [8].

4.4.3 Model complexity and running time

Since our method is based on a lightweight network, our network has only 7.6 M parameters, with an input image of size 120 \(\times \) 120 and a 62-dimensional output. This is significantly fewer than commonly used networks such as ResNet-50, which has approximately 23.6 M parameters for the same input and output sizes. It is worth mentioning that our network requires 298.0 M multiply–accumulate operations (MACs), compared with 6190 M for PRNet [6]. Table 5 shows the model sizes of different methods. The model size of our method is approximately 29 MB, much smaller than most other methods. The model of 3DDFA-V2 [8] is smaller than ours because it uses only MobileNet-V1, whereas we add an attention mechanism and a GCN. Nevertheless, our method still enables fast reconstruction despite the increased number of parameters. As shown in Table 5, we compare the inference speed of our method and 3DDFA-V2 [8] on a personal laptop with an NVIDIA GeForce GTX 1650 Ti GPU and an AMD Ryzen 7 4800H with Radeon Graphics CPU @ 2.90 GHz. Our method takes 4.1 ms on the GPU or 20.7 ms on the CPU to regress the 3DMM parameters, only 0.8 ms more on the GPU and 6.3 ms more on the CPU than 3DDFA-V2 [8].
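
Parameter counts and per-image inference times of the kind reported above can be measured with a simple routine such as the following sketch (MAC counting would require a dedicated profiler); the timing protocol shown here is an assumption, not the exact one used in our experiments.

```python
import time
import torch

def profile(model, input_size=(1, 3, 120, 120), runs=100, device="cpu"):
    """Count parameters and measure the average inference time of a model."""
    model = model.to(device).eval()
    n_params = sum(p.numel() for p in model.parameters())
    x = torch.randn(*input_size, device=device)
    with torch.no_grad():
        for _ in range(10):                  # warm-up iterations
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed = (time.perf_counter() - start) / runs
    return n_params / 1e6, elapsed * 1e3     # (parameters in M, ms per image)
```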

Table 5 Model size and running time of different methods

5 Conclusion

In this study, we propose a learning-based method that achieves fast and robust 3D face reconstruction from a single image. We combine a lightweight network, an attention mechanism, and a GCN, and demonstrate the performance of this combination experimentally. The method not only improves reconstruction speed and reduces memory consumption but also preserves the robustness of the reconstructed 3D models. At inference time, only a single image is required as input; no landmarks or other information are needed.

However, our method has some limitations. Since we discard a large number of 3DMM parameters, it improves reconstruction speed at the expense of some accuracy in the reconstructed model; expanding the parameter dimensions would yield more refined 3D face models. Furthermore, facial albedo and image illumination have not been considered in our work. These are the steps we will take next. In the future, we will focus on reconstructing more detailed and realistic face models quickly and on putting them into actual production, e.g. as an auxiliary tool for 3D animation or face recognition.