Introduction

Facial animation generation is a highly challenging research problem with applications in digital humans, computer games, movies and immersive VR telepresence [32]. In these applications, facial animations must exhibit a high level of naturalness and plausibility, ensuring intelligibility on par with actual human speakers. The human visual system is evolutionarily attuned to perceiving nuanced facial movements and expressions; consequently, animations lacking natural expressions or synchronization with lip movements can be distressing for viewers.

In recent years, deep learning methods have made significant progress in various application areas [26]. Speech-driven methods can now generate realistic lip-synced 3D facial animations by training on 4D facial audiovisual datasets. However, speech-driven 3D facial animation neglects the generation of head pose and facial expressions because of the weak correlation between speech and facial expressions as well as head pose. This limitation is further exacerbated by the scarcity of 4D facial datasets, leading to static upper-face animation [10, 12, 25]. Although some methods [32] can generate random eye blinks or eyebrow motion when trained on high-precision datasets that are not publicly available, or transfer three common emotions (happiness, anger, surprise) to 3D faces through an emotion transfer network [38], they lack deeper control over expression and pose. Existing studies have shown that it is difficult to obtain lip synchronization, natural expressions and controllable poses in 3D facial animation from the speech modality alone.

The task of video-driven 3D facial animation is similar to that of single-image 3D face reconstruction, aiming to convert 2D images into 3D models. In recent years, significant progress has been made in single-image 3D face reconstruction. The most advanced method, DECA [13], predicts camera, lighting, shape, texture, expression and pose parameters from a single photo fairly accurately. However, when 3D face reconstruction is applied directly to video-driven tasks, the reconstructed mouth shape and motion often exhibit serious perceptual artifacts, making it difficult to capture lip movements that perceptually match the speech. To address these limitations, EMOCA [11] added an expression network on top of DECA to regress expression parameters and, using a well-trained emotion recognition model, computed an emotion consistency loss (also known as an emotion perceptual loss) on the predictions of the expression network, which yields better expressions. EMOCA, however, is trained on single-image data; when applied to video to generate 3D facial animations, it still suffers from non-smooth animation and mismatches between mouth shape and speech. The state-of-the-art method SPECTRE [14] improved on DECA and EMOCA by replacing EMOCA's expression network with a perceptual encoding network that predicts both facial expression and jaw pose parameters. In addition, it uses a well-trained state-of-the-art lip-reading model to compute a lip-reading consistency loss (also known as a speech perceptual loss) on the predictions of the perceptual encoding network. With this approach, the reconstructed face has more accurate mouth movements and, combined with the corresponding speech, produces more realistic results. Video-based (performance-driven) 3D facial animation methods can conveniently and accurately generate expressions and poses from the visual modality, but compared with speech-driven 3D facial animation trained on 4D facial audiovisual datasets, they still have an inherent disadvantage in perceptual lip-shape accuracy.

Therefore, it is of great research significance to combine speech-driven 3D facial lip animation trained on 4D facial audiovisual datasets with video-driven 3D facial animation trained on 2D videos, retaining the advantages of both, in order to obtain lip-synced, naturally expressive and pose-controllable 3D facial animation. This paper proposes a dual-modal generation method that uses speech and video information to generate more natural and vivid 3D facial animation, focusing on the mouth area as well as the expressions and poses that are only weakly correlated with speech. The main contributions of this paper are as follows:

  (a) Building an additional expression and pose network on top of a speech-driven network trained on a 4D face dataset. The speech-driven network extracts speech features and generates the basic lip animation, while the expression and pose network extracts temporal visual features to regress facial expression and head pose parameters. By fusing the speech and visual features, the chin pose parameters associated with lip movements are obtained; these parameters are then used to fine-tune the lip animation generated by the speech-driven network.

  (b) Designing a new video frame preprocessing algorithm that uniformly crops all frames in a video. This makes it easier for the expression and pose network to learn the temporal information between frames and the transformation of the face against the same background. The effectiveness of the preprocessing algorithm is verified through experiments, which show improved precision of the network model predictions.

  (c) Designing a "head pose consistency" loss to guide the network to reconstruct more accurate head poses and to reduce prediction errors in extreme head pose situations.

  (d) Conducting extensive objective and subjective (user study) evaluations to demonstrate the superiority of our method. The effectiveness of each component is verified through ablation experiments.

The rest of this paper proceeds as follows: in “Related work”, we provide a comprehensive review of previous studies related to speech-driven, video-driven and speech-video driven 3D facial animation. In “Design of algorithm”, we present the details of the proposed framework and provide a detailed illustration of its implementation. In “Experiments”, we demonstrate the performance of our new method, describe the experimental settings used, and present our experimental results. In “Ablation study”, we present the ablation experimental setting and our results. The conclusion and future work are given in “Conclusions”.

Related work

The generation of 3D facial animations has always been a challenging problem that has garnered significant attention in the fields of computer graphics and computer vision, leading to extensive research. Based on the different driving methods, facial animation can be categorized into text-driven [42], speech-driven [10, 12, 25, 32, 43], video-driven [11, 13, 14] and speech-video combined driven [8, 21] animations.

In the following, we will introduce the methods for generating 3D facial animations driven by speech, video and both speech and video that are most relevant to this paper.

Speech-driven 3D facial animation

In the fields of computer graphics and computer vision, speech-driven 3D facial animation aims to drive a 3D facial model to generate lip animation matching the speech input. In recent years, deep-learning-based speech-driven 3D facial animation has been extensively researched. For example, Richard et al. [31] used a fully speech-driven method to achieve real-time and realistic facial animation, but it is personalized and relies on hours of training data from a single subject. Taylor et al. [35] proposed a sliding-window method that takes phoneme sequences transcribed from audio as input, and used retargeting techniques to transfer the output to other animation platforms. Karras et al. [22] designed an end-to-end convolutional neural network to encode speech and used a latent code to resolve ambiguity in facial expression changes. However, this model has low-fidelity lip synchronization and facial expressions, and cannot generalize to new characters. Zhou et al. [43] adopted a three-stage network combining phoneme clusters, facial landmarks and speech features to predict viseme animation curves.

Tian et al. [36] proposed a method based on deep bidirectional LSTM networks and attention mechanisms to map input speech features to cartoon facial animation parameters. However, their mapping from speech to the face may not preserve the identity and personality of the target speaker, especially when dealing with new speakers or sentences.

Cudeiro et al. [10] created the 4D face dataset VOCASET and proposed a speaker-independent 3D facial animation method called VOCA, which extracts speech features using the DeepSpeech network [15]. Another line of work uses Baum-Welch HMM inversion instead of the commonly used Viterbi decoding, resulting in more accurate animation control; however, an HMM allows only a single hidden state to occupy each time range, so many states are needed to model multimodal signals, and the complexity of cross-modal dynamics is not captured. A further approach builds a model for each phoneme and enhances the quality of the generated facial animation by integrating both speech and video information, but it requires scanning the 3D face model to create a mapping from phonemes to blend shapes. Hussen et al. [21] proposed a neural-network-based method that uses audio-visual data to drive 3D facial animation. The network extracts audio embeddings from speech spectrogram features and visual embeddings from facial images; after fusing the two, speech-related facial controls are regressed through affine layers, while non-speech facial controls and head pose are inferred from the visual embeddings alone. The drawback of this method is that it considers the temporal information of speech but ignores the temporal information of the video, overlooking its rich dynamic information.

Design of algorithm

Fig. 1 Preprocessing workflow. The video preprocessing pipeline includes face landmark detection, face cropping and normalization

Modeling

As discussed above, our goal is to generate lip-synced, naturally expressive and pose-controllable 3D facial animation. Let A be raw audio data and \({\textbf{F}}_{1:T} = ({\textbf{f}}_1,\ldots , {\textbf{f}}_T)\) be video data, where T is the number of video frames. A pre-trained speech-driven network takes the raw audio A as input and outputs predicted 3D facial lip offsets \(\tilde{{\textbf{Y}}}_{1:T^{\prime }}=(\tilde{{\textbf{y}}}_1, \ldots , \tilde{{\textbf{y}}}_{T^{\prime }})\), and the speech encoder in the network outputs speech features \({\textbf{W}}_{1:T^{\prime }} =({\textbf{w}}_1,\ldots , {\textbf{w}}_{T^{\prime }})\), where \(T^{\prime }\) is the number of frames output by the speech-driven network. In this paper, we construct an expression-pose module whose backbone network extracts visual features \(\mathbf {V^{\prime }}_{1:T} = (\mathbf {v^{\prime }}_1,\ldots , \mathbf {v^{\prime }}_T)\) from the video data \({\textbf{F}}_{1:T}\), and a temporal convolutional network further extracts temporal visual features \({\textbf{V}}_{1:T} = ({\textbf{v}}_1,\ldots , {\textbf{v}}_T)\). The temporal visual features \({\textbf{V}}_{1:T}\) are used to regress the purely visually related expression parameters \(\psi \) and head pose parameters \(\theta _p\). By aligning and fusing the temporal visual features \({\textbf{V}}_{1:T}\) with the speech features \({\textbf{W}}_{1:T^{\prime }}\), the chin (jaw) pose parameters \(\theta _j\) related to lip shape are regressed. Here, \(\theta _j\) is treated as a fine-tuning of the lip animation \(\tilde{{\textbf{Y}}}_{1:T^{\prime }}\) generated by the speech-driven network and is constrained by L2 regularization. The expression parameters \(\psi \), head pose parameters \(\theta _p\) and chin pose parameters \(\theta _j\) are decoded by the FLAME [24] face model to obtain 3D facial vertex offsets \(\overline{{\textbf{Y}}}_{1:T}=(\overline{{\textbf{y}}}_1, \ldots , \overline{{\textbf{y}}}_{T})\). The resampled \(\tilde{{\textbf{Y}}}_{1:T^{\prime }}\) and \(\overline{{\textbf{Y}}}_{1:T}\) are aligned and added to the 3D face template to obtain lip-synchronized, pose-controllable and naturally expressive 3D facial animation.

Preprocessing

To obtain inputs suitable for the expression pose network, it is necessary to preprocess the video frames. The preprocessing workflow, as shown in Fig. 1, includes 2D facial landmark detection, face cropping and normalization.

The preprocessing method for video frames is shown in Algorithm 1. First, 68 facial landmarks are detected for all video frames using the open-source FAN [5] face alignment network. Second, face cropping is performed on all frames using an improved bounding-box cropping algorithm that considers the entire video sequence. By computing the maximal cropping range from the extreme positions of the facial landmarks across all frames, a single cropping box covers all faces in the sequence, making it easier for the temporal convolutional network to learn the temporal information between frames and the transformation of the face against the same background. We validate the effectiveness of this preprocessing algorithm in the subsequent experiments, which show improved prediction accuracy of the network model. To better cover the entire head region, random scaling is applied to the original face cropping box: the scaling factor is set between 1.4 and 1.6 times the original size, and randomizing it improves the generalization ability of the network. Specifically, the center coordinate and original size of the cropping box are computed, the new size is obtained by multiplying the original size by the scaling factor, a new cropping box is derived from the center coordinate and new size, and the face is cropped to a fixed size of \(s \times s\). Finally, the cropped face images are normalized by converting pixel values from 0–255 to 0–1 to accelerate the convergence of the neural network.

Algorithm 1 Video frame preprocessing algorithm
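The cropping step of Algorithm 1 can be summarized with the following sketch, assuming the landmarks have already been extracted with a face alignment network such as FAN; the function and parameter names (`uniform_crop`, `scale_range`, `out_size`) are illustrative rather than the authors' released code.

```python
import numpy as np
import cv2


def uniform_crop(frames, landmarks, out_size=224, scale_range=(1.4, 1.6)):
    """Crop every frame of one video with a single bounding box.

    frames:    list of H x W x 3 uint8 images
    landmarks: list of (68, 2) arrays, one per frame
    """
    pts = np.concatenate(landmarks, axis=0)      # landmarks of all frames together
    left, top = pts.min(axis=0)                  # tightest box covering every face
    right, bottom = pts.max(axis=0)

    # Enlarge the box by a random factor in [1.4, 1.6] so the whole head is covered;
    # the randomness is meant to improve generalization.
    size = max(right - left, bottom - top) * np.random.uniform(*scale_range)
    cx, cy = (left + right) / 2.0, (top + bottom) / 2.0
    x0, y0 = int(cx - size / 2), int(cy - size / 2)
    x1, y1 = int(cx + size / 2), int(cy + size / 2)

    cropped = []
    for f in frames:
        crop = f[max(y0, 0):y1, max(x0, 0):x1]
        crop = cv2.resize(crop, (out_size, out_size))      # fixed s x s output
        cropped.append(crop.astype(np.float32) / 255.0)    # normalize 0-255 -> 0-1
    return np.stack(cropped)                               # (T, s, s, 3)
```

Because a single box is computed from the landmark extrema of all frames, consecutive frames share the same background geometry, which is what allows the temporal convolution to focus on facial motion rather than camera-box jitter.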

Fig. 2 System overview. The network model consists of a fixed speech-driven network and a face reconstruction network, with an additional expression-pose network built to better predict the expression and pose parameters

Network architecture

The system overview of the proposed speech-video jointly driven 3D facial animation generation method is shown in Fig. 2. The method uses a pre-trained DECA [13] network and a speech-driven network as the foundation for training our own network, allowing us to take advantage of the existing capabilities of DECA while integrating our own enhancements. With the speech-driven network and face reconstruction network fixed, an additional expression-pose network is built and trained to better predict the expression and pose parameters of the 3D face model. The speech-driven network takes audio as input and outputs the lip offsets of the 3D face, which serve as the basis of the 3D facial animation. The speech features generated by the encoder of this network are fed into the expression-pose network for feature fusion. The expression-pose network uses the fused dual-modal features to predict expression and pose parameters, adding expressions and poses to the basic animation generated by the speech-driven network and fine-tuning the lip animation. During training, the shape parameters predicted by the 3D face reconstruction network are combined with the expression and pose parameters predicted by the expression-pose network, reconstructed into a 3D face using the FLAME face decoder, and geometrically constrained using facial landmarks. Finally, the predicted 3D face model sequence is rendered into a 2D video via differentiable rendering, using the texture and camera transformation parameters predicted by the 3D face reconstruction network. It should be pointed out that the overall framework depends on other pre-trained network models, and their accuracy affects the accuracy of the expression-pose network.

Inspired by EMOCA [11] and SPECTRE [14], we introduce pre-trained emotion recognition, pose estimation and lip-reading networks to compute the emotion consistency loss, pose consistency loss and lip-reading consistency loss (also known as perceptual losses) of the expression-pose network. The emotion recognition network is the pre-trained model provided by EMOCA, the pose estimation network is the pre-trained model provided by Hempel et al. [20], and the lip-reading network is the pre-trained model provided by Ma et al. [28]. By separately feeding the original video sequence and the rendered video of the predicted 3D facial animation into these pre-trained perceptual models, corresponding feature vectors are obtained. In theory, the rendered video and the original video should be consistent in expression, pose and lip shape. Therefore, by minimizing the distance between the feature vectors of the rendered video and the original video, we can optimize the output of the expression-pose network.

Fig. 3 The network architecture of our model. The model takes both speech and video data as input. The fused speech and video features are used to regress the chin pose parameters related to lip shape, while the visual features alone are used to regress the expression and head pose parameters

The detailed structure of the network model is shown in Fig. 3. The input consists of the speech data A and the video data \({\textbf{F}}_{1:T} = ({\textbf{f}}_1,\ldots , {\textbf{f}}_T)\), where T is the number of video frames. The raw speech data A is fed into the speech-driven network, whose encoder outputs the speech features \({\textbf{W}}_{1:T^{\prime }} =({\textbf{w}}_1,\ldots , {\textbf{w}}_{T^{\prime }})\). The predicted 3D face vertex offsets are \(\tilde{{\textbf{Y}}}_{1:T^{\prime }}=(\tilde{{\textbf{y}}}_1, \ldots , \tilde{{\textbf{y}}}_{T^{\prime }})\), where \(T^{\prime }\) is the number of frames output by the speech-driven network. Note that these vertex offsets are related only to lip movements, i.e., they are independent of expression and head pose. To align with the video frames, we resample \({\textbf{W}}_{1:T^{\prime }}\) using linear interpolation to obtain \({\textbf{W}}_{1:T}=({\textbf{w}}_1,\ldots , {\textbf{w}}_T)\), and resample \(\tilde{{\textbf{Y}}}_{1:T^{\prime }}\) to obtain \(\tilde{{\textbf{Y}}}_{1:T}=(\tilde{{\textbf{y}}}_1, \ldots , \tilde{{\textbf{y}}}_{T})\).
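A minimal sketch of this linear-interpolation resampling, assuming each per-frame sequence is stored as a (frames, channels) tensor (vertex offsets can be flattened to that layout first); the helper name is ours, not part of the released code.

```python
import torch
import torch.nn.functional as F


def resample_to_video(x, T):
    """Linearly resample a per-frame sequence from T' frames to T frames.

    x: tensor of shape (T', C), e.g. speech features W or flattened lip offsets Y~.
    """
    x = x.t().unsqueeze(0)                                    # (1, C, T')
    x = F.interpolate(x, size=T, mode="linear", align_corners=True)
    return x.squeeze(0).t()                                   # (T, C)
```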

The video data \({\textbf{F}}_{1:T}\) is fed into the expression-pose network, which uses MobileNetV2 as the backbone to extract per-frame visual features \(\mathbf {V^{\prime }}_{1:T} = (\mathbf {v^{\prime }}_1,\ldots , \mathbf {v^{\prime }}_T)\). To exploit the temporal information between video frames and learn the rich dynamic information of the video, a one-dimensional convolutional layer with kernel size 5, stride 1 and padding 2 forms the temporal convolution layer of the expression-pose network, which further extracts temporal visual features \({\textbf{V}}_{1:T} = ({\textbf{v}}_1,\ldots , {\textbf{v}}_T)\) from \(\mathbf {V^{\prime }}_{1:T}\). We then fuse the speech features \({\textbf{W}}_{1:T}\) and the temporal visual features \({\textbf{V}}_{1:T}\) by simple concatenation. The fused features pass through two fully connected layers to regress the chin pose parameters \(\theta _j\) related to lip movements, while the temporal visual features \({\textbf{V}}_{1:T}\) pass through a fully connected layer to regress the expression parameters \(\psi \) and head pose parameters \(\theta _p\) related only to the visual information. Here, \(\theta _j\) can be considered a fine-tuning of \(\tilde{{\textbf{Y}}}_{1:T}\), constrained by L2 regularization. Assuming that, without considering expressions and head poses, the true 3D face lip vertex offsets corresponding to the video data are \({\textbf{Y}}_{1:T}=({\textbf{y}}_1,\ldots , {\textbf{y}}_T)\), the output \(\tilde{{\textbf{Y}}}_{1:T}\) of the speech-driven network is close to \({\textbf{Y}}_{1:T}\) but carries an error \(\Delta {{\textbf{Y}}}_{1:T} = {\textbf{Y}}_{1:T} - \tilde{{\textbf{Y}}}_{1:T}\). This error is influenced by the speaking styles of different people (differences in lip opening) and by the accuracy of the speech-driven network. We expect the chin pose parameters \(\theta _j\), decoded by FLAME, to yield 3D face vertex offsets \(\Delta \tilde{{\textbf{Y}}}_{1:T}\) close to \(\Delta {{\textbf{Y}}}_{1:T}\), that is, \({\textbf{Y}}_{1:T} = \tilde{{\textbf{Y}}}_{1:T} + \Delta \tilde{{\textbf{Y}}}_{1:T}\), so that the predicted chin pose parameters \(\theta _j\) learn the residual between the prediction of the speech-driven network and the true result. Since this error is very small, L2 regularization is used to constrain the value of \(\theta _j\). Finally, the expression parameters \(\psi \), head pose parameters \(\theta _p\) and chin pose parameters \(\theta _j\) are decoded with the FLAME face decoder into 3D face vertex offsets \(\overline{{\textbf{Y}}}_{1:T}=(\overline{{\textbf{y}}}_1, \ldots , \overline{{\textbf{y}}}_{T})\) containing expression and pose information.
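The description above can be condensed into the following sketch, assuming FLAME's usual parameter sizes (50 expression coefficients and 3 rotation values each for head and jaw pose); the layer widths, the speech feature dimension and all names are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torchvision


class ExpressionPoseNet(nn.Module):
    def __init__(self, speech_dim=64, visual_dim=1280, feat_dim=256, n_expr=50):
        super().__init__()
        # MobileNetV2 backbone extracts per-frame visual features V'.
        self.backbone = torchvision.models.mobilenet_v2(weights="DEFAULT").features
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Temporal 1-D convolution (kernel 5, stride 1, padding 2) -> V.
        self.temporal = nn.Conv1d(visual_dim, feat_dim, kernel_size=5, stride=1, padding=2)
        # Visual-only heads: expression psi and head pose theta_p.
        self.expr_head = nn.Linear(feat_dim, n_expr)
        self.head_pose_head = nn.Linear(feat_dim, 3)
        # Fused (visual + speech) head: chin/jaw pose theta_j via two FC layers.
        self.jaw_head = nn.Sequential(
            nn.Linear(feat_dim + speech_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 3))

    def forward(self, frames, speech_feat):
        """frames: (T, 3, s, s) cropped video; speech_feat: (T, speech_dim), resampled."""
        v = self.pool(self.backbone(frames)).flatten(1)         # (T, visual_dim)
        v = self.temporal(v.t().unsqueeze(0)).squeeze(0).t()    # (T, feat_dim)
        psi = self.expr_head(v)                                 # expression parameters
        theta_p = self.head_pose_head(v)                        # head pose parameters
        theta_j = self.jaw_head(torch.cat([v, speech_feat], dim=-1))  # chin pose parameters
        return psi, theta_p, theta_j
```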

During training, the predictions of the various parameters from the DECA 3D face reconstruction network are combined to generate, through the FLAME face decoder, a 3D face model that matches the shape, pose and expression of the face in the video. The expression-pose network is optimized using a geometric constraint loss and multiple consistency losses. However, due to GPU memory limitations, feeding all frames of a video into the network at once would exceed the CUDA memory capacity; therefore, each training iteration uses only a contiguous segment of K frames sampled from the video, with a random starting position. During inference, adding \(\tilde{{\textbf{Y}}}_{1:T}\) and \(\overline{{\textbf{Y}}}_{1:T}\) to the 3D face template \(\overline{{\textbf{T}}}\) yields a lip-synced, pose-controllable and naturally expressive 3D facial animation.
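A minimal sketch of these two details, i.e., sampling a random contiguous K-frame clip during training and the final vertex assembly at inference; `template` stands for the FLAME face template \(\overline{{\textbf{T}}}\), and the offset tensors for the time-aligned \(\tilde{{\textbf{Y}}}_{1:T}\) and \(\overline{{\textbf{Y}}}_{1:T}\).

```python
import torch


def sample_clip(frames, speech_feat, lip_offsets, K=20):
    """Pick K consecutive frames (and the aligned features) with a random start."""
    T = frames.shape[0]
    start = torch.randint(0, T - K + 1, (1,)).item()
    sl = slice(start, start + K)
    return frames[sl], speech_feat[sl], lip_offsets[sl]


def assemble_animation(template, lip_offsets, expr_pose_offsets):
    """template: (V, 3); both offset sequences: (T, V, 3), already aligned in time."""
    return template.unsqueeze(0) + lip_offsets + expr_pose_offsets   # (T, V, 3)
```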

Loss function

To train the expression-pose network, we employ three consistency losses and a geometric constraint loss that guide the network to reconstruct 3D facial animations with accurate expressions and poses.

Consistency losses

The expression and pose parameters predicted by the expression-pose network are combined with the shape, albedo, camera and lighting parameters generated by the 3D face reconstruction network, together with the 3D face lip offsets predicted by the speech-driven network, to render 2D videos of the predicted 3D facial animation sequences corresponding to the original input video. As mentioned earlier, we introduce three pre-trained task-specific models that take the original video and the rendered video and produce their respective feature vectors. By minimizing the distance between the feature vectors of the rendered video and those of the original video, we can better guide the expression-pose network to generate 3D facial animation. The three consistency losses are introduced below.

Emotion consistency loss The emotion recognition network is based on the pre-trained model provided by EMOCA [11], with ResNet50 as the backbone network. The original video and rendered video are fed into the pre-trained emotion recognition network, obtaining the emotion features \(\varvec{\epsilon }_V=E(V)\) of each frame of the original video and the emotion features \(\varvec{\epsilon }_{R}=E(V_{R})\) of each frame of the rendered video. The emotion consistency loss calculates the difference between the emotion features \(\varvec{\epsilon }_V\) and \(\varvec{\epsilon }_{R}\) using mean squared error (MSE), as shown in the following equation:

$$\begin{aligned} L_{e m o} = \left\| \varvec{\epsilon }_V-\varvec{\epsilon }_{R}\right\| ^2 \end{aligned}$$
(1)

\(L_{emo}\) measures the perceptual difference between each frame of the original video and the rendered video, rather than the geometric error. Optimizing this loss during training ensures that the reconstructed 3D face conveys the emotional content of the input video.

Lip-reading consistency loss The emotion consistency loss does not retain sufficient lip-shape information for the mouth area, and a geometric loss on 2D landmarks cannot guarantee accurate lip motion. Therefore, an additional lip-reading consistency loss on the mouth area is needed to guide the expression and jaw pose parameters output by the network to capture the complexity of lip motion. The lip-reading estimation network is a pre-trained model provided by Ma et al. [28], which takes cropped grayscale images around the mouth as an input sequence and outputs predicted character sequences. In the lip-reading consistency loss, only the lip-reading features extracted in the intermediate stage of the lip-reading network are used, rather than the predicted character sequences. Thus, the mouth-region crops of the original video and the rendered video are fed into the pre-trained lip-reading network to obtain the per-frame lip-reading features \(\varvec{\xi }_V=L(V)\) of the original video and \(\varvec{\xi }_{R}=L(V_{R})\) of the rendered video. The lip-reading consistency loss uses the mean squared error (MSE) to minimize the difference between \(\varvec{\xi }_V\) and \(\varvec{\xi }_{R}\), as shown in the following equation:

$$\begin{aligned} L_{lip} = \left\| \varvec{\xi }_V-\varvec{\xi }_{R}\right\| ^2 \end{aligned}$$
(2)

Pose consistency loss The emotion consistency loss and the lip-reading consistency loss do not address head pose. Therefore, an additional pose consistency loss is introduced to ensure that the network can reconstruct the head pose of the original video, especially in extreme head pose situations. Various pose estimation methods have achieved good results; for example, Qi et al. [30] proposed a hand pose estimation model based on a deep convolutional neural network (DCNN) with automatic labeling and classification. In our experiments, the pose estimation network is a pre-trained model provided by Hempel et al. [20], which takes the original video and the rendered video as input and produces the per-frame pose feature vectors \(\varvec{\eta }_V=P(V)\) of the original video and \(\varvec{\eta }_{R}=P(V_{R})\) of the rendered video. The pose consistency loss uses the mean squared error (MSE) to compute the difference between \(\varvec{\eta }_V\) and \(\varvec{\eta }_{R}\), as shown in the following equation:

$$\begin{aligned} L_{pose} = \left\| \varvec{\eta }_V-\varvec{\eta }_{R}\right\| ^2 \end{aligned}$$
(3)
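All three consistency losses share the same form: run the original and the rendered clip through a frozen, pre-trained perceptual network and take the MSE between the resulting features. A sketch under that assumption, where `emotion_net`, `lip_net` and `pose_net` stand in for the pre-trained models of EMOCA [11], Ma et al. [28] and Hempel et al. [20]:

```python
import torch
import torch.nn.functional as F


def consistency_loss(perceptual_net, original, rendered):
    """Generic perceptual consistency loss (Eqs. 1-3).

    perceptual_net: frozen feature extractor returning one feature vector per frame
    original, rendered: (T, 3, H, W) video clips
    """
    with torch.no_grad():                       # the perceptual network is never trained
        feat_orig = perceptual_net(original)
    feat_rend = perceptual_net(rendered)        # gradients flow back through the renderer
    return F.mse_loss(feat_rend, feat_orig)


# L_emo  = consistency_loss(emotion_net, video, rendered_video)
# L_lip  = consistency_loss(lip_net, mouth_crops, rendered_mouth_crops)
# L_pose = consistency_loss(pose_net, video, rendered_video)
```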

Geometric loss

Although the consistency losses help preserve high-level perceptual information, in some cases the model tends to create artifacts due to the domain mismatch between rendered images and original images [14]. Because the consistency losses rely on pre-trained task-specific CNNs, they cannot guarantee that the rendered images behave like real images. It is therefore necessary to guide network training with geometric constraints.

We use an L2 penalty on the difference between the expression \(\varvec{\psi }\) and head pose \(\varvec{\theta }_p\) parameters predicted by our expression-pose network and those predicted by the DECA 3D face reconstruction network.

$$\begin{aligned} L_{\psi } = \left\| \varvec{\psi }-\varvec{\psi }^{D E C A}\right\| ^2 \end{aligned}$$
(4)
$$\begin{aligned} L_{\theta _p} = \left\| \varvec{\theta }_p-\varvec{\theta }_p^{D E C A}\right\| ^2 \end{aligned}$$
(5)

These regularization terms use the DECA estimate as a "good" starting point: the predictions of the expression-pose network should not deviate significantly from the DECA parameters, which have been shown to produce artifact-free results in practice. In other words, this regularization scheme indirectly imposes constraints that are hard-coded by DECA and its training process. In addition, the chin pose parameter \(\varvec{\theta }_j\) is treated as a fine-tuning of the lip animation produced by the speech-driven network and is therefore also constrained with the L2 norm.

$$\begin{aligned} L_{\theta _j} = \left\| \varvec{\theta }_j\right\| ^2 \end{aligned}$$
(6)

In addition to the regularization losses above, we apply an L1 loss between the 48 facial landmarks of the nose, facial contour and eyes projected from the 3D face model and the corresponding 2D landmarks in the video frame.

$$\begin{aligned} L_{l m k}=\sum _{i=1}^{48} w_i \left\| k_i-s\Pi (M_i)\right\| \end{aligned}$$
(7)

In this equation, \(k_i\) denotes the 2D facial landmarks in the video frame, \(s\Pi (M_i)\) denotes the 2D landmarks projected from the 3D facial landmarks, and \(w_i\) is the weight of the corresponding landmark. For the 20 facial landmarks in the mouth area, we apply a more relaxed L2 loss:

$$\begin{aligned} L_{lip\_l m k}=\sum _{i=49}^{68}\left\| k_i-s\Pi (M_i)\right\| ^2 \end{aligned}$$
(8)

To make the generated 3D facial animation smoother, a velocity loss penalizes the distance between the consecutive-frame differences of the 2D facial landmarks in the predicted output and those in the training videos. It is computed as follows:

$$\begin{aligned} L_v=\sum _{j=2}^{T} \left\| (k_j-k_{j-1})-(s \Pi (M_j)-s \Pi (M_{j-1}))\right\| ^2 \end{aligned}$$
(9)
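The geometric terms of Eqs. (4)-(9) amount to the following sketch, assuming `k` and `p` hold the detected and projected 2D landmarks per frame in the standard 68-point ordering (points 49-68 are the mouth) and `w` the per-point weights; the names and the mean reduction over frames are our assumptions.

```python
import torch


def geometric_losses(k, p, w, expr, expr_deca, head_pose, head_pose_deca, jaw_pose):
    """k, p: (T, 68, 2) detected / projected 2D landmarks; w: (48,) per-point weights."""
    # Eqs. 4-6: keep expression / head pose close to DECA, keep the jaw correction small.
    L_expr = ((expr - expr_deca) ** 2).sum(-1).mean()
    L_head = ((head_pose - head_pose_deca) ** 2).sum(-1).mean()
    L_jaw = (jaw_pose ** 2).sum(-1).mean()

    # Eq. 7: weighted L1 on the 48 contour / nose / eye landmarks.
    L_lmk = (w * (k[:, :48] - p[:, :48]).norm(dim=-1)).sum(-1).mean()

    # Eq. 8: relaxed L2 on the 20 mouth landmarks.
    L_lip_lmk = ((k[:, 48:] - p[:, 48:]) ** 2).sum(dim=(-1, -2)).mean()

    # Eq. 9: velocity loss on consecutive-frame landmark differences.
    L_vel = (((k[1:] - k[:-1]) - (p[1:] - p[:-1])) ** 2).sum(dim=(-1, -2)).mean()

    return L_expr, L_head, L_jaw, L_lmk, L_lip_lmk, L_vel
```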

Experiments

Implementation details

Our network model was implemented in PyTorch and trained on an NVIDIA GeForce RTX 3080 Ti GPU. We used the Adam optimizer with an initial learning rate of 5e-5, reduced by a factor of 5 after 50,000 iterations, a video sampling sequence length of K = 20, and a batch size of 1.
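A sketch of this training configuration; `ExpressionPoseNet`, `loader` and `compute_losses` are placeholders for the network, data pipeline and loss aggregation described above, not components of the released code.

```python
import torch

model = ExpressionPoseNet()                      # placeholder network (see earlier sketch)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
# Reduce the learning rate by a factor of 5 after 50,000 iterations.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50_000], gamma=0.2)

K, batch_size = 20, 1                            # clip length and batch size
for step, (frames, speech_feat, lip_offsets) in enumerate(loader):   # hypothetical loader
    loss = compute_losses(model, frames, speech_feat, lip_offsets)   # hypothetical helper
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                             # step per iteration, not per epoch
```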

Dataset

Our qualitative and perceptual evaluations followed an evaluation procedure similar to that of SPECTRE [14]. To conduct the evaluations, we used the following datasets:

LRS3 [Footnote 1]

Qualitative evaluation

As shown in Table 1, we report the qualitative evaluation results of different methods on the LRS3, TCD-TIMIT and MEAD test sets using four metrics: CER, WER, VER and VWER. RGB denotes the lip-reading results obtained from the original videos in the datasets; the results for DECA [13], 3DDFAv2 [17] and DAD-3DHeads [29] are the experimental data provided by SPECTRE [14], while SPECTRE [14] and EMOCA [11] are the results of their official pre-trained models tested in the same environment as ours.

Compared with other methods, our method achieved lower CER, WER, VER and VWER error rates on the LRS3 test set as well as on TCD-TIMIT and MEAD. However, compared to the original video, the qualitative evaluation results for CER, WER, VER and VWER are still much higher. This is because the rendered images are from a different domain than the real images, and important features such as teeth and tongue are missing from the 3D facial model. Teeth and tongue play a significant role in detecting specific types of phonemes or visemes (such as dental and labial consonants). Overall, our proposed method achieved the best lip-reading performance among all methods.
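For reference, CER/WER (and their viseme counterparts VER/VWER) are standard edit-distance error rates between the transcript predicted by the lip-reading model and the ground-truth transcript; a minimal word-level sketch, with the character/viseme variants obtained by swapping the tokenization:

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over words divided by reference length.
    CER/VER are the same computation over character / viseme symbols."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # cost of deleting i reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # cost of inserting j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```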

Table 1 Comparison of qualitative evaluation results for different methods

Perceptual evaluation

Fig. 4 Visual comparison of different methods. The videos are arranged from top to bottom: a original video, b EMOCA, c SPECTRE, d our method. Red boxes highlight poor lip-shape performance, while green boxes highlight good lip-shape performance

The qualitative evaluation mainly assesses the accuracy of lip shapes and does not evaluate the realism and lip-sync quality of the generated 3D facial animation. To evaluate these aspects, we conducted a user study following the perceptual evaluations of SPECTRE [14] and speech-driven animation works [12, 32]. To prevent potential dataset bias, we evaluated our method exclusively on videos from the MEAD and TCD-TIMIT datasets. We randomly selected 30 videos from MEAD (21 covering 7 emotions at 3 intensities, plus 9 neutral videos) and 10 videos from TCD-TIMIT. For these 40 videos, we generated the corresponding 3D facial animations using our proposed method, SPECTRE and EMOCA.

We designed two subtasks for the user study. The first subtask compares the overall realism of the 3D facial animation: the original video, our generated animation and a comparison method's animation are played side by side, with the positions of our method and the comparison method randomized. Participants choose the animation that looks more realistic or closer to the original video, or select both if they cannot distinguish them. The second subtask compares lip sync: only our generated animation and the comparison method's animation are played side by side, with their left-right positions randomized. Participants focus on the lip shapes and choose the animation with better lip sync, or select both if they cannot distinguish them.

This user study invited 14 adult participants with normal cognitive abilities and good English proficiency, and the final results are reported in Table 2. Partial visual comparisons of different methods are also shown in Fig. 4.

Table 2 User study results on realism and lip sync

Compared with EMOCA, our approach is ranked better than or equal to it in more than 77% of the cases for realism and in 79% of the cases for lip sync. Compared with SPECTRE, our method is ranked better or equal in over 70% of the cases for overall realism and in 66% for lip sync. This user study indicates that our speech-video jointly driven method outperforms previous video-based methods.

Ablation study

Ablation on preprocessing and audio features

To explore the impact of the video frame preprocessing algorithm, the speech features, and the lip animation produced by the speech-driven network on the expression-pose network, we construct six variants by enabling one or more components:

  (a) The baseline method: the previous video frame preprocessing method is used, i.e., each frame is cropped separately, and no audio signal is used as input.

  (b) Based on (a), our video frame preprocessing method is used, which uniformly crops consecutive frames of the video.

  (c) Based on (a), the speech features of the speech-driven network are added as input to the expression-pose network and fused with the visual features.

  (d) Based on (a), the lip animation predicted by the speech-driven network is used as the basis.

  (e) Based on (a), both (c) and (d) are included, i.e., the speech features of the speech-driven network are added as input to the expression-pose network for feature fusion, and the lip animation predicted by the speech-driven network is used as the basis.

  (f) Our final method, including (b), (c) and (d).

The qualitative evaluation results of these variants on the LRS3 dataset are reported in Table 3. Our video frame preprocessing, the speech features and the lip animation from the speech-driven network all contribute to improvements in expression and pose. Among them, the video frame preprocessing reduces VER the most, surpassing even the combination of speech features and lip animation. It also has a clear effect on WER and VWER, exceeding that of the speech features or the lip animation alone.

Table 3 Ablation study on LRS3 test set
Fig. 5 Comparison of lip-animation-based variants with and without fused speech features: a raw video, b variant (d), c variant (e)

In practice, as shown in Fig. 5, we found that in variant (d), adding only the speech-driven lip animation without fusing speech features causes unnatural results on the MEAD dataset, even though the four lip-reading metrics improve. In variant (e), the problem is effectively solved by fusing the speech features, while the lip-reading metrics improve further. A likely reason is that, without speech features, the expression-pose network lacks sufficient information to infer the lip animation predicted by the speech-driven network, resulting in unnatural results after fusion. This indicates that fusing speech features improves the generalization ability of the expression-pose network.

Ablation on pose consistency loss

To validate the effectiveness of the introduced pose consistency loss, we trained networks with and without it and compared the results against the EMOCA and SPECTRE networks, as shown in Fig. 6. In the third column, the expression-pose network with the pose consistency loss correctly predicts the head pose parameters, whereas EMOCA, SPECTRE and the expression-pose network without the pose consistency loss all predict incorrect head poses. This is because these methods all rely on DECA for head pose prediction, and DECA cannot handle extreme head poses correctly. In the fourth column, all methods fail for an even more extreme head pose. The expression-pose network with the pose consistency loss fails here mainly because the head pose estimation network itself makes incorrect predictions. This reveals a limitation of the head pose consistency loss: it is bounded by the head pose estimation network, and the dataset may lack video frames with extreme head poses.

Fig. 6 Comparison of head poses generated by different methods: a raw video, b head pose estimation network, c EMOCA, d SPECTRE, e without pose consistency loss, f with pose consistency loss. Red boxes indicate frames with head pose estimation issues, while orange boxes indicate artifacts around the mouth. Our method uses the mean squared error (MSE) to compute the lip-related losses, which reduces the likelihood of mouth artifacts

In conclusion, introducing the head pose consistency loss improves the prediction accuracy of the expression-pose network for some extreme head poses compared with other methods. However, due to the limitations of the head pose estimation network and the scarcity of extreme-pose data in the dataset, accurately predicting very extreme head poses remains difficult.

Conclusions

This paper proposes a dual-modal generation method that utilizes both speech and video information to generate more natural and vivid 3D facial animations. The method builds an additional expression-pose network, driven by video input and speech features, on top of a speech-driven method, and is trained on publicly available 2D face datasets. Additionally, we design a new video frame preprocessing algorithm that uniformly crops all frames within a video for consistency. To further enhance the network's ability to generate accurate expressions and poses, we employ an emotion consistency loss, a lip-reading consistency loss and a newly designed pose consistency loss. Qualitative and perceptual evaluations demonstrate that our proposed method, driven by both speech and video, outperforms existing models in generating 3D facial animations.

Currently, this paper utilizes the state-of-the-art 3D facial statistical model FLAME [24]. However, the facial model obtained through principal component analysis (PCA) fails to capture finer facial expressions. In the future, improvements can be made on the 3D facial model to enhance its nonlinearity and refine the expression manipulation capabilities. Additionally, the current 3D facial model lacks details such as teeth and hair. Exploring automated generation methods for these facial details is an area that can be further improved in the future. Integrating these improvements into the existing framework and achieving real-time performance presents a significant challenge.