1 Introduction

Being able to animate a still image of a face in a controllable, lightweight manner has many applications in image editing/enhancement and interactive systems (e.g. animating an on-screen agent with natural human poses/expressions). This is a challenging task, as it requires representing the face (e.g. modelling it in 3D) in order to control it, and a method of mapping the desired form of control (e.g. expression or pose) back onto the face representation. In this paper we investigate whether it is possible to forgo an explicit face representation and instead implicitly learn this in a self-supervised manner from a large collection of video data. Further, we investigate whether this implicit representation can then be used directly to control a face with another modality, such as audio or pose information.

Fig. 1. Overview of X2Face: a model for controlling a source face using a driving frame, audio data, or specifying a pose vector. X2Face is trained without expression or pose labels.

To this end, we introduce X2Face, a novel self-supervised network architecture that can be used for face puppeteering of a source face given a driving vector. The source face is instantiated from a single or multiple source frames, which are extracted from the same face track. The driving vector may come from multiple modalities: a driving frame from the same or another video face track, pose information, or audio information; this is illustrated in Fig. 1. The generated frame resulting from X2Face has the identity, hairstyle, etc. of the source face but the properties of the driving vector (e.g. the given pose, if pose information is given; or the driving frame’s expression/pose, if a driving frame is given).

The network is trained in a self-supervised manner using pairs of source and driving frames. These frames are input to two subnetworks: the embedding network and the driving network (see Fig. 2). By controlling the information flow in the network architecture, the model learns to factorise the problem. The embedding network learns an embedded face representation for the source face – effectively face frontalisation; the driving network learns how to map from this embedded face representation to the generated frame via an embedding, named the driving vector.

The X2Face network architecture is described in Sect. 3.1, and the self-supervised training framework in Sect. 3.2. In addition we make two further contributions. First, we propose a method for linearly regressing from a set of labels (e.g. for head pose) or features (e.g. from audio) to the driving vector; this is described in Sect. 4. The performance is evaluated in Sect. 5, where we show (i) the robustness of the generated results compared to state-of-the-art self-supervised [45] and supervised [1] methods; and (ii) the controllability of the network using other modalities, such as audio or pose. The second contribution, described in Sect. 6, shows how the embedded face representation can be used for video face editing, e.g. adding facial decorations in the manner of [31] using multiple or just a single source frame.

2 Related Work

Explicit Modelling of Faces for Image Generation. Traditionally, facial animation (or puppeteering) given one image was performed by fitting a 3D morphable model (3DMM) and then modifying the estimated parameters [3]. Later work has built on the fitting of 3DMMs by including high level details [34, 41], taking into account additional images [33] or 3D scans [4], or by learning 3DMM parameters directly from RGB data without ground truth labels [2, 39]. Please refer to Zollhöfer et al. [46] for a survey.

Given a driving and a source video sequence, a 3DMM or 3D mesh can be obtained and used to model both the driving and source face [10, 40, 43]. The estimated 3D is used to transform the expression of the source face to match that of the driving face. However, this requires additional steps to transfer the hidden regions (e.g. the teeth). To address this, a neural network conditioned on a single driving image can be used to predict higher level details that fill in these hidden regions [25].

Motivated by the fact that a 3DMM approach is limited by the components of the corresponding morphable model, which may not model the full range of required expressions/deformations and the higher level details, [1] propose a 2D warping method. Given only one source image, [1] use facial landmarks in order to warp the expression of one face onto another. They additionally allow for fine scale details to be transferred by monitoring changes in the driving video.

An interesting related set of works considers how to frontalise a face in a still image using a generic reference face [14], how to transfer the expressions of an actor to an avatar [35], and how to swap one face with another [20, 24].

Learning Based Approaches for Image Generation. There is a wealth of literature on supervised/self-supervised approaches; here we review only the most relevant work. Supervised approaches for controlling a given face learn to model factors of variation (e.g. lighting, pose, etc.) by conditioning the generated image on known ground truth information, which may be head pose, expression, or landmarks [5, 12, 21, 30, 42, 44]. This requires a training dataset with known pose or expression information, which may be expensive to obtain or require subjective judgement (e.g. in determining the expression). Consequently, self-supervised and unsupervised approaches attempt to automatically learn the required factors of variation (e.g. optical flow or pose) without labelling. This can be done by maximising mutual information [7] or by training the network to synthesise future video frames [11, 29].

Another relevant self-supervised method is CycleGAN [45] which learns to transform images of one domain into those of another. While not explicitly devised for this task, as CycleGAN learns to be cycle-consistent, the transformed images often bear semantic similarities to the original images. For example, a CycleGAN model trained to transform images of one person’s face (domain A) into those of another (domain B), will often learn to map the pose/position/expression of the face in domain A onto the generated face from domain B.

Using Multi-modal Setups to Control Image Generation. Other modalities, such as audio, can control image generation by using a neural network that learns the relationship between audio and correlated parts in corresponding images. Examples are controlling the mouth with speech [8, 38], controlling a head with audio and a known emotional state [16], and controlling body movement with music [36].

Our method has the benefits of being self-supervised and of allowing the generation process to be controlled by other modalities, without requiring explicit modelling of the face. It is therefore applicable to other domains.

3 Method

This section introduces the network architecture in Sect. 3.1, followed by the curriculum strategy used to train the network in Sect. 3.2.

Fig. 2. An overview of X2Face during the initial training stage. Given multiple frames of a video (here 4 frames), one frame is designated the source frame and another the driving frame. The source frame is input to the embedding network, which learns a sampler to map pixels from the source frame to the embedded face. The driving frame is input to the driving network, which learns to map pixels from the embedded face to the generated frame. The generated frame should have the identity of the source frame and the pose/expression of the driving frame. In this training stage, as the frames are from the same video, the generated and driving frames should match. However, at test time the identities of the source and driving face can differ.

3.1 Architecture

The network takes two inputs: a driving and a source frame. The source frame is input to the embedding network and the driving frame to the driving network. This is illustrated in Fig. 2. Precise architectural details are given in the supplementary material.

Embedding Network. The embedding network learns a bilinear sampler to determine how to map from the source frame to a face representation, the embedded face. The architecture is based on U-Net [32] and pix2pix [15]; the output is a 2-channel image (of the same dimensions as the source frame) that encodes the flow \(\delta x, \delta y\) for each pixel.

While the embedding network is not explicitly forced to frontalise the source frame, we observe that it learns to do so for the following reason. Because the driving network samples from the embedded face to produce the generated frame without knowing the pose/expression of the source frame, it needs the embedded face to have a common representation (e.g. be frontalised) across source frames with differing poses and expressions.
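
The paper does not spell out how the 2-channel flow is turned into samples, so the following is a minimal sketch of one common way to do it in PyTorch (the framework used in Sect. 5): the predicted offsets are added to an identity grid and passed to `grid_sample`. The function name and the assumption that the offsets are expressed in normalised coordinates are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(image, flow):
    """Bilinearly sample `image` at locations given by a per-pixel flow field.

    image: (B, 3, H, W) input frame
    flow:  (B, 2, H, W) predicted (dx, dy) offsets, assumed here to be in the
           normalised [-1, 1] coordinate system used by grid_sample
    """
    B, _, H, W = image.shape
    # Identity sampling grid with x and y in [-1, 1].
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij"
    )
    base_grid = torch.stack((xs, ys), dim=-1).to(image)        # (H, W, 2)
    grid = base_grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)   # (B, H, W, 2)
    return F.grid_sample(image, grid, mode="bilinear", align_corners=True)
```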

Driving Network. The driving network takes a driving frame as input and learns a bilinear sampler to transform pixels from the embedded face to produce the generated frame. It has an encoder-decoder architecture. In order to sample correctly from the embedded face and produce the generated frame, the latent embedding (the driving vector) must encode pose/expression/zoom/other factors of variation.
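
To make the information flow concrete, the sketch below shows how the two sub-networks could compose in a forward pass, reusing `warp_with_flow` from the sketch above. The class, attribute names, and the 128D latent size are illustrative; the exact architecture is given in the supplementary material.

```python
import torch.nn as nn

class X2FaceSketch(nn.Module):
    """Schematic composition of the embedding and driving networks (names illustrative)."""
    def __init__(self, embedding_net, driving_encoder, driving_decoder):
        super().__init__()
        self.embedding_net = embedding_net      # U-Net: source frame -> 2-channel flow
        self.driving_encoder = driving_encoder  # driving frame -> driving vector (e.g. 128D)
        self.driving_decoder = driving_decoder  # driving vector -> 2-channel flow

    def forward(self, source_frame, driving_frame):
        # 1. Warp the source frame into the (roughly frontalised) embedded face.
        embedded_face = warp_with_flow(source_frame, self.embedding_net(source_frame))
        # 2. Encode the driving frame into the driving vector (pose/expression/zoom, ...).
        driving_vector = self.driving_encoder(driving_frame)
        # 3. Decode a second flow field and sample the embedded face with it.
        generated = warp_with_flow(embedded_face, self.driving_decoder(driving_vector))
        return generated, embedded_face, driving_vector
```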

3.2 Training the Network

The network is trained with a curriculum strategy consisting of two stages. The first training stage (I) is fully self-supervised. In the second training stage (II), the model from stage (I) is finetuned using a CNN pre-trained for face identification, which adds constraints based on the identity of the faces in the source and driving frames.

I. The first stage (illustrated in Fig. 2) uses only a pixelwise L1 loss between the generated and the driving frames. Whilst this is sufficient for the driving frame to control expression and pose, we observe that some face shape information leaks through the driving vector (e.g. the generated face becomes fatter/longer depending on the face in the driving frame). Consequently, we introduce additional loss functions – called identity loss functions – in the second stage.

II. In the second stage, the identity loss functions are applied to enforce that the identity is the same between the generated and the source frames, irrespective of the identity of the driving frame. This loss should mitigate the face shape leakage discussed in stage I. In practice, one source frame \(s_A\) of identity A, and two driving frames \(d_A\), \(d_R\) are used as training inputs; \(d_A\) is of identity A and \(d_R\) of a random identity. This gives two generated frames \(g_{d_A}, g_{d_R}\) respectively, which should both be of identity A. Two identity loss functions are then imposed: \(\mathcal {L}_{\text {identity}}(d_A, g_{d_A})\) and \(\mathcal {L}_{\text {identity}}(s_A, g_{d_R})\). \(\mathcal {L}_{\text {identity}}\) is implemented using a network pre-trained for identity to measure the similarity of the images in feature space by comparing appropriate layers of the network (i.e. a content loss as in [6, 13]). The precise layers are chosen based on whether we are considering \(g_{d_A}\) or \(g_{d_R}\):

Fig. 3. The identity loss function when the source and driving frames are of different identities. This loss enforces that the generated frame has the same identity as the source frame.

  1. \(\mathcal {L}_{\textit{identity}}(d_A, g_{d_A})\). \(g_{d_A}\) should have the same identity, pose and expression as \(d_A\), so we use the photometric L1 loss and an L1 content loss on the Conv2-5 and Conv7 layers (i.e. layers that encode both lower and higher level information such as pose and identity) between \(g_{d_A}\) and \(d_A\).

  2. \(\mathcal {L}_{\textit{identity}}(s_A, g_{d_R})\) (Fig. 3). \(g_{d_R}\) should have the identity of \(s_A\) but the pose and expression of \(d_R\). Consequently, we cannot use the photometric loss but only a content loss. We minimise an L1 content loss on the Conv6-7 layers (i.e. layers encoding higher level identity information) between \(g_{d_R}\) and \(s_A\).

The pre-trained network used for these losses is the 11-layer VGG network (configuration A) [37] trained on the VGG-Face Dataset [26].
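
As a rough illustration of how such content losses can be implemented, the sketch below compares L1 distances between activations of a frozen face-identification network at selected layers. It assumes the face network is wrapped so that its sub-modules are exposed as named children (e.g. "conv6", "conv7"); this organisation, and the class name, are our assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class ContentLoss(nn.Module):
    """L1 distance between activations of selected layers of a frozen ID network."""
    def __init__(self, face_net, layer_names):
        super().__init__()
        self.face_net = face_net.eval()          # e.g. VGG trained on VGG-Face, frozen
        for p in self.face_net.parameters():
            p.requires_grad = False
        self.layer_names = set(layer_names)      # e.g. {"conv6", "conv7"} for identity only

    def forward(self, generated, target):
        losses, x, y = [], generated, target
        for name, layer in self.face_net.named_children():
            x, y = layer(x), layer(y)
            if name in self.layer_names:
                losses.append(torch.abs(x - y).mean())
        return sum(losses)
```

For the same-identity case (item 1) such a loss would be instantiated over the lower and higher layers and combined with the photometric L1 loss; for the different-identity case (item 2) only the higher layers are used.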

4 Controlling the Image Generation with Other Modalities

Given a trained X2Face network, the driving vector can be used to control the source face with other modalities such as audio or pose.

4.1 Pose

Instead of controlling the generation with a driving frame, we can control the head pose of the source face using a pose code, such that varying the code's pitch/yaw/roll angles varies the generated frame accordingly. This is done by learning a forward mapping \(f_{p \rightarrow v}\) from head pose p to the driving vector v such that \(f_{p \rightarrow v}(p)\) can serve as a modified input to the driving network's decoder. However, this is an ill-posed problem; directly using this mapping loses information, as the driving vector encodes more than just pose.

As a result, we use vector arithmetic. Effectively we drive a source frame with itself but modify the corresponding driving vector \(v_{emb}^{source}\) to remove the pose of the source frame \(p_{source}\) and incorporate the new driving pose \(p_{driving}\). This gives:

$$\begin{aligned} v_{emb}^{driving} = v_{emb}^{source} + v_{emb}^{\varDelta pose} = v_{emb}^{source} + f_{p \rightarrow v}(p_{driving} - p_{source}). \end{aligned}$$
(1)

However, VoxCeleb [23] does not contain ground truth head pose, so an additional mapping \(f_{v \rightarrow p}\) is needed to determine \(p_{source}=f_{v \rightarrow p}(v_{emb}^{source})\).

\(f_{v \rightarrow p}\). \(f_{v \rightarrow p}\) is trained to regress p from v. It is implemented using a fully connected layer with bias and trained using an L1 loss. Training pairs (v, p) are obtained from an annotated dataset of images with pose labels p; v is obtained by passing the image through the encoder of the driving network.

\(f_{p \rightarrow v}\). \(f_{p \rightarrow v}\) is trained to regress v from p. It is implemented using a fully connected linear layer with bias followed by batch-norm. Once \(f_{v \rightarrow p}\) is known, this function can be learnt directly on VoxCeleb: an image is passed through X2Face to obtain the driving vector v, and \(f_{v \rightarrow p}(v)\) provides the corresponding pose p.
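
Putting Sect. 4.1 together, a sketch of the two linear mappings and of Eq. (1) might look as follows; the dimensions follow the 128D driving vector and the three pose angles, and the variable names (and batch conventions) are ours.

```python
import torch
import torch.nn as nn

# f_{v->p}: 128D driving vector -> (yaw, pitch, roll), trained with an L1 loss
# on (v, p) pairs from an annotated pose dataset.
vec_to_pose = nn.Linear(128, 3, bias=True)

# f_{p->v}: pose angles -> 128D driving-vector offset (linear layer + batch-norm),
# trained on VoxCeleb using vec_to_pose to supply the pose targets.
pose_to_vec = nn.Sequential(nn.Linear(3, 128, bias=True), nn.BatchNorm1d(128))

def pose_driven_vector(v_source, p_driving):
    """Eq. (1): keep the source's own driving vector but swap in the desired pose."""
    p_source = vec_to_pose(v_source)              # estimated pose of the source frame
    return v_source + pose_to_vec(p_driving - p_source)
```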

4.2 Audio

Audio data from the videos in the VoxCeleb dataset can be used to drive a source face in a manner similar to that of pose: the source frame is driven with itself, but the driving vector is modified using the audio from another frame. The forward mapping \(f_{a \rightarrow v}\) from audio features a to the corresponding driving vector v is trained using pairs of audio features a and driving vectors v. These can be extracted directly from VoxCeleb (so no backward mapping \(f_{v \rightarrow a}\) is required). a is obtained by extracting the 256D audio features from the neural network in [9], and the 128D v by passing the corresponding frame through the driving network's encoder. Ordinary least squares linear regression is then used to learn \(f_{a \rightarrow v}\) after first normalising the audio features. No normalisation is used when employing the mapping to drive the frame generation; this amplifies the signal, visually improving the generated results.
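
A least-squares fit of \(f_{a \rightarrow v}\) can be written in a few lines; here `a_feats` (N x 256 audio features, already normalised as described above) and `v_vecs` (N x 128 driving vectors) are placeholder tensors assumed to be pre-extracted.

```python
import torch

# a_feats: (N, 256) normalised audio features; v_vecs: (N, 128) driving vectors.
A = torch.cat([a_feats, torch.ones(a_feats.shape[0], 1)], dim=1)  # append a bias column
W = torch.linalg.lstsq(A, v_vecs).solution                        # W has shape (257, 128)

def audio_to_vec(a):
    """f_{a->v}: map (a batch of) 256D audio features to 128D driving vectors."""
    ones = torch.ones(*a.shape[:-1], 1)
    return torch.cat([a, ones], dim=-1) @ W
```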

As learning the function \(f_{a \rightarrow v}: \mathbb {R}^{1\times 256} \rightarrow \mathbb {R}^{1\times 128}\) is under-constrained, the learnt mapping also encodes some pose information. Therefore, we additionally use the mappings \(f_{p \rightarrow v}\) and \(f_{v \rightarrow p}\) described in Sect. 4.1 to remove this information. Given driving audio features \(a_{driving}\) and the corresponding, non-modified driving vector \(v_{emb}^{source}\), the new driving vector \(v_{emb}^{driving}\) is then

$$\begin{aligned} v_{emb}^{driving} = v_{emb}^{source} + f_{a \rightarrow v}(a_{driving}) - f_{a \rightarrow v}(a_{source}) + f_{p \rightarrow v}(p_{audio} - p_{source}), \end{aligned}$$

where \(p_{source}=f_{v \rightarrow p}(v_{emb}^{source})\) is the head pose of the frame input to the driving network (i.e. the source frame), \(p_{audio}=f_{v \rightarrow p}(f_{a \rightarrow v}(a_{driving}))\) is the pose information contained in \(f_{a \rightarrow v}(a_{driving})\), and \(a_{source}\) is the audio feature vector corresponding to the source frame.
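
Expressed with the mappings sketched earlier (`audio_to_vec`, `vec_to_pose`, `pose_to_vec`), this amounts to the following; batching and tensor shapes are assumed to line up as in those sketches.

```python
def audio_driven_vector(v_source, a_driving, a_source):
    """Compute v_emb^driving following the equation above (reusing the earlier mappings)."""
    v_audio = audio_to_vec(a_driving)        # f_{a->v}(a_driving)
    p_source = vec_to_pose(v_source)         # f_{v->p}(v_emb^source)
    p_audio = vec_to_pose(v_audio)           # pose information leaked into the audio mapping
    return (v_source
            + v_audio
            - audio_to_vec(a_source)
            + pose_to_vec(p_audio - p_source))
```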

5 Experiments

This section evaluates X2Face by first performing an ablation study in Sect. 5.1 on the architecture and losses used for training, followed by results for controlling a face with a driving frame in Sect. 5.2, pose information in Sect. 5.3, and audio information in Sect. 5.4.

Training. X2Face is trained on the VoxCeleb video dataset [23] using dlib [18] to crop the faces to \(256\times 256\). The identities are randomly split into train/val/test sets (with a split of 75/15/10), and frames are extracted at one fps to give 900,764 frames for training and 125,131 frames for testing.

The model is trained in PyTorch [27] using SGD with momentum 0.9 and a batch size of 16. First, it is trained with only the L1 loss and a learning rate of 0.001. The learning rate is decreased by a factor of 10 when the loss plateaus. Once the loss converges, the identity losses are incorporated and weighted as follows: (i) for same identities, to be as strong as the photometric L1 loss at each layer; (ii) for different identities, to be 1/10 the size of the photometric loss at each layer. This training phase is started with a learning rate of 0.0001.
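
A condensed sketch of the two training stages with the reported hyper-parameters is given below. The data loader, the identity-loss callables, and the scalar weights standing in for the per-layer balancing described above are all assumptions; `model` follows the `X2FaceSketch` interface from Sect. 3.1's sketch.

```python
import torch

def train(model, loader, stage, id_loss_same=None, id_loss_diff=None):
    """Curriculum training sketch (Sect. 3.2)."""
    lr = 1e-3 if stage == 1 else 1e-4   # lr is divided by 10 when the loss plateaus
    optimiser = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for source, driving_same, driving_rand in loader:            # batches of 16
        generated_same, _, _ = model(source, driving_same)
        loss = torch.abs(generated_same - driving_same).mean()   # photometric L1
        if stage == 2:
            generated_rand, _, _ = model(source, driving_rand)
            # Scalar weights stand in for the per-layer balancing described in the text.
            loss = loss + id_loss_same(generated_same, driving_same) \
                        + 0.1 * id_loss_diff(generated_rand, source)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
```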

Testing. The model can be tested using either a single or multiple source frames. The reasoning for this is that if the embedded face is stable (e.g. different facial regions always map to the same place on the embedded face), we expect to be able to combine multiple source frames by averaging over the embedded faces.
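
Reusing the earlier sketches, combining several source frames at test time then reduces to averaging their embedded faces before driving; this is a sketch, not the exact test-time procedure.

```python
import torch

def averaged_embedded_face(model, source_frames):
    """Average the embedded faces of several source frames (test time only)."""
    embedded = [warp_with_flow(f, model.embedding_net(f)) for f in source_frames]
    return torch.stack(embedded, dim=0).mean(dim=0)
```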

5.1 Architecture Studies

To quantify the utility of using additional views at test time and the benefit of the curriculum strategy for training the network (i.e. using the identity losses explained in Sect. 3.2), we evaluate the results for these different settings on a left-out test set of VoxCeleb. We consider 120K source and driving pairs where the driving frame is from the same video as the source frames; thus, the generated frame should be the same as the driving frame. The results are given in Table 1.

Table 1. L1 reconstruction error on the test set, comparing the generated frame to the ground truth frame (in this case the driving frame) for different training/testing setups. Lower is better for L1 error. Additionally, we give the percentage improvement over the L1 error for the model trained with only training stage I and tested with a single source frame. In this case, higher is better

The results in Table 1 confirm that both training with the curriculum strategy and using additional views at test time improve the reconstructed image. The supplementary material includes qualitative results and shows that using additional source frames when testing is especially useful if a face is seen at an extreme pose in the initial source frame.

Fig. 4. Comparison of X2Face's generated frames to those of CycleGAN given a driving video sequence. Each example shows from bottom to top: the driving frame, our generated result and CycleGAN's generated result. To the left, source frames for X2Face are shown (at test time CycleGAN does not require source frames, as it has been trained to map between the given source and driving identities). These examples demonstrate multiple benefits of our method. First, X2Face is capable of preserving the face shape of the source identity (top row) whilst driving the pose and expression according to the driving frame (bottom row); CycleGAN correctly keeps pose and expression but loses information about face shape and geometry when given too few training images, as in example (a) (whereas X2Face requires no training samples for new identities). Second, X2Face has temporal consistency. CycleGAN samples from the latent space, so it sometimes samples from different videos, resulting in jarring changes between frames (e.g. in example (c)).

5.2 Controlling Image Generation with a Driving Frame

The motivation of our architecture is to be able to map the expression and pose of a driving frame onto a source frame without any annotations on expression or pose. This section demonstrates that X2Face does indeed achieve this: a set of source frames can be controlled with a driving video to generate realistic results. We compare to two methods: CycleGAN [45], which uses no labels, and [1], which is designed top-down and demonstrates impressive results. Additional qualitative results are given in the supplementary material and video.

Comparison to CycleGAN [45]. CycleGAN learns a mapping from a given domain (in this case a given identity A) to another domain (in this case another identity B). To compare to their method for a given pair of identities, we take all images of the given identities (so images may come from different video tracks) to form two sets of images: one set corresponding to identity A and the other to B. We then train their model using these sets. For a given driving frame of identity A, we then visualise their generated frame of identity B and compare it to that of X2Face.

The results in Fig. 4 illustrate multiple benefits. First, X2Face generalises to unseen pairs of identities at test time given only a source and driving frame. CycleGAN is trained on pairs of identities, so if there are too few example images, it fails to correctly model the shape and geometry of the source face, producing unrealistic results. Additionally, our results have better temporal coherence (i.e. consistent background/hair style/etc. across generated frames), as X2Face transforms a given frame whereas CycleGAN samples from a latent space.

Comparison to Averbuch-Elor et al. [1]. We compare to [1] in Fig. 5. There are two significant advantages of our formulation over theirs. First, we can handle larger pose changes in the driving video and source frame (Fig. 5b, c). Second, ours makes fewer assumptions: (1) [1] assumes that the first frame of the driving video is in a frontal pose with a neutral expression and that the source frame also has a neutral expression (Fig. 5d); (2) X2Face can be used when given a single driving frame, whereas their method requires a video so that the face can be tracked, the tracking being used to expand the number of correspondences and to obtain high level details.

While this is not the focus of this paper, our method can be augmented with ideas from these methods. For example, inspired by [1], we can perform simple post-processing to add higher level details (Fig. 5a, X2Face + p.p.) by transferring hidden regions using Poisson editing [28].

Fig. 5. Comparison of X2Face to supervised methods. In comparison to [1]: X2Face matches (b) pitch, and (c) roll and yaw; and X2Face can handle non-neutral expressions in the source frame (d). As with other methods, post-processing (X2Face + p.p.) can be applied to add higher level details (a).

5.3 Controlling the Image Generation with Pose

Before reporting results on controlling the image generation with pose, we validate our claim that the driving vector does indeed encode pose. To do this, we evaluate how accurately the three head pose angles – yaw, pitch and roll – can be predicted from the 128D driving vector.

Pose Predictor. To train the pose predictor, which also serves as \(f_{v \rightarrow p}\) (Sect. 4.1), the 25,993 images in the AFLW dataset [19] are split into train/val sets, leaving out the 1,000 test images from [22] as the test set. The results on the test set are reported in Table 2, confirming that the driving vector learns about head pose without having been trained on pose labels, as the results are comparable to those of a network directly trained for this task.

We then use \(f_{v \rightarrow p}\) to train \(f_{p \rightarrow v}\) (Sect. 4.1) and present generated frames for different, unseen test identities using the learnt mappings in Fig. 6. The source frame corresponds to \(p_{source}\) in Sect. 4.1, while \(p_{driving}\) is used to vary one head pose angle while keeping the others fixed.

Fig. 6. Controlling image generation with pose code vectors. Results are shown for a single source frame which is controlled using each of the three head pose angles for the same identity (top three rows) and for different identities (bottom three rows). For further results and a video animation, we refer to the supplementary material. Whilst some artefacts are visible, the method allows the head pose angles to be controlled separately.

Table 2. MAE in degrees using the driving vector for head pose regression (lower is better). Note that the linear pose predictor from the driving vector performs only slightly worse than a supervised method [22], which has been trained for this task

5.4 Controlling the Image Generation with Audio Input

This section presents qualitative results for using audio data from videos in the VoxCeleb dataset to drive the source frames. The VoxCeleb dataset consists of videos of interviews, suggesting that the audio should be especially correlated with the movements of the mouth. [9]’s model, trained on the BBC-Oxford ‘Lip Reading in the Wild’ dataset (LRW), is used to extract audio features. We use the 256D vector activations of the last fully connected layer of the audio stream (FC7) for a 0.2s audio signal centred on the driving frame (the frame occurs half way through the 0.2s audio signal).

A potential source of error is the domain gap between the LRW dataset and VoxCeleb, as [9]'s model is not fine-tuned on the VoxCeleb dataset, which contains much more background noise than the LRW dataset. Thus, their model has not necessarily learnt to be invariant to this noise. However, our model is relatively robust to this problem; we observe that the mouth movements in the generated frames are reasonably close to what we would expect from the sounds of the corresponding audio, as demonstrated in Fig. 7. This is true even if the person in the video is not speaking and instead the audio is coming from an interviewer. However, there is some jitter in the generation.

6 Using the Embedded Face for Video Editing

We consider how the embedded face can be used for video editing. This idea is inspired by the concept of an unwrapped mosaic [31]. We expect the embedded face to be pose and expression invariant, as can be seen qualitatively across the example embedded faces shown in the paper. Therefore, the embedded face can be considered as a UV texture map of the face and drawn on directly.

Fig. 7. Controlling image generation with audio information. We show how the same sounds affect various source frames; if our model is working well then the generated mouths should behave similarly. (a) shows the source frames. (b) shows the generated frames for a given audio sound which is visualised in (d) by the coloured portion of the word being spoken. As most of the change is expected to be in the mouth region, the cropped mouth regions are additionally visualised in (c). The audio comes from a native British speaker. As can be seen, in all generated frames, the mouths are more closed at the “ve” and “I” and more open at the “E” and “U”. Another interesting point is that for the “Effects” frame, the audio is actually coming from an interviewer, so while the frame corresponding to the audio has a closed mouth, the generated results still open the mouth. (Color figure online)

Fig. 8. Example results of the video editing application. (a) For given source frames, the embedded face is extracted and modified. (b) The modified embedded face is used for a sequence of driving frames (bottom) and the result is shown (top). Note how for the second example, the blue tattoo disappears behind the nose when the person is seen in profile and how, as above, the modified embedded face can be driven using the same or another identity’s pose and expression. Best seen in colour. Zoom in for details. Additional examples using the blue tattoo and Harry Potter scar are given in the supplementary video and pdf. (Color figure online)

This task is executed as follows. A source frame (or set of source frames) is extracted and input to the embedding network to obtain the embedded face. The embedded face can then be drawn on using an image or other interactive tool. A video is reconstructed using the modified embedded face, which is driven by a set of driving frames. Because the embedded face is stable across different identities, a given edit can be applied across different identities. Example edits are shown in Fig. 8 and in the supplementary material.
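
A sketch of this editing pipeline, reusing the helpers from the earlier sketches, is given below; the alpha-blended `decoration` overlay and all names are illustrative assumptions, not the paper's implementation.

```python
import torch

def edit_and_drive(model, source_frames, decoration, driving_frames):
    """Paint an edit onto the embedded face, then drive it with a sequence of frames.

    decoration: (B, 4, H, W) RGBA overlay aligned to the embedded-face coordinates.
    """
    embedded = averaged_embedded_face(model, source_frames)
    alpha = decoration[:, 3:4]                              # blend the edit onto the "texture map"
    edited = (1 - alpha) * embedded + alpha * decoration[:, :3]
    outputs = []
    for driving in driving_frames:
        flow = model.driving_decoder(model.driving_encoder(driving))
        outputs.append(warp_with_flow(edited, flow))
    return outputs
```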

7 Conclusion

We have presented X2Face, a self-supervised framework for driving face generation using another face. This framework makes no assumptions about the pose, expression, or identity of the input images, so it is more robust in unconstrained settings (e.g. for an unseen identity). The framework can also be used, with minimal alteration post training, to drive a face using audio or head pose information. Finally, the trained model can be used as a video editing tool. All of this is achieved without requiring annotations for head pose, facial landmarks, or depth data. Instead, the model is trained in a self-supervised manner on a large collection of videos and learns to model the different factors of variation itself.

While our method is robust, versatile, and allows the generation to be conditioned on other modalities, the generation quality is not as high as that of approaches specifically designed for transforming faces (e.g. [1, 17, 40]). This opens an interesting avenue of research: how can the approach be modified so that its versatility, robustness, and self-supervision are retained while matching the generation quality of these face-specific methods? Finally, as no assumption has been made that the videos are of faces, it would be interesting to apply our approach to other domains.